fix: Image Feature in Datasets Library Fails to Handle bytearray Objects from Spark DataFrames (#7517) #7521

Open: giraffacarp wants to merge 2 commits into main

@giraffacarp commented Apr 15, 2025

Task

Support bytes-like objects (bytes and bytearray) in Features classes

Description

The Features classes accept only bytes objects for binary data and reject bytearray. This causes errors when using IterableDataset.from_spark(), because Spark DataFrames contain bytearray objects, even though both bytes and bytearray are valid bytes-like objects in Python.

Changes

  • Updated Features classes to accept both bytes and bytearray types for binary data fields.
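
A minimal sketch of what accepting both types can look like. The function name `encode_binary_example` and the returned dict layout are hypothetical and chosen for illustration; this is not the code in the datasets library or in this PR.

```python
from typing import Union


def encode_binary_example(value: Union[str, bytes, bytearray]) -> dict:
    """Hypothetical encoder: normalize a path or a bytes-like value to a dict."""
    if isinstance(value, str):
        # A string is treated as a file path; no raw bytes are stored.
        return {"path": value, "bytes": None}
    if isinstance(value, (bytes, bytearray)):
        # bytes(value) turns a bytearray into an immutable bytes object
        # and yields an equal bytes object for bytes input.
        return {"path": None, "bytes": bytes(value)}
    raise TypeError(f"Unsupported type for binary data: {type(value).__name__}")


# A bytearray fails a bytes-only check, which is the root of the reported error.
payload = bytearray(b"\x89PNG\r\n\x1a\n")
print(isinstance(payload, bytes))               # False
print(isinstance(payload, (bytes, bytearray)))  # True
print(encode_binary_example(payload))
```

Normalizing to bytes at the boundary is one possible design: downstream code then only ever sees an immutable value, at the cost of one copy per bytearray input.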

Reasoning

  • bytes and bytearray serve the same purpose for binary data; the only difference is that bytearray is mutable and bytes is not.
  • Modifying the Spark iterator to convert bytearray to bytes would be a workaround, not a true fix. I think the correct solution is to accept all bytes-like objects as input.
  • This approach is more robust and future-proof, since Python 3.12+ provides a standard way to check whether an object supports the buffer protocol (collections.abc.Buffer).
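
The buffer-protocol check mentioned above can be sketched as follows. The helper name `is_bytes_like` is hypothetical, and the fallback for Python versions before 3.12 is an assumption about how one might approximate the check; it is not taken from this PR.

```python
import sys

if sys.version_info >= (3, 12):
    # collections.abc.Buffer was added in Python 3.12 (PEP 688).
    from collections.abc import Buffer

    def is_bytes_like(value: object) -> bool:
        """Return True if value supports the buffer protocol."""
        return isinstance(value, Buffer)

else:

    def is_bytes_like(value: object) -> bool:
        """Fallback: try to take a memoryview, which requires a buffer."""
        try:
            # The context manager releases the buffer export right away.
            with memoryview(value):
                return True
        except TypeError:
            return False


print(is_bytes_like(b"abc"))             # True
print(is_bytes_like(bytearray(b"abc")))  # True
print(is_bytes_like("abc"))              # False
```

Note that this check is broader than `isinstance(value, (bytes, bytearray))`: objects such as memoryview and array.array also pass it, so a caller that needs real bytes would still convert with `bytes(value)`.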

Testing

  • Added tests to cover bytearray inputs for image features.

Related Issues

@giraffacarp
Author

@lhoestq let me know if you would prefer that I change the Spark iterator so it outputs bytes instead.

Development

Successfully merging this pull request may close these issues.

Image Feature in Datasets Library Fails to Handle bytearray Objects from Spark DataFrames
1 participant