Extraction protocol for arrow files is not defined #6905

Open
radulescupetru opened this issue May 17, 2024 · 0 comments

Describe the bug

Passing files with the .arrow extension to the data_files argument is very slow, at least when streaming=True.

Steps to reproduce the bug

Basically, each file goes through the _get_extraction_protocol method.
The method first checks a list of known base extensions. arrow is not in that list, so it falls back to detecting the compression from the file's magic number, which is slow when there are many files stored in S3. Looking at the predefined list below, I don't see arrow in there either, so in the end the method returns None:

MAGIC_NUMBER_TO_COMPRESSION_PROTOCOL = {
    bytes.fromhex("504B0304"): "zip",
    bytes.fromhex("504B0506"): "zip",  # empty archive
    bytes.fromhex("504B0708"): "zip",  # spanned archive
    bytes.fromhex("425A68"): "bz2",
    bytes.fromhex("1F8B"): "gzip",
    bytes.fromhex("FD377A585A00"): "xz",
    bytes.fromhex("04224D18"): "lz4",
    bytes.fromhex("28B52FFD"): "zstd",
}
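
To illustrate why this fallback costs one read per file, here is a minimal sketch of a magic-number check built on the table above. The function name `sniff_compression` and the constant `MAX_MAGIC_LENGTH` are hypothetical and this is not the library's actual implementation; it only assumes that the check reads the first few bytes of each file and compares prefixes against the table.

```python
import io
from typing import BinaryIO, Optional

# Same table as quoted in the issue.
MAGIC_NUMBER_TO_COMPRESSION_PROTOCOL = {
    bytes.fromhex("504B0304"): "zip",
    bytes.fromhex("504B0506"): "zip",  # empty archive
    bytes.fromhex("504B0708"): "zip",  # spanned archive
    bytes.fromhex("425A68"): "bz2",
    bytes.fromhex("1F8B"): "gzip",
    bytes.fromhex("FD377A585A00"): "xz",
    bytes.fromhex("04224D18"): "lz4",
    bytes.fromhex("28B52FFD"): "zstd",
}

# Hypothetical helper constant: the longest magic number in the table.
MAX_MAGIC_LENGTH = max(len(magic) for magic in MAGIC_NUMBER_TO_COMPRESSION_PROTOCOL)


def sniff_compression(f: BinaryIO) -> Optional[str]:
    """Return the compression protocol for an open binary file, or None."""
    # This read is the expensive part: for remote storage it means
    # opening the object and fetching bytes, once per file.
    header = f.read(MAX_MAGIC_LENGTH)
    f.seek(0)
    # Try the longest prefix first, then progressively shorter ones.
    for length in range(MAX_MAGIC_LENGTH, 0, -1):
        protocol = MAGIC_NUMBER_TO_COMPRESSION_PROTOCOL.get(header[:length])
        if protocol is not None:
            return protocol
    return None


# A gzip header matches; placeholder uncompressed bytes match nothing.
gzip_like = io.BytesIO(bytes.fromhex("1F8B") + b"\x08\x00\x00\x00")
uncompressed_like = io.BytesIO(b"not compressed")
print(sniff_compression(gzip_like))
print(sniff_compression(uncompressed_like))
```

For an uncompressed file the result is None either way, so the read is wasted work whenever the extension alone could have answered the question.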

Expected behavior

I would expect arrow to be in the list of known extensions, so that the method returns None right away without going through the magic-number check.
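
A rough sketch of the behavior I have in mind, assuming the extension check runs before any file is opened. All names here (`KNOWN_UNCOMPRESSED_EXTENSIONS`, `get_extraction_protocol`, the `sniff` callback) are hypothetical and only stand in for the library's internals; the extension set is an illustrative subset, not the library's real list.

```python
import posixpath
from typing import Callable, Optional

# Hypothetical list of extensions known to be uncompressed, with "arrow" added.
KNOWN_UNCOMPRESSED_EXTENSIONS = {"csv", "json", "parquet", "arrow"}


def get_extraction_protocol(
    path: str, sniff: Callable[[str], Optional[str]]
) -> Optional[str]:
    """Return the compression protocol for `path`, or None if uncompressed.

    `sniff` is a stand-in for the slow magic-number check that has to
    open the file; it is only called when the extension is unknown.
    """
    name = posixpath.basename(path)
    extension = posixpath.splitext(name)[1].lstrip(".").lower()
    if extension in KNOWN_UNCOMPRESSED_EXTENSIONS:
        # Known uncompressed format: no need to open the file at all.
        return None
    return sniff(path)


# Record which paths would have triggered a remote read.
opened = []


def fake_sniff(path: str) -> Optional[str]:
    opened.append(path)
    return None


get_extraction_protocol("s3://my-bucket/train-00000.arrow", fake_sniff)
get_extraction_protocol("s3://my-bucket/blob.bin", fake_sniff)
print(opened)  # only the .bin path reaches the slow check
```

With arrow in the known list, the number of remote reads no longer grows with the number of .arrow files passed in data_files.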

Environment info

datasets 2.19.0
