Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support the deserialization of json lines files comprised of lists #6907

Open
umarbutler opened this issue May 18, 2024 · 1 comment
Open
Assignees
Labels
enhancement New feature or request

Comments

@umarbutler
Copy link

Feature request

I manage a somewhat large and popular Hugging Face dataset known as the Open Australian Legal Corpus. I recently updated my corpus to be stored in a json lines file where each line is an array and each element represents a value at a particular column. Previously, my corpus was stored as a json lines file where each line was a dictionary and the keys were the fields.

Essentially, a line in my json lines file used to look like this:

{"version_id":"","type":"","jurisdiction":"","source":"","citation":"","url":"","when_scraped":"","text":""}

And now it looks like this:

["","","","","","","",""]

This saves 65 bytes per document and allows me very quickly serialise and deserialise documents via msgspec.

After making this change, I found that datasets was incapable of deserialising my Corpus without a custom loading script, even if I ensured that the dataset_info field in my dataset card contained the desired names of my features.

I would like to request that functionality be added to support this format which is more memory-efficent and faster than using dictionaries.

Motivation

The documentation for creating dataset loading scripts asserts that:

In the next major release, the new safety features of 🤗 Datasets will disable running dataset loading scripts by default, and you will have to pass trust_remote_code=True to load datasets that require running a dataset script.

I would rather not require my users to pass trust_remote_code=True which means that I will need built-in support for this format.

Your contribution

I would be happy to submit a PR for this if this is something you would incorporate into datasets and if I can be pointed to where the code would need to go.

@umarbutler umarbutler added the enhancement New feature or request label May 18, 2024
@albertvillanova albertvillanova self-assigned this May 18, 2024
@umarbutler
Copy link
Author

Update: I ended up deciding to go back to use lines of dictionaries instead of arrays, not because of this issue as my users would be capable of downloading my corpus without datasets, but the speed and storage savings are not currently worth breaking my API and harming the backwards compatibility of each new revision.

With that said, for a static dataset that is not regularly updated like mine, and particularly for extremely large datasets with millions or billions of rows, using arrays could have a meaningful impact, and so there is probably still value in supporting this structure, provided the effort is not too much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants