Feature request
I manage a somewhat large and popular Hugging Face dataset known as the Open Australian Legal Corpus. I recently updated my corpus to be stored in a JSON Lines file where each line is an array and each element represents the value of a particular column. Previously, my corpus was stored as a JSON Lines file where each line was a dictionary whose keys were the fields.
Essentially, each line in my JSON Lines file used to be a JSON object keyed by field name, and now it is a JSON array holding the same values in a fixed column order.
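To illustrate the difference, here is a minimal sketch using the standard library's `json` module; the field names and values are hypothetical, since the Corpus's real schema is not shown here:

```python
import json

# Hypothetical field names and values; the Corpus's real schema differs.
fields = ["citation", "jurisdiction", "text"]
doc = {
    "citation": "Act No. 1 of 1901",
    "jurisdiction": "commonwealth",
    "text": "An Act relating to ...",
}

# Old format: one JSON object per line, with the keys repeated in every row.
dict_line = json.dumps(doc)

# New format: one JSON array per line; the column order is fixed once
# (e.g. in the dataset card) rather than repeated in every row.
array_line = json.dumps([doc[field] for field in fields])

print(len(dict_line) - len(array_line))  # bytes saved for this document
```

The per-row saving is roughly the combined length of the keys plus their quoting and separators, which is why it compounds meaningfully over millions of rows.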
This saves 65 bytes per document and allows me to serialise and deserialise documents very quickly via `msgspec`.
After making this change, I found that `datasets` was incapable of deserialising my Corpus without a custom loading script, even when I ensured that the `dataset_info` field in my dataset card contained the desired names of my features.
I would like to request that functionality be added to support this format, which is more memory-efficient and faster than using dictionaries.
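For what it's worth, the mapping that built-in support would need to perform seems straightforward: zip each array line with the feature names declared in the dataset card. A minimal sketch, with hypothetical feature names:

```python
import json

# Hypothetical feature names, as they might be declared in the dataset
# card's dataset_info section; the Corpus's real schema differs.
feature_names = ["citation", "jurisdiction", "text"]

def rows_to_records(lines):
    """Turn JSON array lines back into {feature: value} records."""
    for line in lines:
        yield dict(zip(feature_names, json.loads(line)))

sample = ['["Act No. 1 of 1901", "commonwealth", "An Act relating to ..."]']
records = list(rows_to_records(sample))
```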
Motivation
The documentation for creating dataset loading scripts asserts that:
In the next major release, the new safety features of 🤗 Datasets will disable running dataset loading scripts by default, and you will have to pass trust_remote_code=True to load datasets that require running a dataset script.
I would rather not require my users to pass `trust_remote_code=True`, which means that I will need built-in support for this format.
Your contribution
I would be happy to submit a PR for this if it is something you would incorporate into `datasets` and if I can be pointed to where the code would need to go.
Update: I ended up deciding to go back to using lines of dictionaries instead of arrays. This was not because of this issue (my users would still be able to download my corpus without `datasets`), but because the speed and storage savings are not currently worth breaking my API and harming the backwards compatibility of each new revision.
With that said, for a static dataset like mine that is not regularly updated, and particularly for extremely large datasets with millions or billions of rows, using arrays could have a meaningful impact, so there is probably still value in supporting this structure, provided the effort is not too great.