Configuration-based use of HF hub-hosted datasets for training #701
base: main
Conversation
Motivated by the need to reproduce #620 with an open dataset.
llms/mlx_lm/tuner/datasets.py (outdated)

@@ -99,6 +205,6 @@ def load_dataset(args, tokenizer: PreTrainedTokenizer):
        )
    if args.test and len(test) == 0:
        raise ValueError(
-           "Test set not found or empty. Must provide test set for evaluation."
+           "Test set not found or empty. Must provide test set for ev aluation."

Suggested change:
-           "Test set not found or empty. Must provide test set for ev aluation."
+           "Test set not found or empty. Must provide test set for evaluation."
llms/mlx_lm/tuner/datasets.py (outdated)

        https://huggingface.co/docs/datasets/en/access
        """

    def __init__(self, hf_dataset, tokenizer, text_feature):

You can make `text_feature` configurable in the regular dataset, default it to `text`, and then just reuse that class. Same comment for the completions dataset.
This is very nice, opens up a lot of datasets without too much complexity. I'd like to simplify the dataset classes (see inline comment). After that, I think we should merge it.
…pletionsDataset instead, adding a `text_key` constructor argument to the former (and changing it to work with a provided data structure instead of just a JSON file), and `prompt_key` and `completion_key` arguments to the latter, with defaults for backwards compatibility.
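A minimal sketch of what the refactor described above might look like. The class and argument names (`Dataset`, `CompletionsDataset`, `text_key`, `prompt_key`, `completion_key`) follow the comments in this thread; everything else (record access, return shapes, the absence of tokenization) is illustrative and simplified relative to the actual `mlx_lm` implementation.

```python
class Dataset:
    """Plain-text dataset over any sequence of dict-like records
    (a list of dicts or an HF `datasets.Dataset` both work).
    `text_key` defaults to "text" for backwards compatibility."""

    def __init__(self, data, text_key="text"):
        self._data = data
        self._text_key = text_key

    def __getitem__(self, idx):
        # Look up the configured text feature on the record.
        return self._data[idx][self._text_key]

    def __len__(self):
        return len(self._data)


class CompletionsDataset:
    """Prompt/completion dataset with configurable feature names."""

    def __init__(self, data, prompt_key="prompt", completion_key="completion"):
        self._data = data
        self._prompt_key = prompt_key
        self._completion_key = completion_key

    def __getitem__(self, idx):
        record = self._data[idx]
        return record[self._prompt_key], record[self._completion_key]

    def __len__(self):
        return len(self._data)
```

Because feature access is parameterized, the same classes can serve JSON-file data and hub-hosted datasets whose features use different names (e.g. `prompt_key="question"`, `completion_key="answer"`).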
Per the title, allow a structured `hf_dataset` YAML configuration parameter for specifying an HF hub-hosted dataset (via `name`) to use with training, along with the ability to use the datasets library's local file-system caching, named splits, named configurations (via `configuration`), split-slicing syntax for specifying train, validation, and test datasets, etc. The dataset feature names that correspond to attributes or keys for prompt/completion value pairs, or for single pure-text values (via `text_feature` in that case), can be specified.

Added YAML parameters, for example (train on the first 1K examples in the train split and validate with the last 100; no test data set):
See: Splits and Configurations, billsum, & HF Dataset API