Configuration-based use of HF hub-hosted datasets for training #701

Open
wants to merge 17 commits into
base: main

Conversation

chimezie
Contributor

@chimezie chimezie commented Apr 20, 2024

Per the title, this adds a structured hf_dataset YAML configuration parameter for specifying an HF hub-hosted dataset (via name) to use for training. It supports the datasets library's local file system caching, named splits, named configurations (via configuration), and split slicing syntax for specifying the train, validation, and test datasets.

The dataset feature names can also be specified: either the attributes or keys that hold the prompt and completion values (via prompt_feature and completion_feature), or, for datasets with a single plain-text value, the text feature (via text_feature).

Example of the added YAML parameters (train on the first 1K records of the train split and validate with the last 100; no test dataset):

hf_dataset:
  name: "billsum"
  train_split: "train[:1000]"
  valid_split: "train[-100:]"
  prompt_feature: "text"
  completion_feature: "summary"

See: Splits and Configurations, billsum, & HF Dataset API
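
As a rough illustration of how these parameters might map onto the HF datasets API, here is a minimal sketch. The helper `hf_load_kwargs` is hypothetical and not taken from this PR. It assumes only that `datasets.load_dataset` accepts `path`, `name` (the dataset configuration), and `split`, and that the PR's `configuration` key is meant to be passed as `name`.

```python
from typing import Any, Dict, Optional


def hf_load_kwargs(hf_dataset: Dict[str, Any], split_key: str) -> Optional[Dict[str, Any]]:
    """Build keyword arguments for datasets.load_dataset from an hf_dataset config.

    Hypothetical helper for illustration only. Returns None when the
    requested split (e.g. "test_split") is not configured.
    """
    split = hf_dataset.get(split_key)
    if split is None:
        return None
    kwargs = {"path": hf_dataset["name"], "split": split}
    # Assumption: the `configuration` key maps to load_dataset's `name` argument.
    if "configuration" in hf_dataset:
        kwargs["name"] = hf_dataset["configuration"]
    return kwargs


# Same values as the YAML example above, as a plain dict.
config = {
    "name": "billsum",
    "train_split": "train[:1000]",
    "valid_split": "train[-100:]",
    "prompt_feature": "text",
    "completion_feature": "summary",
}

train_kwargs = hf_load_kwargs(config, "train_split")
valid_kwargs = hf_load_kwargs(config, "valid_split")
test_kwargs = hf_load_kwargs(config, "test_split")  # None: no test split configured

# With the `datasets` package installed, a split could then be loaded with:
#   from datasets import load_dataset
#   train = load_dataset(**train_kwargs)
```

The split strings are passed through untouched, so the slicing itself is left to the datasets library rather than reimplemented.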

@chimezie
Contributor Author

Motivated by the need to reproduce #620 with an open dataset.

@chimezie chimezie changed the title Support for configuration-based use of HF hub-hosted datasets for training Configuration-based use of HF hub-hosted datasets for training Apr 21, 2024
@@ -99,6 +205,6 @@ def load_dataset(args, tokenizer: PreTrainedTokenizer):
)
if args.test and len(test) == 0:
raise ValueError(
"Test set not found or empty. Must provide test set for evaluation."
"Test set not found or empty. Must provide test set for ev aluation."
Member


Suggested change
"Test set not found or empty. Must provide test set for ev aluation."
"Test set not found or empty. Must provide test set for evaluation."

https://huggingface.co/docs/datasets/en/access
"""

def __init__(self, hf_dataset, tokenizer, text_feature):
Member


You can make text_feature configurable in the regular dataset, default it to text and then just reuse that class.

Same comment for the completions dataset.

Member

@awni awni left a comment


This is very nice, opens up a lot of datasets without too much complexity. I'd like to simplify the dataset classes (see inline comment). After that, I think we should merge it.

…pletionsDataset instead, adding a text_key constructor argument to the former (and changing it to work with a provided data structure instead of just from a JSON file), and prompt_key and completion_key arguments to the latter with defaults for backwards compatibility.
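
A minimal sketch of the simplification described above, assuming both classes wrap an in-memory sequence of records. This is not the PR's actual code: tokenization is omitted, the class bodies are illustrative, and the `"prompt"` and `"completion"` defaults are assumptions (only the `text` default is stated in the review).

```python
from typing import Any, Dict, List, Tuple


class Dataset:
    """Plain-text dataset over already-loaded records (not tied to a JSON file)."""

    def __init__(self, data: List[Dict[str, Any]], text_key: str = "text"):
        self._data = data
        self._text_key = text_key

    def __getitem__(self, idx: int) -> str:
        return self._data[idx][self._text_key]

    def __len__(self) -> int:
        return len(self._data)


class CompletionsDataset:
    """Prompt/completion dataset with configurable record keys."""

    def __init__(
        self,
        data: List[Dict[str, Any]],
        prompt_key: str = "prompt",          # assumed default
        completion_key: str = "completion",  # assumed default
    ):
        self._data = data
        self._prompt_key = prompt_key
        self._completion_key = completion_key

    def __getitem__(self, idx: int) -> Tuple[str, str]:
        record = self._data[idx]
        return record[self._prompt_key], record[self._completion_key]

    def __len__(self) -> int:
        return len(self._data)


# Made-up records shaped like the billsum features used in the YAML example.
records = [{"text": "A bill to amend ...", "summary": "Amends ..."}]

text_ds = Dataset(records)  # relies on the default text_key="text"
pair_ds = CompletionsDataset(records, prompt_key="text", completion_key="summary")
```

Because the keys are constructor arguments, the same two classes can serve local JSON records and hub-hosted datasets alike, which is the reuse the review asks for.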

2 participants