Configuration-based use of HF hub-hosted datasets for training #701

chimezie · 2024-04-20T15:10:03Z

Per the title, this adds a structured hf_dataset YAML configuration parameter for specifying an HF hub-hosted dataset (via name) to use for training. It also makes it possible to use datasets' local file system caching, named splits, named configurations (via configuration), split slicing syntax for specifying the train, validation, and test datasets, etc.

The dataset feature names that hold the prompt and completion values can be specified (via prompt_feature and completion_feature). For datasets with a single pure-text value, the feature name can be specified via text_feature instead.

Added YAML parameters. For example, to train on the first 1K examples of the train split and validate with the last 100 (with no test dataset):

hf_dataset:
  name: "billsum"
  train_split: "train[:1000]"
  valid_split: "train[-100:]"
  prompt_feature: "text"
  completion_feature: "summary"

See: Splits and Configurations, billsum, & HF Dataset API
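To make the mapping from the YAML above to the Hugging Face datasets API concrete, here is a minimal sketch, not the code in this PR. It assumes the parsed hf_dataset block is a plain dict; split_kwargs and load_splits are hypothetical helper names, and the test_split key is an assumption by analogy with train_split and valid_split. The only library call used is datasets.load_dataset with its path, name, and split arguments.

```python
from typing import Any, Dict, List, Optional

# The example config from the description, as it would look after YAML parsing.
hf_dataset = {
    "name": "billsum",
    "train_split": "train[:1000]",
    "valid_split": "train[-100:]",
    "prompt_feature": "text",
    "completion_feature": "summary",
}


def split_kwargs(config: Dict[str, Any], split_key: str) -> Optional[Dict[str, Any]]:
    """Build keyword arguments for datasets.load_dataset.

    Returns None when the requested split is not configured.
    """
    split = config.get(split_key)
    if split is None:
        return None
    # The hub dataset name is passed as `path`; slicing syntax goes in `split`.
    kwargs = {"path": config["name"], "split": split}
    # The optional named configuration is passed as `name` in the datasets API.
    if config.get("configuration") is not None:
        kwargs["name"] = config["configuration"]
    return kwargs


def load_splits(config: Dict[str, Any]) -> List[Any]:
    """Load train, valid, and test; unconfigured splits become empty lists."""
    # Imported lazily so the helper above works without the package installed.
    from datasets import load_dataset

    splits = []
    for key in ("train_split", "valid_split", "test_split"):  # test_split is assumed
        kwargs = split_kwargs(config, key)
        splits.append(load_dataset(**kwargs) if kwargs else [])
    return splits
```

Calling load_splits(hf_dataset) would download (or read from the local cache) the two configured slices of billsum and return an empty list for the missing test split.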

chimezie · 2024-04-20T16:35:16Z

Motivated by the need to reproduce #620 with an open dataset.

awni · 2024-06-03T02:16:16Z

llms/mlx_lm/tuner/datasets.py

@@ -99,6 +205,6 @@ def load_dataset(args, tokenizer: PreTrainedTokenizer):
        )
    if args.test and len(test) == 0:
        raise ValueError(
-            "Test set not found or empty. Must provide test set for evaluation."
+            "Test set not found or empty. Must provide test set for ev aluation."


Suggested change

"Test set not found or empty. Must provide test set for ev aluation."

"Test set not found or empty. Must provide test set for evaluation."

awni · 2024-06-03T02:20:38Z

llms/mlx_lm/tuner/datasets.py

+    https://huggingface.co/docs/datasets/en/access
+    """
+
+    def __init__(self, hf_dataset, tokenizer, text_feature):


You can make text_feature configurable in the regular dataset, default it to text and then just reuse that class.

Same comment for the completions dataset.
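A rough sketch of what this suggestion could look like, not the code that was merged. It assumes the dataset classes wrap any indexable collection of records, and it leaves out tokenization and prompt templating. The argument names text_key, prompt_key, and completion_key follow the later commit message in this PR; the "prompt" and "completion" defaults are assumptions.

```python
from typing import Any, Dict, Sequence, Tuple


class Dataset:
    """Plain-text dataset over any indexable collection of records."""

    def __init__(self, data: Sequence[Dict[str, Any]], text_key: str = "text"):
        self._data = data
        self._text_key = text_key

    def __getitem__(self, idx: int) -> str:
        # Look up the configured feature instead of a hard-coded "text" key.
        return self._data[idx][self._text_key]

    def __len__(self) -> int:
        return len(self._data)


class CompletionsDataset:
    """Prompt/completion dataset with configurable feature names."""

    def __init__(
        self,
        data: Sequence[Dict[str, Any]],
        prompt_key: str = "prompt",  # assumed default
        completion_key: str = "completion",  # assumed default
    ):
        self._data = data
        self._prompt_key = prompt_key
        self._completion_key = completion_key

    def __getitem__(self, idx: int) -> Tuple[str, str]:
        record = self._data[idx]
        return record[self._prompt_key], record[self._completion_key]

    def __len__(self) -> int:
        return len(self._data)
```

With this shape, a hub dataset such as billsum could reuse the same class by passing prompt_key="text" and completion_key="summary", so no separate HF-specific dataset classes would be needed.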

awni

This is very nice, opens up a lot of datasets without too much complexity. I'd like to simplify the dataset classes (see inline comment). After that, I think we should merge it.

…pletionsDataset instead, adding a text_key constructor argument to the former (and changing it to work with a provided data structure instead of just from a JSON file), and prompt_key and completion_key arguments to the latter with defaults for backwards compatibility.

chimezie added 7 commits April 20, 2024 10:35

Add hf_dataset configuration for using HF hub-hosted datasets for (Q)…

7d3bdc9

…LoRA training

Pre-commit formatting

5bdb061

Merge branch 'ml-explore:main' into hf_datasets

be2271a

Fix YAML config example

3805cb0

Merge remote-tracking branch 'origin/hf_datasets' into hf_datasets

3b008d5

Print DS info

0db11ef

Include name

81fab48

chimezie changed the title ~~Support for configuration-based use of HF hub-hosted datasets for training~~ Configuration-based use of HF hub-hosted datasets for training Apr 21, 2024

chimezie added 8 commits April 22, 2024 10:24

Merge branch 'ml-explore:main' into hf_datasets

7483b50

Add hf_dataset parameter default

ce82b35

Merge remote-tracking branch 'origin/hf_datasets' into hf_datasets

3536b51

Merge branch 'ml-explore:main' into hf_datasets

5f18f58

Merge branch 'ml-explore:main' into hf_datasets

5ee28d1

Merge branch 'ml-explore:main' into hf_datasets

ec00033

Merge branch 'ml-explore:main' into hf_datasets

c6f0407

Merge branch 'ml-explore:main' into hf_datasets

22ff45a

awni reviewed Jun 3, 2024

chimezie added 2 commits June 7, 2024 21:10

Merge branch 'ml-explore:main' into hf_datasets

fa30cfd
