DPO format - Expected a string, got {}".format(value), got None #3555

Closed · 1 task done
Katehuuh opened this issue May 3, 2024 · 1 comment
Labels: wontfix (This will not be worked on)

Comments

Katehuuh (Contributor) commented May 3, 2024

Reminder

  • I have read the README and searched the existing issues.

Reproduction

Commit 5e6f808 ("format DPO Dataset")

name_dpo.json:

[
  {
    "instruction": "Last Question",
    "input": "",
    "output": [
      "Last Question, chosen answer",
      "Last Question, rejected answer"
    ],
    "history": [
      [
        "Hello",
        "Hello!"
      ],
      [
        "Describe ... ",
        "Answer2"
      ]
    ]
  },
...
]

dataset_info.json:

  "name_dpo": {
    "file_name": "name_dpo.json",
    "columns": {
      "prompt": "instruction",
      "query": "input",
      "response": "output",
      "history": "history"
    },
    "ranking": true
  },
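
For context, with "ranking": true, LLaMA-Factory treats output[0] as the chosen answer and output[1] as the rejected one, and flattens history plus instruction into an odd-length message list. Roughly like this (a hedged sketch of the conversion, not the library's actual code):

# Sketch (assumption): how one example above becomes the pairwise messages
# that preprocess_pairwise_dataset (shown later) expects.
example = {
    "instruction": "Last Question",
    "input": "",
    "output": ["Last Question, chosen answer", "Last Question, rejected answer"],
    "history": [["Hello", "Hello!"], ["Describe ... ", "Answer2"]],
}

prompt = []
for user_msg, assistant_msg in example["history"]:
    prompt += [{"role": "user", "content": user_msg},
               {"role": "assistant", "content": assistant_msg}]
# History pairs plus the final instruction give an odd-length prompt,
# matching the `len(...) % 2 != 1` sanity check in the function below.
prompt.append({"role": "user", "content": example["instruction"] + example["input"]})

chosen = prompt + [{"role": "assistant", "content": example["output"][0]}]
rejected = prompt + [{"role": "assistant", "content": example["output"][1]}]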
Log
Generating train split: 305869 examples [00:14, 20697.01 examples/s]
Converting format of dataset: 100%|██████████████████████████████████| 100000/100000 [00:04<00:00, 23519.41 examples/s]
Running tokenizer on dataset:   7%|██▋                                   | 7000/100000 [00:29<06:34, 235.63 examples/s]
Traceback (most recent call last):
  File "C:\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\LLaMA-Factory\venv\Scripts\llamafactory-cli.exe\__main__.py", line 7, in <module>
    sys.exit(main())
  File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\cli.py", line 33, in main
    run_exp()
  File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\train\tuner.py", line 41, in run_exp
    run_orpo(model_args, data_args, training_args, finetuning_args, callbacks)
  File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\train\orpo\workflow.py", line 29, in run_orpo
    dataset = get_dataset(model_args, data_args, training_args, stage="rm", **tokenizer_module)
  File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\data\loader.py", line 164, in get_dataset
    dataset = dataset.map(preprocess_func, batched=True, remove_columns=column_names, **kwargs)
  File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\arrow_dataset.py", line 602, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\arrow_dataset.py", line 3156, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\arrow_dataset.py", line 3547, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\arrow_dataset.py", line 3416, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\data\preprocess.py", line 242, in preprocess_pairwise_dataset
    _, rejected_ids = template.encode_oneturn(
  File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\data\template.py", line 45, in encode_oneturn
    encoded_pairs = self._encode(tokenizer, messages, system, tools, cutoff_len, reserved_label_len)
  File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\data\template.py", line 94, in _encode
    elements += self.format_assistant.apply(content=message["content"])
  File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\data\formatter.py", line 116, in apply
    raise RuntimeError("Expected a string, got {}".format(value))
RuntimeError: Expected a string, got None
I did the following:
  • Removed empty strings and added blanks, converted every value to a string, removed possible None values, and double-checked that only strings were left for every key.
  • Split the data into two halves, but it still consistently stops at 7% during the "Running tokenizer on dataset" step.

In the end, I created a workaround by modifying the preprocess_pairwise_dataset function in preprocess.py to handle cases where message["content"] is None: any None content is replaced with a placeholder string before encoding (a variant that also covers chosen_messages is sketched after the function).
preprocess_pairwise_dataset
def preprocess_pairwise_dataset(
    examples: Dict[str, List[Any]],
    template: "Template",
    tokenizer: "PreTrainedTokenizer",
    processor: Optional["ProcessorMixin"],
    data_args: "DataArguments",
) -> Dict[str, List[List[int]]]:
    # build input pairs with format `<bos> X`, `Y1 <eos>` and `Y2 <eos>`
    model_inputs = {"prompt_ids": [], "chosen_ids": [], "rejected_ids": []}
    if processor is not None:
        model_inputs["pixel_values"] = []
        preprocess_visual_inputs = partial(_preprocess_visual_inputs, processor=processor)

    for i in range(len(examples["prompt"])):
        if len(examples["prompt"][i]) % 2 != 1 or len(examples["response"][i]) < 2:
            continue

        if processor is not None:
            examples["prompt"][i][0]["content"] = "<image>" + examples["prompt"][i][0]["content"]

        chosen_messages = examples["prompt"][i] + [examples["response"][i][0]]
        rejected_messages = examples["prompt"][i] + [examples["response"][i][1]]

        # Guard: the formatter raises "Expected a string, got None" on None content,
        # so replace any None with the placeholder string "null" before encoding.
        for message in rejected_messages:
            if message["content"] is None:
                message["content"] = "null"

        prompt_ids, chosen_ids = template.encode_oneturn(
            tokenizer,
            chosen_messages,
            examples["system"][i],
            examples["tools"][i],
            data_args.cutoff_len,
            data_args.reserved_label_len,
        )
        _, rejected_ids = template.encode_oneturn(
            tokenizer,
            rejected_messages,
            examples["system"][i],
            examples["tools"][i],
            data_args.cutoff_len,
            data_args.reserved_label_len,
        )

        if template.efficient_eos:
            chosen_ids += [tokenizer.eos_token_id]
            rejected_ids += [tokenizer.eos_token_id]

        model_inputs["prompt_ids"].append(prompt_ids)
        model_inputs["chosen_ids"].append(chosen_ids)
        model_inputs["rejected_ids"].append(rejected_ids)
        if processor is not None:
            model_inputs["pixel_values"].append(preprocess_visual_inputs(examples["images"][i]))

    return model_inputs
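
The guard above only touches rejected_messages; since chosen entries could in principle carry None as well, a symmetric guard may be worth adding. A minimal, untested sketch (the empty-string fallback is an assumption, not LLaMA-Factory behavior):

# Hypothetical variant: guard both message lists before encoding.
for message in chosen_messages + rejected_messages:
    if message["content"] is None:
        message["content"] = ""  # or the "null" placeholder used above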

A simple JSON validation script for LLaMA-Factory would be useful, as I cannot identify the offending line or format issue when using the LLaMA-Factory dataset loader.
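
Something like the following could serve as that pre-flight check: a minimal sketch assuming the alpaca-style ranking format shown above (the file name check_dpo_json.py and the exact checks are assumptions, not part of LLaMA-Factory):

# check_dpo_json.py - hypothetical pre-flight check for an alpaca-style DPO dataset.
import json
import sys

def check(path: str) -> int:
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    errors = 0
    for i, ex in enumerate(data):
        # "output" must hold two strings: [chosen, rejected].
        out = ex.get("output")
        if not (isinstance(out, list) and len(out) >= 2
                and all(isinstance(s, str) for s in out[:2])):
            print(f"example {i}: bad 'output': {out!r}")
            errors += 1
        # Every history turn must be a [user, assistant] pair of strings.
        for j, turn in enumerate(ex.get("history") or []):
            if not (isinstance(turn, list) and len(turn) == 2
                    and all(isinstance(s, str) for s in turn)):
                print(f"example {i}: bad history turn {j}: {turn!r}")
                errors += 1
        # "instruction" and "input" should be plain strings when present.
        for key in ("instruction", "input"):
            if key in ex and not isinstance(ex[key], str):
                print(f"example {i}: '{key}' is {type(ex[key]).__name__}, expected str")
                errors += 1
    return errors

if __name__ == "__main__":
    sys.exit(1 if check(sys.argv[1]) else 0)

Running python check_dpo_json.py name_dpo.json before training would have flagged the null rejected answers directly, instead of failing mid-tokenization.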

Edit 1:
I've printed all the None values. Only rejected_messages "content" causes issues; one or two looked like:

    "output": [
      "Answer1 ....",
      null
    ],

The 300+ others are math/code characters (e.g. !is_none) or "content" that is not part of my dataset...?!

@hiyouga added the "pending (This problem is yet to be addressed.)" label on May 3, 2024
Katehuuh (Contributor, Author) commented May 7, 2024

Not fixed, but given my (complex) RLHF dataset of 10 GB+, it turns out that ORPO LoRA with a rank of 32 cannot handle it.

@Katehuuh closed this as completed on May 7, 2024
@hiyouga added the "wontfix (This will not be worked on)" label and removed the "pending" label on May 7, 2024
@hiyouga closed this as not planned on May 7, 2024