DPO format - Expected a string, got {}".format(value), got None #3555

Closed · 1 task done
Katehuuh opened this issue May 3, 2024 · 1 comment
Labels: wontfix (This will not be worked on)

Comments

Katehuuh (Contributor) commented May 3, 2024

Reminder

  • I have read the README and searched the existing issues.

Reproduction

Commit 5e6f808 ("format DPO Dataset")

name_dpo.json:

[
  {
    "instruction": "Last Question",
    "input": "",
    "output": [
      "Last Question, chosen answer",
      "Last Question, rejected answer"
    ],
    "history": [
      [
        "Hello",
        "Hello!"
      ],
      [
        "Describe ... ",
        "Answer2"
      ]
    ]
  },
...
]

dataset_info.json:

  "name_dpo": {
    "file_name": "name_dpo.json",
    "columns": {
      "prompt": "instruction",
      "query": "input",
      "response": "output",
      "history": "history"
    },
    "ranking": true
  },
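
For context, with "ranking": true, LLaMA-Factory treats output[0] as the chosen answer and output[1] as the rejected one, and flattens history plus instruction into an odd-length message list. Roughly like this (a hedged sketch of the conversion, not the library's actual code):

# Sketch (assumption): how one example above becomes the pairwise messages
# that preprocess_pairwise_dataset (shown later) expects.
example = {
    "instruction": "Last Question",
    "input": "",
    "output": ["Last Question, chosen answer", "Last Question, rejected answer"],
    "history": [["Hello", "Hello!"], ["Describe ... ", "Answer2"]],
}

prompt = []
for user_msg, assistant_msg in example["history"]:
    prompt += [{"role": "user", "content": user_msg},
               {"role": "assistant", "content": assistant_msg}]
# History pairs plus the final instruction give an odd-length prompt,
# matching the `len(...) % 2 != 1` sanity check in the function below.
prompt.append({"role": "user", "content": example["instruction"] + example["input"]})

chosen = prompt + [{"role": "assistant", "content": example["output"][0]}]
rejected = prompt + [{"role": "assistant", "content": example["output"][1]}]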
Log
Generating train split: 305869 examples [00:14, 20697.01 examples/s]
Converting format of dataset: 100%|██████████████████████████████████| 100000/100000 [00:04<00:00, 23519.41 examples/s]
Running tokenizer on dataset:   7%|██▋                                   | 7000/100000 [00:29<06:34, 235.63 examples/s]
Traceback (most recent call last):
  File "C:\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\LLaMA-Factory\venv\Scripts\llamafactory-cli.exe\__main__.py", line 7, in <module>
    sys.exit(main())
  File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\cli.py", line 33, in main
    run_exp()
  File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\train\tuner.py", line 41, in run_exp
    run_orpo(model_args, data_args, training_args, finetuning_args, callbacks)
  File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\train\orpo\workflow.py", line 29, in run_orpo
    dataset = get_dataset(model_args, data_args, training_args, stage="rm", **tokenizer_module)
  File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\data\loader.py", line 164, in get_dataset
    dataset = dataset.map(preprocess_func, batched=True, remove_columns=column_names, **kwargs)
  File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\arrow_dataset.py", line 602, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\arrow_dataset.py", line 3156, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\arrow_dataset.py", line 3547, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\arrow_dataset.py", line 3416, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\data\preprocess.py", line 242, in preprocess_pairwise_dataset
    _, rejected_ids = template.encode_oneturn(
  File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\data\template.py", line 45, in encode_oneturn
    encoded_pairs = self._encode(tokenizer, messages, system, tools, cutoff_len, reserved_label_len)
  File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\data\template.py", line 94, in _encode
    elements += self.format_assistant.apply(content=message["content"])
  File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\data\formatter.py", line 116, in apply
    raise RuntimeError("Expected a string, got {}".format(value))
RuntimeError: Expected a string, got None
I did the following:
  • Removed empty strings and added blanks, converted every value to a string, removed possible None values, and double-checked that only strings were left for every key.
  • Split the data into two halves, but it still consistently stops at 7% during the "Running tokenizer on dataset" step.

In the end, I created a workaround by modifying the preprocess_pairwise_dataset function in preprocess.py to handle cases where message["content"] is None: any None content is replaced with a placeholder string before encoding (a variant that also covers chosen_messages is sketched after the function).
preprocess_pairwise_dataset
def preprocess_pairwise_dataset(
    examples: Dict[str, List[Any]],
    template: "Template",
    tokenizer: "PreTrainedTokenizer",
    processor: Optional["ProcessorMixin"],
    data_args: "DataArguments",
) -> Dict[str, List[List[int]]]:
    # build input pairs with format `<bos> X`, `Y1 <eos>` and `Y2 <eos>`
    model_inputs = {"prompt_ids": [], "chosen_ids": [], "rejected_ids": []}
    if processor is not None:
        model_inputs["pixel_values"] = []
        preprocess_visual_inputs = partial(_preprocess_visual_inputs, processor=processor)

    for i in range(len(examples["prompt"])):
        if len(examples["prompt"][i]) % 2 != 1 or len(examples["response"][i]) < 2:
            continue

        if processor is not None:
            examples["prompt"][i][0]["content"] = "<image>" + examples["prompt"][i][0]["content"]

        chosen_messages = examples["prompt"][i] + [examples["response"][i][0]]
        rejected_messages = examples["prompt"][i] + [examples["response"][i][1]]

        # Guard: the formatter raises "Expected a string, got None" on None content,
        # so replace any None with the placeholder string "null" before encoding.
        for message in rejected_messages:
            if message["content"] is None:
                message["content"] = "null"

        prompt_ids, chosen_ids = template.encode_oneturn(
            tokenizer,
            chosen_messages,
            examples["system"][i],
            examples["tools"][i],
            data_args.cutoff_len,
            data_args.reserved_label_len,
        )
        _, rejected_ids = template.encode_oneturn(
            tokenizer,
            rejected_messages,
            examples["system"][i],
            examples["tools"][i],
            data_args.cutoff_len,
            data_args.reserved_label_len,
        )

        if template.efficient_eos:
            chosen_ids += [tokenizer.eos_token_id]
            rejected_ids += [tokenizer.eos_token_id]

        model_inputs["prompt_ids"].append(prompt_ids)
        model_inputs["chosen_ids"].append(chosen_ids)
        model_inputs["rejected_ids"].append(rejected_ids)
        if processor is not None:
            model_inputs["pixel_values"].append(preprocess_visual_inputs(examples["images"][i]))

    return model_inputs
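
The guard above only touches rejected_messages; since chosen entries could in principle carry None as well, a symmetric guard may be worth adding. A minimal, untested sketch (the empty-string fallback is an assumption, not LLaMA-Factory behavior):

# Hypothetical variant: guard both message lists before encoding.
for message in chosen_messages + rejected_messages:
    if message["content"] is None:
        message["content"] = ""  # or the "null" placeholder used above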

A simple JSON validation script for LLaMA-Factory would be useful, as I cannot identify the offending line or format issue when using the LLaMA-Factory dataset loader.
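
Something like the following could serve as that pre-flight check: a minimal sketch assuming the alpaca-style ranking format shown above (the file name check_dpo_json.py and the exact checks are assumptions, not part of LLaMA-Factory):

# check_dpo_json.py - hypothetical pre-flight check for an alpaca-style DPO dataset.
import json
import sys

def check(path: str) -> int:
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    errors = 0
    for i, ex in enumerate(data):
        # "output" must hold two strings: [chosen, rejected].
        out = ex.get("output")
        if not (isinstance(out, list) and len(out) >= 2
                and all(isinstance(s, str) for s in out[:2])):
            print(f"example {i}: bad 'output': {out!r}")
            errors += 1
        # Every history turn must be a [user, assistant] pair of strings.
        for j, turn in enumerate(ex.get("history") or []):
            if not (isinstance(turn, list) and len(turn) == 2
                    and all(isinstance(s, str) for s in turn)):
                print(f"example {i}: bad history turn {j}: {turn!r}")
                errors += 1
        # "instruction" and "input" should be plain strings when present.
        for key in ("instruction", "input"):
            if key in ex and not isinstance(ex[key], str):
                print(f"example {i}: '{key}' is {type(ex[key]).__name__}, expected str")
                errors += 1
    return errors

if __name__ == "__main__":
    sys.exit(1 if check(sys.argv[1]) else 0)

Running python check_dpo_json.py name_dpo.json before training would have flagged the null rejected answers directly, instead of failing mid-tokenization.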

Edit 1:
I've printed all the None values. Only rejected_messages "content" causes issues; one or two looked like:

    "output": [
      "Answer1 ....",
      null
    ],

The 300+ others are math/code characters (e.g. !is_none) or "content" that is not part of my dataset...?!

@hiyouga added the "pending (This problem is yet to be addressed.)" label on May 3, 2024
Katehuuh (Contributor, Author) commented May 7, 2024

Not fixed, but given my (complex) RLHF dataset of 10 GB+, it turns out that ORPO LoRA with a rank of 32 cannot handle it.

@Katehuuh closed this as completed on May 7, 2024
@hiyouga added the "wontfix (This will not be worked on)" label and removed the "pending" label on May 7, 2024
@hiyouga closed this as not planned on May 7, 2024