Generating train split: 305869 examples [00:14, 20697.01 examples/s]
Converting format of dataset: 100%|██████████████████████████████████| 100000/100000 [00:04<00:00, 23519.41 examples/s]
Running tokenizer on dataset: 7%|██▋ | 7000/100000 [00:29<06:34, 235.63 examples/s]
Traceback (most recent call last):
File "C:\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "C:\LLaMA-Factory\venv\Scripts\llamafactory-cli.exe\__main__.py", line 7, in <module>
sys.exit(main())
File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\cli.py", line 33, in main
run_exp()
File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\train\tuner.py", line 41, in run_exp
run_orpo(model_args, data_args, training_args, finetuning_args, callbacks)
File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\train\orpo\workflow.py", line 29, in run_orpo
dataset = get_dataset(model_args, data_args, training_args, stage="rm", **tokenizer_module)
File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\data\loader.py", line 164, in get_dataset
dataset = dataset.map(preprocess_func, batched=True, remove_columns=column_names, **kwargs)
File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\arrow_dataset.py", line 602, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\arrow_dataset.py", line 567, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\arrow_dataset.py", line 3156, in map
for rank, done, content in Dataset._map_single(**dataset_kwargs):
File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\arrow_dataset.py", line 3547, in _map_single
batch = apply_function_on_filtered_inputs(
File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\arrow_dataset.py", line 3416, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\data\preprocess.py", line 242, in preprocess_pairwise_dataset
_, rejected_ids = template.encode_oneturn(
File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\data\template.py", line 45, in encode_oneturn
encoded_pairs = self._encode(tokenizer, messages, system, tools, cutoff_len, reserved_label_len)
File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\data\template.py", line 94, in _encode
elements += self.format_assistant.apply(content=message["content"])
File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\data\formatter.py", line 116, in apply
raise RuntimeError("Expected a string, got {}".format(value))
RuntimeError: Expected a string, got None
I did the following:
Removed empty strings, added blanks, converted every value to a string, removed any possible None values, and double-checked that only strings were left for every key.
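For reference, the cleaning pass I ran over the dataset looked roughly like this (a minimal sketch; `sanitize` is my own helper name, not part of LLaMA-Factory):

```python
import json

def sanitize(value):
    # Recursively replace None with "" and coerce every scalar leaf to str;
    # dicts and lists keep their shape so the DPO record structure survives.
    if value is None:
        return ""
    if isinstance(value, dict):
        return {key: sanitize(child) for key, child in value.items()}
    if isinstance(value, list):
        return [sanitize(child) for child in value]
    return str(value)

def clean_file(src, dst):
    # Load the raw JSON dataset, sanitize every record, write it back out.
    with open(src, encoding="utf-8") as f:
        records = json.load(f)
    with open(dst, "w", encoding="utf-8") as f:
        json.dump([sanitize(r) for r in records], f, ensure_ascii=False, indent=2)
```

Even after this pass the error persisted, which is why I suspected the loader rather than the file.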
I split the data into two halves, but it still consistently stops at 7% during the "Running tokenizer on dataset" step.
In the end, I created a workaround by modifying the preprocess_pairwise_dataset function in preprocess.py to handle cases where message["content"] is None. If message["content"] is None, it is replaced with an empty string for both chosen_messages and rejected_messages.
Reminder
Reproduction
5e6f808 format DPO Dataset
name_dpo.json
dataset_info.json
Log
A simple JSON check script for LLaMA-Factory would be useful, as I cannot identify the offending line or format issue when using the LLaMA-Factory dataset loader.

Edit 1:
I've printed all the None values. Only the rejected_messages "content" causes issues: 1-2 were None, and 300+ others are math/code characters (e.g. !is_none) or "content" that is not part of my dataset...?!
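A validator along these lines is what I have in mind (my own sketch, not an existing LLaMA-Factory tool; it assumes the dataset is a top-level JSON array of records):

```python
import json

def find_non_strings(value, path="$"):
    # Walk a JSON value and yield (json_path, value) for every leaf
    # that is not a string, e.g. ("$.rejected.content", None).
    if isinstance(value, dict):
        for key, child in value.items():
            yield from find_non_strings(child, f"{path}.{key}")
    elif isinstance(value, list):
        for i, child in enumerate(value):
            yield from find_non_strings(child, f"{path}[{i}]")
    elif not isinstance(value, str):
        yield path, value

def check_dataset(filename):
    # Report the record index and JSON path of every non-string leaf,
    # so the bad line can be found before the tokenizer crashes.
    with open(filename, encoding="utf-8") as f:
        records = json.load(f)
    problems = []
    for i, record in enumerate(records):
        for path, value in find_non_strings(record):
            problems.append((i, path, value))
    return problems
```

Running this against name_dpo.json before training would have pointed at the exact records whose rejected "content" was None.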