
I get an AssertionError while loading the dataset. How can I fix it? I am using the glm3 model and the model itself loads fine; I have traced the failure to the statement dataset = preprocess_dataset(dataset, tokenizer, data_args, training_args, "sft") and cannot debug any further. #256

Open
hongWin opened this issue Apr 12, 2024 · 5 comments

Comments


hongWin commented Apr 12, 2024

04/12/2024 10:26:38 - INFO - dbgpt_hub.llm_base.adapter - Fine-tuning method: LoRA
04/12/2024 10:26:39 - INFO - dbgpt_hub.llm_base.load_tokenizer - trainable params: 15597568 || all params: 6259181568 || trainable%: 0.2492
Running tokenizer on dataset: 0%| | 0/8659 [00:00<?, ? examples/s]
Traceback (most recent call last):
  File "D:\text2sql\DB-GPT-Hub-main\DB-GPT-Hub-main\run_sft.py", line 79, in <module>
    start_sft(train_args)
  File "D:\text2sql\DB-GPT-Hub-main\DB-GPT-Hub-main\dbgpt_hub\train\sft_train_api.py", line 43, in start_sft
    sft_train.train(args)
  File "D:\text2sql\DB-GPT-Hub-main\DB-GPT-Hub-main\dbgpt_hub\train\sft_train.py", line 144, in train
    run_sft(
  File "D:\text2sql\DB-GPT-Hub-main\DB-GPT-Hub-main\dbgpt_hub\train\sft_train.py", line 53, in run_sft
    dataset = preprocess_dataset(dataset, tokenizer, data_args, training_args, "sft")
  File "D:\text2sql\DB-GPT-Hub-main\DB-GPT-Hub-main\dbgpt_hub\data_process\data_utils.py", line 810, in preprocess_dataset
    dataset = dataset.map(
  File "D:\Anaconda3\envs\chatsql2\lib\site-packages\datasets\arrow_dataset.py", line 593, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "D:\Anaconda3\envs\chatsql2\lib\site-packages\datasets\arrow_dataset.py", line 558, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "D:\Anaconda3\envs\chatsql2\lib\site-packages\datasets\arrow_dataset.py", line 3105, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "D:\Anaconda3\envs\chatsql2\lib\site-packages\datasets\arrow_dataset.py", line 3482, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "D:\Anaconda3\envs\chatsql2\lib\site-packages\datasets\arrow_dataset.py", line 3361, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "D:\text2sql\DB-GPT-Hub-main\DB-GPT-Hub-main\dbgpt_hub\data_process\data_utils.py", line 664, in preprocess_supervised_dataset
    for source_ids, target_ids in template.encode_multiturn(
  File "D:\text2sql\DB-GPT-Hub-main\DB-GPT-Hub-main\dbgpt_hub\configs\data_args.py", line 270, in encode_multiturn
    encoded_pairs = self._encode(tokenizer, system, history)
  File "D:\text2sql\DB-GPT-Hub-main\DB-GPT-Hub-main\dbgpt_hub\configs\data_args.py", line 321, in _encode
    prefix_ids = self._convert_inputs_to_ids(
  File "D:\text2sql\DB-GPT-Hub-main\DB-GPT-Hub-main\dbgpt_hub\configs\data_args.py", line 368, in _convert_inputs_to_ids
    token_ids = token_ids + tokenizer.encode(elem, **kwargs)
  File "D:\Anaconda3\envs\chatsql2\lib\site-packages\transformers\tokenization_utils_base.py", line 2600, in encode
    encoded_inputs = self.encode_plus(
  File "D:\Anaconda3\envs\chatsql2\lib\site-packages\transformers\tokenization_utils_base.py", line 3008, in encode_plus
    return self._encode_plus(
  File "D:\Anaconda3\envs\chatsql2\lib\site-packages\transformers\tokenization_utils.py", line 722, in _encode_plus
    return self.prepare_for_model(
  File "D:\Anaconda3\envs\chatsql2\lib\site-packages\transformers\tokenization_utils_base.py", line 3487, in prepare_for_model
    encoded_inputs = self.pad(
  File "D:\Anaconda3\envs\chatsql2\lib\site-packages\transformers\tokenization_utils_base.py", line 3292, in pad
    encoded_inputs = self._pad(
  File "C:\Users\PC\.cache\huggingface\modules\transformers_modules\glm3_Parameter\tokenization_chatglm.py", line 271, in _pad
    assert self.padding_side == "left"
AssertionError
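
(For context: the assertion is raised by ChatGLM3's custom tokenization_chatglm.py, whose _pad() only supports left padding, while the SFT pipeline configures the tokenizer for right padding. A minimal sketch that reproduces the error outside DB-GPT-Hub, assuming the stock THUDM/chatglm3-6b tokenizer stands in for the local glm3_Parameter checkpoint:)

```python
from transformers import AutoTokenizer

# trust_remote_code pulls in the model's custom tokenization_chatglm.py
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
tokenizer.padding_side = "right"  # the side the SFT pipeline sets for training

# Any call that pads now reaches the custom _pad(), which begins with
# `assert self.padding_side == "left"`, producing the AssertionError above.
tokenizer("SELECT 1;", padding="max_length", max_length=16)
```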

@tomorrow-zy

I have the same problem.


Kudou-Chitose commented May 22, 2024

@tomorrow-zy In dbgpt_hub/llm_base/load_tokenizer.py, line 179, change "right" to "left".
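
(A sketch of what that suggested edit amounts to; the exact statement at line 179 of load_tokenizer.py may differ, and model_path here is a placeholder for whatever the repo actually passes in:)

```python
from transformers import AutoTokenizer

model_path = "THUDM/chatglm3-6b"  # placeholder; the repo supplies its own model args
tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    trust_remote_code=True,
    padding_side="left",  # was "right"; ChatGLM3's _pad() only accepts left padding
)
```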

tomorrow-zy commented May 22, 2024 via email

After training with this change, inference sometimes produces inf values. I am not sure whether that is related to this fix.

@Kudou-Chitose

> After training with this change, inference sometimes produces inf values. I am not sure whether that is related to this fix.

Looking at the code, the author left the comment # training with left-padded tensors in fp16 precision may cause overflow. If you have enough resources, you can try fp32, or truncate overly long sequences with max_length.
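
(Hedged sketches of the two mitigations mentioned above, written against the standard transformers API; DB-GPT-Hub's own training config presumably exposes equivalent switches:)

```python
from transformers import AutoTokenizer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
tokenizer.padding_side = "left"

# Option 1: train in full fp32 (fp16/bf16 off); avoids the overflow at the cost of memory.
training_args = TrainingArguments(output_dir="out", fp16=False, bf16=False)

# Option 2: truncate overlong sequences to a max_length, as suggested above.
ids = tokenizer.encode("SELECT * FROM orders WHERE ...", truncation=True, max_length=1024)
```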

@tomorrow-zy

> Looking at the code, the author left the comment # training with left-padded tensors in fp16 precision may cause overflow. If you have enough resources, you can try fp32, or truncate overly long sequences with max_length.

OK, thanks.
