Fine-tuning internvl-v1.5 raises KeyError: 'input_ids' #951

Open
sunzx8 opened this issue May 17, 2024 · 7 comments
sunzx8 commented May 17, 2024

Describe the bug
[screenshot: KeyError: 'input_ids' traceback]

Command used:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 swift sft --model_type internvl-chat-v1_5 --model_id_or_path /dev/shm/shawn/hf_ms_model/InternVL-Chat-V1-5 --dataset /dev/shm/shawn/data/ftoy.jsonl --sft_type full

Data format:
{"query": "输出图片内容的markdown内容,如果有表格,则输出为html格式", "response": "```markdown\nAdaptive Quotient Filters\n\nConference '17, July 2017, Washington, DC, USA\n\n[34] Russell Housley, Warwick Ford, William Polk, and David Solo. 1999. Internet X.509 public key infrastructure certificate and CRL profile. Technical Report. M. Frans Kaashoek. 2002. The case for application-specific protocols. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP).", "images": ["/dev/shm/shawn/data/input/2405.10253v1/2405.10253v1-p16.png"]}

Your hardware and system info
8 * NVIDIA L20


sunzx8 (Author) commented May 17, 2024

I checked: the batch only contains the two image-related elements and has no input_ids.
[screenshots: debugger output showing the batch contents]
What could be causing this?

hjh0119 (Collaborator) commented May 20, 2024

The device map can be problematic with 8 GPUs; try 2 or 4 GPUs instead.
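For example, the same run restricted to 4 GPUs (a sketch that simply adapts the command from the issue; only CUDA_VISIBLE_DEVICES changes):

CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft --model_type internvl-chat-v1_5 --model_id_or_path /dev/shm/shawn/hf_ms_model/InternVL-Chat-V1-5 --dataset /dev/shm/shawn/data/ftoy.jsonl --sft_type full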

sunzx8 (Author) commented May 20, 2024

Hi, I traced it to max_length: why does the following error appear after I increase max_length from 2048 to 4096?
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

sunzx8 (Author) commented May 20, 2024

Also, how should I set things up to fine-tune across two machines with 16 GPUs in total?

hjh0119 (Collaborator) commented May 20, 2024

The CUDA error is likely OOM or a CUDA environment issue.

There is a multi-node, multi-GPU example in the README.
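For reference, a minimal two-node sketch in the style of the README's multi-node example (the environment variables NNODES, NODE_RANK, MASTER_ADDR, MASTER_PORT and NPROC_PER_NODE are assumed from that example and should be verified against the README; the address is a placeholder):

# on node 0 (master), reachable at e.g. 192.168.1.1
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NNODES=2 NODE_RANK=0 MASTER_ADDR=192.168.1.1 MASTER_PORT=29500 NPROC_PER_NODE=8 swift sft --model_type internvl-chat-v1_5 --model_id_or_path /dev/shm/shawn/hf_ms_model/InternVL-Chat-V1-5 --dataset /dev/shm/shawn/data/ftoy.jsonl --sft_type full

# on node 1
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NNODES=2 NODE_RANK=1 MASTER_ADDR=192.168.1.1 MASTER_PORT=29500 NPROC_PER_NODE=8 swift sft --model_type internvl-chat-v1_5 --model_id_or_path /dev/shm/shawn/hf_ms_model/InternVL-Chat-V1-5 --dataset /dev/shm/shawn/data/ftoy.jsonl --sft_type full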

sunzx8 (Author) commented May 20, 2024

> The CUDA error is likely OOM or a CUDA environment issue.
> There is a multi-node, multi-GPU example in the README.

Another question: with the LoRA fine-tuning setup you provided, the parameter summary shows that only a small fraction of the parameters are trainable, yet GPU memory usage is exactly the same as full-parameter fine-tuning. Does that mean the model was not actually switched to LoRA?
[screenshot: trainable-parameter summary]

The actual memory usage is 241 GB, the same as full-parameter fine-tuning on coco-mini.

sunzx8 (Author) commented May 20, 2024

The LoRA command I used:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 swift sft --model_type internvl-chat-v1_5 --model_id_or_path /dev/shm/shawn/hf_ms_model/InternVL-Chat-V1-5 --dataset coco-mini-en-2 --sft_type lora
