Issue with the fine-tuning loss of bge-reranker-v2-minicpm-layerwise being 1 #792

Open
sevenandseven opened this issue May 16, 2024 · 9 comments

@sevenandseven

CUDA_VISIBLE_DEVICES=6,7 torchrun --nproc_per_node 2 \
-m FlagEmbedding.llm_reranker.finetune_for_layerwise.run \
--output_dir ./results/reranker/bge-reranker-v2-minicpm-layerwise \
--model_name_or_path /media/ai/HDD/Teamwork/LLM_Embedding_model/Embedding/Embedding/bge-reranker-v2-minicpm-layerwise \
--train_data /media/ai/HDD/Teamwork/wangenzhi/FlagEmbedding-master/official/FlagEmbedding/fine_data/layer_reranker.jsonl \
--learning_rate 6e-5 \
--fp16 \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 4 \
--dataloader_drop_last True \
--query_max_len 64 \
--passage_max_len 256 \
--train_group_size 2 \
--logging_steps 10 \
--save_steps 10 \
--save_total_limit 10 \
--warmup_ratio 0.1 \
--use_lora True \
--lora_rank 32 \
--lora_alpha 64 \
--use_flash_attn False \
--target_modules q_proj k_proj v_proj o_proj \
--start_layer 8 \
--head_multi True \
--head_type simple \
--lora_extra_parameters linear_head

When fine-tuning with the command above, the training log shows 'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0. Is there anything I can try to resolve this?

@545999961
Collaborator

Is the loss 0 from the very start, or does it become 0 during training?
If it becomes 0 during training, you can try switching fp16 to bf16.
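
A minimal sketch of that change, assuming the run script exposes the standard HuggingFace TrainingArguments (which accept --bf16 in place of --fp16, provided the GPUs support bfloat16): in the command above, replace

--fp16 \

with

--bf16 \

and leave every other flag unchanged.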

@sevenandseven
Author

sevenandseven commented May 17, 2024

Is the loss 0 from the very start, or does it become 0 during training? If it becomes 0 during training, you can try switching fp16 to bf16.

It is 0 from the very start. After I changed the learning rate to 2e-7, the loss became very large, starting at several hundred and gradually decreasing. With the learning rate set to 2e-4 the same thing happens, as shown below.
warnings.warn(
{'loss': 514.8239, 'grad_norm': 538.07861328125, 'learning_rate': 0.00019964850615114237, 'epoch': 0.0}
{'loss': 466.7511, 'grad_norm': 598.7839965820312, 'learning_rate': 0.00019929701230228473, 'epoch': 0.0}
{'loss': 321.3008, 'grad_norm': 395.62548828125, 'learning_rate': 0.0001989455184534271, 'epoch': 0.01}
{'loss': 266.1515, 'grad_norm': 460.9885559082031, 'learning_rate': 0.00019859402460456943, 'epoch': 0.01}
{'loss': 234.468, 'grad_norm': 616.3043212890625, 'learning_rate': 0.00019824253075571176, 'epoch': 0.01}
{'loss': 261.4581, 'grad_norm': 646.939208984375, 'learning_rate': 0.00019789103690685413, 'epoch': 0.01}
{'loss': 198.6534, 'grad_norm': 580.589599609375, 'learning_rate': 0.0001975395430579965, 'epoch': 0.01}
{'loss': 174.0701, 'grad_norm': 360.647216796875, 'learning_rate': 0.00019718804920913885, 'epoch': 0.01}
{'loss': 187.0631, 'grad_norm': 375.31640625, 'learning_rate': 0.0001968365553602812, 'epoch': 0.02}
{'loss': 155.5526, 'grad_norm': 310.011474609375, 'learning_rate': 0.00019648506151142357, 'epoch': 0.02}

@545999961
Collaborator

When training, add the parameter --finetune_type (from_raw_model or from_finetuned_model).
The learning rate should be 2e-4. Set --deepspeed stage1.json and see whether the problem above still occurs.
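
For reference, a minimal ZeRO stage-1 config of the kind the --deepspeed flag expects. This is a sketch only: the key names follow the standard DeepSpeed / HuggingFace Trainer integration, "auto" lets the Trainer fill in values from its own arguments, and the file name stage1.json is simply whatever path you pass on the command line:

cat > stage1.json << 'EOF'
{
  "zero_optimization": { "stage": 1 },
  "fp16": { "enabled": "auto" },
  "bf16": { "enabled": "auto" },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto",
  "gradient_clipping": "auto"
}
EOF

The extra flags would then be appended to the original command, passing a single value for --finetune_type (either from_raw_model or from_finetuned_model), e.g.:

--finetune_type from_finetuned_model \
--deepspeed stage1.json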

@sevenandseven
Author

When training, add the parameter --finetune_type (from_raw_model or from_finetuned_model). The learning rate should be 2e-4. Set --deepspeed stage1.json and see whether the problem above still occurs.

I added this parameter and still have the same problem; the loss is still very large.

@545999961
Collaborator

When training, add the parameter --finetune_type (from_raw_model or from_finetuned_model). The learning rate should be 2e-4. Set --deepspeed stage1.json and see whether the problem above still occurs.

I added this parameter and still have the same problem; the loss is still very large.

Does it still start from around 500 and decrease?

@sevenandseven
Author

When training, add the parameter --finetune_type (from_raw_model or from_finetuned_model). The learning rate should be 2e-4. Set --deepspeed stage1.json and see whether the problem above still occurs.

I added this parameter and still have the same problem; the loss is still very large.

Does it still start from around 500 and decrease?

Yes. For --finetune_type, do I choose one of from_raw_model and from_finetuned_model? Passing both at the same time produces the following error.

File "/media/ai/HDD/Teamwork/wangenzhi/FlagEmbedding-master/official/FlagEmbedding/FlagEmbedding/llm_reranker/finetune_for_layerwise/run.py", line 23, in main
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
File "/home/ai/anaconda3/envs/langchain_chatchat/lib/python3.10/site-packages/transformers/hf_argparser.py", line 347, in parse_args_into_dataclasses
raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['from_finetuned_model']
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 54498) of binary: /home/ai/anaconda3/envs/langchain_chatchat/bin/python

@545999961
Collaborator

It is one of the two; from_finetuned_model is fine. The final loss is the sum of the losses over all layers, so it looks relatively large.
You can compare the loss when fine-tuning starting from the reranker with the loss when fine-tuning directly from the raw model. If the loss starting from the already fine-tuned model is much smaller than the loss starting from the raw model, then there is no problem.
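
Two concrete notes on the reply above (the layer count here is an assumption for illustration, not taken from the FlagEmbedding code): the flag takes a single value, e.g. --finetune_type from_finetuned_model, which is why passing both values triggered the HfArgumentParser error earlier. And on the loss scale: if every exit layer from --start_layer 8 up to the final layer contributes its own cross-entropy term, the logged value is roughly loss_total = sum over layers of loss_layer, so with on the order of 30 contributing layers an average per-layer loss of about 2 already shows up as roughly 60 in the log.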

@sevenandseven
Author

It is one of the two; from_finetuned_model is fine. The final loss is the sum of the losses over all layers, so it looks relatively large. You can compare the loss when fine-tuning starting from the reranker with the loss when fine-tuning directly from the raw model. If the loss starting from the already fine-tuned model is much smaller than the loss starting from the raw model, then there is no problem.

Hi, this is the loss curve when I fine-tune from the raw model:
warnings.warn(
{'loss': 514.8239, 'grad_norm': 538.07861328125, 'learning_rate': 0.00019964850615114237, 'epoch': 0.0}
{'loss': 466.7511, 'grad_norm': 598.7839965820312, 'learning_rate': 0.00019929701230228473, 'epoch': 0.0}
{'loss': 321.3008, 'grad_norm': 395.62548828125, 'learning_rate': 0.0001989455184534271, 'epoch': 0.01}
{'loss': 266.1515, 'grad_norm': 460.9885559082031, 'learning_rate': 0.00019859402460456943, 'epoch': 0.01}
{'loss': 234.468, 'grad_norm': 616.3043212890625, 'learning_rate': 0.00019824253075571176, 'epoch': 0.01}
{'loss': 261.4581, 'grad_norm': 646.939208984375, 'learning_rate': 0.00019789103690685413, 'epoch': 0.01}
{'loss': 198.6534, 'grad_norm': 580.589599609375, 'learning_rate': 0.0001975395430579965, 'epoch': 0.01}
{'loss': 174.0701, 'grad_norm': 360.647216796875, 'learning_rate': 0.00019718804920913885, 'epoch': 0.01}
{'loss': 187.0631, 'grad_norm': 375.31640625, 'learning_rate': 0.0001968365553602812, 'epoch': 0.02}
{'loss': 155.5526, 'grad_norm': 310.011474609375, 'learning_rate': 0.00019648506151142357, 'epoch': 0.02}
{'loss': 163.7957, 'grad_norm': 330.1082763671875, 'learning_rate': 0.0001961335676625659, 'epoch': 0.02}
{'loss': 134.966, 'grad_norm': 206.18638610839844, 'learning_rate': 0.00019578207381370827, 'epoch': 0.02}
{'loss': 112.1916, 'grad_norm': 181.63343811035156, 'learning_rate': 0.00019543057996485063, 'epoch': 0.02}
{'loss': 107.8773, 'grad_norm': 230.8634796142578, 'learning_rate': 0.00019507908611599297, 'epoch': 0.02}
{'loss': 109.3568, 'grad_norm': 197.41690063476562, 'learning_rate': 0.00019472759226713533, 'epoch': 0.03}
{'loss': 106.0037, 'grad_norm': 173.08645629882812, 'learning_rate': 0.0001943760984182777, 'epoch': 0.03}
{'loss': 89.0028, 'grad_norm': 117.9629898071289, 'learning_rate': 0.00019402460456942005, 'epoch': 0.03}
{'loss': 71.7148, 'grad_norm': 102.20894622802734, 'learning_rate': 0.0001936731107205624, 'epoch': 0.03}
{'loss': 77.9973, 'grad_norm': 114.82682037353516, 'learning_rate': 0.00019332161687170475, 'epoch': 0.03}
{'loss': 73.5833, 'grad_norm': 81.8402328491211, 'learning_rate': 0.0001929701230228471, 'epoch': 0.04}
{'loss': 72.5585, 'grad_norm': 98.48847198486328, 'learning_rate': 0.00019261862917398947, 'epoch': 0.04}
{'loss': 59.0707, 'grad_norm': 86.48011016845703, 'learning_rate': 0.0001922671353251318, 'epoch': 0.04}
{'loss': 62.4839, 'grad_norm': 79.3683090209961, 'learning_rate': 0.00019191564147627417, 'epoch': 0.04}
{'loss': 59.2522, 'grad_norm': 57.751861572265625, 'learning_rate': 0.00019156414762741653, 'epoch': 0.04}
{'loss': 44.6661, 'grad_norm': 58.833187103271484, 'learning_rate': 0.0001912126537785589, 'epoch': 0.04}
{'loss': 40.7933, 'grad_norm': 42.0254020690918, 'learning_rate': 0.00019086115992970125, 'epoch': 0.05}
{'loss': 50.1218, 'grad_norm': 45.612060546875, 'learning_rate': 0.00019050966608084359, 'epoch': 0.05}
{'loss': 42.3546, 'grad_norm': 34.3447151184082, 'learning_rate': 0.00019015817223198595, 'epoch': 0.05}
{'loss': 50.2781, 'grad_norm': 44.68593215942383, 'learning_rate': 0.0001898066783831283, 'epoch': 0.05}
{'loss': 46.8812, 'grad_norm': 53.49127960205078, 'learning_rate': 0.00018945518453427067, 'epoch': 0.05}
{'loss': 51.3008, 'grad_norm': 49.961978912353516, 'learning_rate': 0.000189103690685413, 'epoch': 0.05}
{'loss': 39.5842, 'grad_norm': 41.7624626159668, 'learning_rate': 0.00018875219683655537, 'epoch': 0.06}
{'loss': 33.7966, 'grad_norm': 25.9057674407959, 'learning_rate': 0.00018840070298769773, 'epoch': 0.06}
{'loss': 28.1447, 'grad_norm': 15.165142059326172, 'learning_rate': 0.00018804920913884006, 'epoch': 0.06}

And this is the loss curve when I start from the fine-tuned model:
{'loss': 57.9339, 'grad_norm': 155.48330688476562, 'learning_rate': 0.00019964850615114237, 'epoch': 0.0}
{'loss': 62.5916, 'grad_norm': 147.96661376953125, 'learning_rate': 0.00019929701230228473, 'epoch': 0.0}
{'loss': 88.7841, 'grad_norm': 180.3634490966797, 'learning_rate': 0.0001989455184534271, 'epoch': 0.01}
{'loss': 31.9065, 'grad_norm': 44.84025573730469, 'learning_rate': 0.00019859402460456943, 'epoch': 0.01}
{'loss': 33.4273, 'grad_norm': 46.82331848144531, 'learning_rate': 0.00019824253075571176, 'epoch': 0.01}
{'loss': 42.595, 'grad_norm': 60.91260528564453, 'learning_rate': 0.00019789103690685413, 'epoch': 0.01}
{'loss': 30.1122, 'grad_norm': 29.61505889892578, 'learning_rate': 0.0001975395430579965, 'epoch': 0.01}
{'loss': 33.3277, 'grad_norm': 31.189620971679688, 'learning_rate': 0.00019718804920913885, 'epoch': 0.01}
{'loss': 59.5543, 'grad_norm': 72.22501373291016, 'learning_rate': 0.0001968365553602812, 'epoch': 0.02}
{'loss': 49.1377, 'grad_norm': 60.78026580810547, 'learning_rate': 0.00019648506151142357, 'epoch': 0.02}
{'loss': 16.8538, 'grad_norm': 16.02872085571289, 'learning_rate': 0.0001961335676625659, 'epoch': 0.02}
{'loss': 51.4577, 'grad_norm': 49.48100662231445, 'learning_rate': 0.00019578207381370827, 'epoch': 0.02}
{'loss': 48.7724, 'grad_norm': 42.62297439575195, 'learning_rate': 0.00019543057996485063, 'epoch': 0.02}

There is roughly a 10x gap between the two; is that normal?
Does "raw model" refer to the pretrained large language model, and "fine-tuned model" to the layerwise reranker model fine-tuned from that large language model?

@545999961
Collaborator


Yes.
This is normal; just continue fine-tuning from the second setup (starting from the already fine-tuned model).
