Multi-node training error #945

Open
zhangfan-algo opened this issue May 16, 2024 · 0 comments
Describe the bug
2024-05-16 14:19:20 [W socket.cpp:697] [c10d] The IPv6 network addresses of (zf-yi1-5-34b-sft-0516-02-master-0, 23456) cannot be retrieved (gai error: -2 - Name or service not known).
2024-05-16 14:19:35 Traceback (most recent call last):
  File "/mnt/pfs/zhangfan/study_info/swift_0516/examples/pytorch/llm/llm_sft.py", line 2, in <module>
    import custom
  File "/mnt/pfs/zhangfan/study_info/swift_0516/examples/pytorch/llm/custom.py", line 5, in <module>
    from modelscope import AutoConfig, AutoModelForCausalLM, AutoTokenizer, MsDataset
  File "/mnt/pfs/zhangfan/system/test/conda/envs/swift/lib/python3.10/site-packages/modelscope/__init__.py", line 4, in <module>
    from modelscope.utils.import_utils import LazyImportModule
  File "/mnt/pfs/zhangfan/system/test/conda/envs/swift/lib/python3.10/site-packages/modelscope/utils/__init__.py", line 1, in <module>
    from .hub import create_model_if_not_exist, read_config
  File "/mnt/pfs/zhangfan/system/test/conda/envs/swift/lib/python3.10/site-packages/modelscope/utils/hub.py", line 12, in <module>
    from modelscope.utils.config import Config
  File "/mnt/pfs/zhangfan/system/test/conda/envs/swift/lib/python3.10/site-packages/modelscope/utils/config.py", line 19, in <module>
    from yapf.yapflib.yapf_api import FormatCode
  File "/mnt/pfs/zhangfan/system/test/conda/envs/swift/lib/python3.10/site-packages/yapf/__init__.py", line 41, in <module>
    from yapf.yapflib import yapf_api
  File "/mnt/pfs/zhangfan/system/test/conda/envs/swift/lib/python3.10/site-packages/yapf/yapflib/yapf_api.py", line 38, in <module>
    from yapf.pyparser import pyparser
  File "/mnt/pfs/zhangfan/system/test/conda/envs/swift/lib/python3.10/site-packages/yapf/pyparser/pyparser.py", line 44, in <module>
    from yapf.yapflib import format_token
  File "/mnt/pfs/zhangfan/system/test/conda/envs/swift/lib/python3.10/site-packages/yapf/yapflib/format_token.py", line 23, in <module>
    from yapf.pytree import pytree_utils
  File "/mnt/pfs/zhangfan/system/test/conda/envs/swift/lib/python3.10/site-packages/yapf/pytree/pytree_utils.py", line 30, in <module>
    from yapf_third_party._ylib2to3 import pygram
  File "/mnt/pfs/zhangfan/system/test/conda/envs/swift/lib/python3.10/site-packages/yapf_third_party/_ylib2to3/pygram.py", line 39, in <module>
    pattern_grammar = driver.load_grammar(_PATTERN_GRAMMAR_FILE)
  File "/mnt/pfs/zhangfan/system/test/conda/envs/swift/lib/python3.10/site-packages/yapf_third_party/_ylib2to3/pgen2/driver.py", line 252, in load_grammar
    g.load(gp)
  File "/mnt/pfs/zhangfan/system/test/conda/envs/swift/lib/python3.10/site-packages/yapf_third_party/_ylib2to3/pgen2/grammar.py", line 95, in load
    d = pickle.load(f)
EOFError: Ran out of input
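
For anyone trying to reproduce this: `EOFError: Ran out of input` is what `pickle.load` raises when the file it reads is empty. Truncated files fail in a similar way. The traceback shows the failing load is the grammar pickle read in `grammar.py`, so one thing worth checking is whether that file is empty or truncated on the failing node. Below is a minimal sketch of such a check. The function name `find_unreadable_pickles` and the directory to scan are assumptions for illustration, not part of yapf or swift. Only run it on directories you trust, since unpickling executes code.

```python
import pickle
from pathlib import Path


def find_unreadable_pickles(cache_dir):
    """Return (path, reason) pairs for *.pickle files that are empty or fail to load.

    Hypothetical helper: cache_dir is whichever directory holds the
    grammar pickle on the node being inspected.
    """
    bad = []
    for path in sorted(Path(cache_dir).rglob("*.pickle")):
        # A zero-length file is exactly the case that yields
        # "EOFError: Ran out of input" in pickle.load.
        if path.stat().st_size == 0:
            bad.append((path, "empty file"))
            continue
        try:
            with path.open("rb") as f:
                pickle.load(f)
        except (EOFError, pickle.UnpicklingError) as exc:
            # Truncated or otherwise corrupt pickle data.
            bad.append((path, f"{type(exc).__name__}: {exc}"))
    return bad
```

Calling `find_unreadable_pickles(some_dir)` on each node and comparing the results would show whether only some nodes see a bad file. If one turns up, deleting it so that it is regenerated is a plausible next step, though the issue itself does not confirm the cause.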

Your hardware and system info

torchrun --nproc_per_node ${num_gpu_per_node} --master_port $MASTER_PORT --master_addr $MASTER_ADDR --node_rank $RANK --nnodes $WORLD_SIZE examples/pytorch/llm/llm_sft.py \
    --model_cache_dir /mnt/pfs/zhangfan/models/01-ai/Yi-1.5-34B-Chat \
    --model_type yi-1_5-34b-chat \
    --sft_type full \
    --tuner_backend swift \
    --template_type AUTO \
    --output_dir output/test \
    --ddp_backend nccl \
    --custom_train_dataset_path train_classfiy.jsonl \
    --dataset_test_ratio 0.03 \
    --self_cognition_sample -1 \
    --preprocess_num_proc 60 \
    --dataloader_num_workers 60 \
    --train_dataset_sample -1 \
    --dataset_test_ratio 0.01 \
    --lr_scheduler_type cosine \
    --num_train_epochs 5 \
    --save_total_limit 10 \
    --save_strategy epoch \
    --evaluation_strategy steps \
    --eval_steps 50 \
    --logging_steps 10 \
    --batch_size 1 \
    --eval_batch_size 1 \
    --max_length 17000 \
    --check_dataset_strategy warning \
    --gradient_checkpointing true \
    --gradient_accumulation_steps 8 \
    --weight_decay 0.01 \
    --learning_rate 1e-5 \
    --max_grad_norm 0.5 \
    --warmup_ratio 0.03 \
    --use_flash_attn true \
    --push_to_hub false \
    --deepspeed_config_path ds_z2_config.json \
    --save_only_model false \
    --save_on_each_node false \
    --lazy_tokenize true \
    --lisa_activated_layers 8 \
    --lisa_step_interval 20 \
    --neftune_noise_alpha 10 \
    --dtype AUTO
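
Separately from the traceback, the `[c10d]` line in the log is a warning that no IPv6 address could be found for the host passed as `--master_addr`. It is a warning, not the exception that stopped the run. A rough sketch for checking, per address family, whether a host name resolves from a given node is shown below. The helper name `can_resolve` is an assumption for illustration, and the host and port to pass are whatever `$MASTER_ADDR` and `$MASTER_PORT` expand to on your cluster.

```python
import socket


def can_resolve(host, port, family):
    """Return True if getaddrinfo finds at least one address of the given family."""
    try:
        infos = socket.getaddrinfo(host, port, family=family, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # Covers failures such as "Name or service not known".
        return False
    return bool(infos)
```

Running `can_resolve(master_addr, master_port, socket.AF_INET)` and the same call with `socket.AF_INET6` on each node would show whether only the IPv6 lookup fails, which would match the warning.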
