
When launching distributed training with torchrun, each container spawns many training processes and training cannot proceed #638

Closed
apachemycat opened this issue May 4, 2024 · 3 comments

Comments

@apachemycat

Launch command:

```bash
torchrun /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py --launcher pytorch --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-pytorch-dist-elasiticjob-worker-1/train-work config.py
```

Process list inside one container:

```
root 24 1 2 01:27 ? 00:00:06 /usr/bin/python /usr/local/bin/torchrun /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py --launcher pytorch --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-pytorch-dist-elasiticjob-worker-1/train-work config.py
root 76 24 99 01:27 ? 00:07:19 /usr/bin/python -u /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py --launcher pytorch --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-pytorch-dist-elasiticjob-worker-1/train-work config.py
root 132 76 0 01:27 ? 00:00:00 /usr/bin/python -u /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py --launcher pytorch --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-pytorch-dist-elasiticjob-worker-1/train-work config.py
root 133 76 0 01:27 ? 00:00:00 /usr/bin/python -u /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py --launcher pytorch --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-pytorch-dist-elasiticjob-worker-1/train-work config.py
root 136 76 0 01:27 ? 00:00:00 /usr/bin/python -u /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py --launcher pytorch --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-pytorch-dist-elasiticjob-worker-1/train-work config.py
root 138 76 0 01:27 ? 00:00:00 /usr/bin/python -u /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py --launcher pytorch --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-pytorch-dist-elasiticjob-worker-1/train-work config.py
root 140 76 0 01:27 ? 00:00:00 /usr/bin/python -u /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py --launcher pytorch --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-pytorch-dist-elasiticjob-worker-1/train-work config.py
root 142 76 0 01:27 ? 00:00:00 /usr/bin/python -u /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py --launcher pytorch --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-pytorch-dist-elasiticjob-worker-1/train-work config.py
root 144 76 0 01:27 ? 00:00:00 /usr/bin/python -u /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py --launcher pytorch --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-pytorch-dist-elasiticjob-worker-1/train-work config.py
root 146 76 0 01:27 ? 00:00:00 /usr/bin/python -u /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py --launcher pytorch --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-pytorch-dist-elasiticjob-worker-1/train-work config.py
```

Log output:

```
train work dir .... /models/meta-Llama-3-8B-xtuner-trainer/demo-pytorch-dist-elasiticjob-worker-1/train-work
[2024-05-04 01:27:38,370] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
[2024-05-04 01:27:38,370] torch.distributed.launcher.api: [INFO] Starting elastic_operator with launch configs:
[2024-05-04 01:27:38,370] torch.distributed.launcher.api: [INFO] entrypoint : /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py
[2024-05-04 01:27:38,370] torch.distributed.launcher.api: [INFO] min_nodes : 2
[2024-05-04 01:27:38,370] torch.distributed.launcher.api: [INFO] max_nodes : 2
[2024-05-04 01:27:38,370] torch.distributed.launcher.api: [INFO] nproc_per_node : 1
[2024-05-04 01:27:38,370] torch.distributed.launcher.api: [INFO] run_id : none
[2024-05-04 01:27:38,370] torch.distributed.launcher.api: [INFO] rdzv_backend : c10d
[2024-05-04 01:27:38,370] torch.distributed.launcher.api: [INFO] rdzv_endpoint : demo-pytorch-dist-elasiticjob-worker-0:23456
[2024-05-04 01:27:38,370] torch.distributed.launcher.api: [INFO] rdzv_configs : {'timeout': 900}
[2024-05-04 01:27:38,370] torch.distributed.launcher.api: [INFO] max_restarts : 20
[2024-05-04 01:27:38,370] torch.distributed.launcher.api: [INFO] monitor_interval : 5
[2024-05-04 01:27:38,370] torch.distributed.launcher.api: [INFO] log_dir : None
[2024-05-04 01:27:38,370] torch.distributed.launcher.api: [INFO] metrics_cfg : {}
[2024-05-04 01:27:38,370] torch.distributed.launcher.api: [INFO]
[2024-05-04 01:27:38,390] torch.distributed.elastic.agent.server.local_elastic_agent: [INFO] log directory set to: /tmp/torchelastic_xdum05zp/none_serimfuw
[2024-05-04 01:27:38,390] torch.distributed.elastic.agent.server.api: [INFO] [default] starting workers for entrypoint: python
[2024-05-04 01:27:38,390] torch.distributed.elastic.agent.server.api: [INFO] [default] Rendezvous'ing worker group
[2024-05-04 01:27:39,071] torch.distributed.elastic.agent.server.api: [INFO] [default] Rendezvous complete for workers. Result:
[2024-05-04 01:27:39,071] torch.distributed.elastic.agent.server.api: [INFO] restart_count=0
[2024-05-04 01:27:39,071] torch.distributed.elastic.agent.server.api: [INFO] master_addr=demo-pytorch-dist-elasiticjob-worker-0
[2024-05-04 01:27:39,071] torch.distributed.elastic.agent.server.api: [INFO] master_port=57801
[2024-05-04 01:27:39,071] torch.distributed.elastic.agent.server.api: [INFO] group_rank=1
[2024-05-04 01:27:39,071] torch.distributed.elastic.agent.server.api: [INFO] group_world_size=2
[2024-05-04 01:27:39,071] torch.distributed.elastic.agent.server.api: [INFO] local_ranks=[0]
[2024-05-04 01:27:39,071] torch.distributed.elastic.agent.server.api: [INFO] role_ranks=[1]
[2024-05-04 01:27:39,071] torch.distributed.elastic.agent.server.api: [INFO] global_ranks=[1]
[2024-05-04 01:27:39,071] torch.distributed.elastic.agent.server.api: [INFO] role_world_sizes=[2]
[2024-05-04 01:27:39,071] torch.distributed.elastic.agent.server.api: [INFO] global_world_sizes=[2]
[2024-05-04 01:27:39,071] torch.distributed.elastic.agent.server.api: [INFO]
[2024-05-04 01:27:39,072] torch.distributed.elastic.agent.server.api: [INFO] [default] Starting worker group
[2024-05-04 01:27:39,073] torch.distributed.elastic.agent.server.local_elastic_agent: [INFO] Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.
[2024-05-04 01:27:39,074] torch.distributed.elastic.multiprocessing: [INFO] Setting worker0 reply file to: /tmp/torchelastic_xdum05zp/none_serimfuw/attempt_0/0/error.json
/usr/local/lib/python3.10/dist-packages/mmengine/utils/dl_utils/setup_env.py:46: UserWarning: Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
quantization_config convert to <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>
low_cpu_mem_usage was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00, 2.47s/it]
Did not find last_checkpoint to be resumed.
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:434: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
```

@pppppM
Collaborator

pppppM commented May 6, 2024

@apachemycat This may be because the torchrun command does not specify `master_port` and `nproc_per_node`.

You can refer to the following:

### torchrun
Note: `$NODE_0_ADDR` is the IP address of the node 0 machine.
```bash
# execute on node 0
NPROC_PER_NODE=8 NNODES=2 PORT=29600 ADDR=$NODE_0_ADDR NODE_RANK=0 xtuner train mixtral_8x7b_instruct_full_oasst1_e3 --deepspeed deepspeed_zero3
# execute on node 1
NPROC_PER_NODE=8 NNODES=2 PORT=29600 ADDR=$NODE_0_ADDR NODE_RANK=1 xtuner train mixtral_8x7b_instruct_full_oasst1_e3 --deepspeed deepspeed_zero3
```
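
For completeness, a minimal sketch of launching the same job with plain torchrun flags instead of the `xtuner` environment variables, assuming the setup shown in the logs above (2 nodes, 1 process per node, worker-0 as the c10d rendezvous host); the hostnames and paths are taken from this issue, and the `--rdzv_id` value is just a placeholder job name:

```bash
# Run in every container; with the c10d rendezvous backend the node ranks are
# assigned at rendezvous time, so no explicit node_rank is required.
torchrun \
  --nnodes 2 \
  --nproc_per_node 1 \
  --rdzv_backend c10d \
  --rdzv_endpoint demo-pytorch-dist-elasiticjob-worker-0:23456 \
  --rdzv_id xtuner-llama3-demo \
  /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py \
  --launcher pytorch \
  --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-pytorch-dist-elasiticjob-worker-1/train-work \
  config.py
```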

@apachemycat
Author

They are specified. We launch through elastic distributed training and the values are provided via environment variables, so it is probably some other problem.

@apachemycat
Author

Confirmed: it was an NCCL / torch version mismatch. It is solved now, thanks.
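
For anyone hitting the same symptom, a quick (hypothetical, not from the original thread) sanity check of the torch build and the NCCL version it ships with, run inside each container:

```bash
# Print the installed torch version, its CUDA build, and the NCCL version the torch wheel was built with.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.nccl.version())"
```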
