Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] 训练报错indexSelectLargeIndex: block: [604,0,0], thread: [47,0,0] Assertion srcIndex < srcSelectDimSize failed. #208

Open
canghaiyunfan opened this issue Apr 19, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@canghaiyunfan
Copy link

描述该错误

2024-04-19 06:06:37,071 INFO writer.py:60 in init_tb_writer -- Login tensorboard logs to: RUN/7b_internlm2_train/04-19-06.06.02/tensorboards 2024-04-19 06:06:37,761 ERROR train.py:307 in <module> -- Raise exception from c394df8c9997 with rank id: 0 Traceback (most recent call last): File "/data/InternEvo/train.py", line 305, in <module> main(args) File "/data/InternEvo/train.py", line 215, in main _, _, loss = trainer.execute_schedule( File "/data/InternEvo/internlm/core/trainer.py", line 213, in execute_schedule return self._schedule.forward_backward_step(self._engine, /data_iter, **kwargs) File "/data/InternEvo/internlm/utils/timeout.py", line 102, in wrapper result = func(*args, **kwargs) File "/data/InternEvo/internlm/core/scheduler/no_pipeline_scheduler.py", line 220, in forward_backward_step _output, _loss, _moe_loss = self._train_one_batch( File "/data/InternEvo/internlm/core/scheduler/no_pipeline_scheduler.py", line 125, in _train_one_batch output = self._call_engine(engine, /data) File "/data/InternEvo/internlm/core/scheduler/base_scheduler.py", line 86, in _call_engine return engine(**inputs) File "/data/InternEvo/internlm/core/engine.py", line 164, in __call__ return self.model(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl return forward_call(*args, **kwargs) File "/data/InternEvo/internlm/core/naive_amp.py", line 155, in forward out = self.model(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl return forward_call(*args, **kwargs) File "/data/InternEvo/internlm/model/modeling_internlm2.py", line 934, in forward hidden_states = self.tok_embeddings(input_ids) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl return forward_call(*args, **kwargs) File "/data/InternEvo/internlm/model/modules/embedding.py", line 66, in forward output = F.embedding(input_, self.weight, self.padding_idx, *self.embed_args, **self.embed_kwargs) File "/opt/conda/lib/python3.10/site-packages/torch/nn/functional.py", line 2233, in embedding return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse) RuntimeError: CUDA error: device-side assert triggered Compile with TORCH_USE_CUDA_DSA` to enable device-side assertions.

terminate called after throwing an instance of 'std::runtime_error' what(): [Rank 2] NCCL watchdog thread terminated with exception: CUDA error: device-side assert triggered Compile withTORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f97371d5617 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f973719098d in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f9737286518 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7f973868a150 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f973868df78 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x24b (0x7f97386a47bb in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x7f97386a4ac8 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xd6bf0 (0x7f97a7af3bf0 in /usr/local/gcc-10.2.0/lib64/libstdc++.so.6)
frame #8: + 0x8609 (0x7f97d2bff609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f97d29ca133 in /usr/lib/x86_64-linux-gnu/libc.so.6)

../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [32,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [33,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [34,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [35,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [36,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [37,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [38,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [39,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [40,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [41,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [42,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [43,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [44,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [45,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [46,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [47,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [48,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [49,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [50,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [51,0,0] Assertion `srcIndex < srcSelect

环境信息

软件环境: ubuntu 官方镜像
硬件环境:A800

其他信息

No response

@canghaiyunfan canghaiyunfan added the bug Something isn't working label Apr 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant