Reported by Marius Moisescu. Thanks for reporting!
Hi Keita, how are you?
I am trying to adapt 3.test_cases/8.neuronx-nemo-megatron from https://github.com/aws-samples/awsome-distributed-training/ for our integ tests, essentially as a "training job integ test". It used to work, but after taking a month's break I came back to find that it no longer does.
I figured the package versions had shifted, so I started digging into it.
3.test_cases/8.neuronx-nemo-megatron/1.setup-venv.sh runs python -m pip install neuronx-cc==2.* torch-neuronx torchvision, and 3.test_cases/8.neuronx-nemo-megatron/2.setup-neuronx-nemo-megatron.sh runs pip3 install -r requirements.txt torch==1.13.1 protobuf==3.20.3. However, https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu20-base-dlami.html#setup-torch-neuronx-ubuntu20-base-dlami suggests that to work with 1.13 we need python -m pip install neuronx-cc==2.* torch-neuronx==1.13.* torchvision instead. I tried that (and I also tried 2.1.2 in both installs), but I am getting stuck at 5.precompile-model.sh (4.tokenize.sbatch works).
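To see which versions actually ended up in the venv after both setup scripts have run, something like the following might help. It is only a sketch: the package list is taken from the two install commands above, and the check that torch and torch-neuronx share the same major.minor release is my assumption based on the torch-neuronx==1.13.* pin in the Neuron docs, not a documented rule.

```python
from importlib import metadata

# Packages named in the two setup scripts above; adjust as needed.
PACKAGES = ["neuronx-cc", "torch-neuronx", "torch", "torchvision", "protobuf"]


def installed_versions(names):
    """Return {package: version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def same_major_minor(a, b):
    """True when two version strings share the same major.minor prefix."""
    return a.split(".")[:2] == b.split(".")[:2]


if __name__ == "__main__":
    versions = installed_versions(PACKAGES)
    for name, version in versions.items():
        print(f"{name}: {version or 'not installed'}")

    torch_v = versions["torch"]
    neuronx_v = versions["torch-neuronx"]
    # Assumption: a torch / torch-neuronx pair should agree on major.minor.
    if torch_v and neuronx_v and not same_major_minor(torch_v, neuronx_v):
        print("torch and torch-neuronx are on different major.minor releases")
```

Running this inside the activated venv prints one line per package, which makes it easy to diff against a known-good environment.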
I get no Slurm log for 5.precompile-model.sh, so something must be really off. Earlier I saw errors of the type JobState=PENDING Reason=PartitionConfig (my nodes were just hanging in that state), but now I do not even get that; the job simply fails silently.
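For the integ test itself, it may be useful to surface the job state and reason instead of relying on a log file that may never be written. Below is a minimal sketch that pulls KEY=VALUE fields out of text shaped like the JobState=PENDING Reason=PartitionConfig line above. The helper name parse_job_fields is hypothetical, and how the text is obtained (for example by capturing the output of scontrol show job for the submitted job ID) is left out.

```python
import re


def parse_job_fields(text):
    """Extract KEY=VALUE tokens from Slurm-style job status text.

    Values are taken up to the next whitespace, so fields whose values
    contain spaces are truncated at the first space.
    """
    return dict(re.findall(r"(\w+)=(\S*)", text))


if __name__ == "__main__":
    # Sample line copied from the state described in this issue.
    sample = "JobState=PENDING Reason=PartitionConfig"
    fields = parse_job_fields(sample)
    print(fields.get("JobState"), fields.get("Reason"))
```

A test could then fail fast with the reported Reason when the job sits in PENDING, rather than timing out silently.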