NeuronX Nemo-Megatron test case outdated #274

KeitaW · 2024-04-17T23:20:15Z

Reported by Marius Moisescu. Thanks for reporting!

Hi Keita, how are you?
I am trying to adapt 3.test_cases/8.neuronx-nemo-megatron from https://github.com/aws-samples/awsome-distributed-training/ for our integ tests, essentially a "training job integ test". Things used to work, but after taking a month-long break I came back to find that it no longer works.
I figured that packages had shifted, so I started digging into it.
3.test_cases/8.neuronx-nemo-megatron/1.setup-venv.sh runs python -m pip install neuronx-cc==2.* torch-neuronx torchvision, and 3.test_cases/8.neuronx-nemo-megatron/2.setup-neuronx-nemo-megatron.sh runs pip3 install -r requirements.txt torch==1.13.1 protobuf==3.20.3. However, https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu20-base-dlami.html#setup-torch-neuronx-ubuntu20-base-dlami suggests that to work with 1.13 we need to run python -m pip install neuronx-cc==2.* torch-neuronx==1.13.* torchvision. I tried that (and I also tried 2.1.2 in both installs), but I am getting stuck in 5.precompile-model.sh (4.tokenize.sbatch works).
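One way to confirm whether the two setup scripts left the virtual environment with mismatched packages is to compare the installed torch and torch-neuronx versions. The sketch below is illustrative only and is not part of the test case. It assumes that a torch-neuronx version string starts with the torch major.minor it targets, which is what the torch-neuronx==1.13.* pin implies. The helper names (major_minor, versions_aligned, installed_version) are hypothetical.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def major_minor(ver):
    """Return the leading (major, minor) pair, e.g. '1.13.1' -> (1, 13)."""
    match = re.match(r"(\d+)\.(\d+)", ver)
    if match is None:
        raise ValueError(f"unrecognized version string: {ver!r}")
    return int(match.group(1)), int(match.group(2))


def versions_aligned(torch_ver, torch_neuronx_ver):
    """True when both versions share the same major.minor prefix.

    Assumption: torch-neuronx versions begin with the torch release
    they target (as the torch-neuronx==1.13.* pin suggests).
    """
    return major_minor(torch_ver) == major_minor(torch_neuronx_ver)


def installed_version(package):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    torch_ver = installed_version("torch")
    neuronx_ver = installed_version("torch-neuronx")
    if torch_ver is None or neuronx_ver is None:
        print(f"missing package: torch={torch_ver}, torch-neuronx={neuronx_ver}")
    elif versions_aligned(torch_ver, neuronx_ver):
        print(f"aligned: torch={torch_ver}, torch-neuronx={neuronx_ver}")
    else:
        print(f"mismatch: torch={torch_ver}, torch-neuronx={neuronx_ver}")
```

Running this inside the virtual environment after both setup scripts have finished would show whether the unpinned torch-neuronx install pulled in a release built for a different torch than the torch==1.13.1 pin.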
I get no Slurm log for 5.precompile-model.sh, so something must be really off. Earlier I saw errors of the type JobState=PENDING Reason=PartitionConfig (my nodes were just hanging in that state), but now I do not even get that; the job simply fails silently.
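The JobState=PENDING Reason=PartitionConfig text above looks like the KEY=VALUE fields that scontrol show job prints. That is an assumption, since the report does not say which command produced it. Under that assumption, a minimal sketch for pulling the state and reason out of such output could look like this (parse_key_values is a hypothetical helper, and values containing spaces are not handled):

```python
def parse_key_values(text):
    """Split whitespace-separated KEY=VALUE tokens into a dict.

    Tokens without '=' are ignored. Values that contain spaces are
    not supported by this simple sketch.
    """
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


# Sample text taken from the report; real output has many more fields.
sample = "JobState=PENDING Reason=PartitionConfig"
job = parse_key_values(sample)
print(job.get("JobState"), job.get("Reason"))
```

Logging these two fields right after submitting the precompile job would at least record why a job is held, even when no Slurm log file is written.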

KeitaW self-assigned this Apr 17, 2024

KeitaW added the bug (Something isn't working) label Apr 17, 2024
