NeuronX Nemo-Megatron test case outdated #274

Open
KeitaW opened this issue Apr 17, 2024 · 0 comments


KeitaW commented Apr 17, 2024

Reported by Marius Moisescu. Thanks for reporting!

Hi Keita, how are you?
I am trying to adapt 3.test_cases/8.neuronx-nemo-megatron from https://github.com/aws-samples/awsome-distributed-training/ for our integ tests, essentially as a "training job integ test". Things used to work, but when I came back to it after a month-long break, I noticed that it no longer works.
I figured that package versions have shifted, so I started digging into it.
3.test_cases/8.neuronx-nemo-megatron/1.setup-venv.sh runs python -m pip install neuronx-cc==2.* torch-neuronx torchvision, and 3.test_cases/8.neuronx-nemo-megatron/2.setup-neuronx-nemo-megatron.sh runs pip3 install -r requirements.txt torch==1.13.1 protobuf==3.20.3. However, https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu20-base-dlami.html#setup-torch-neuronx-ubuntu20-base-dlami suggests that to work with 1.13 we need to run python -m pip install neuronx-cc==2.* torch-neuronx==1.13.* torchvision. I tried that (and I also tried 2.1.2 in both installs), but I am getting stuck in 5.precompile-model.sh (4.tokenize.sbatch works).
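One way to confirm the suspected mismatch is to compare the torch and torch-neuronx versions that actually ended up in the virtual environment after both setup scripts have run. The following is a minimal sketch, not part of the test case: the helper names (pkg_version, major_minor, check_torch_pair) are hypothetical, and it assumes that a torch-neuronx version string begins with the torch version it targets.

```shell
#!/usr/bin/env bash
# Sketch: check that torch and torch-neuronx in the active venv agree on major.minor.
# Assumption: the torch-neuronx version string starts with the torch version it targets.

# Print the installed version of a pip package, or nothing if it is not installed.
pkg_version() {
    python -m pip show "$1" 2>/dev/null | awk '/^Version:/ {print $2}'
}

# Reduce a dotted version string to its first two components (major.minor).
major_minor() {
    echo "$1" | cut -d. -f1,2
}

# Print both versions; return 0 only if both are installed and share major.minor.
check_torch_pair() {
    local torch_v neuronx_v
    torch_v="$(pkg_version torch)"
    neuronx_v="$(pkg_version torch-neuronx)"
    echo "torch=${torch_v:-missing} torch-neuronx=${neuronx_v:-missing}"
    [ -n "$torch_v" ] && [ -n "$neuronx_v" ] &&
        [ "$(major_minor "$torch_v")" = "$(major_minor "$neuronx_v")" ]
}
```

After sourcing this in the activated environment, running check_torch_pair || echo "version mismatch" would flag the case where an unpinned torch-neuronx install pulled in a build for a different torch release than the pinned torch==1.13.1.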
I get no Slurm log for 5.precompile-model.sh, so something must be really off. Earlier I saw errors of the type JobState=PENDING Reason=PartitionConfig (my nodes were just hanging in that state), but now I do not even get that; the job simply fails silently.
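When a job leaves no log file, the scheduler's own records are usually the next place to look. The sketch below is only an illustration of that triage: the helper names (job_state_reason, diagnose_job) are hypothetical, the job ID is whatever sbatch printed at submission, and sacct only returns data if job accounting is configured on the cluster.

```shell
#!/usr/bin/env bash
# Sketch: gather what Slurm knows about a job that produced no log file.

# Read `scontrol show job` output on stdin and keep only the state, the
# pending/failure reason, and the path where stdout was supposed to go.
job_state_reason() {
    grep -oE '(JobState|Reason|StdOut)=[^ ]+' | tr '\n' ' ' | sed 's/ $//'
}

# Print scheduler-side information for one job ID.
diagnose_job() {
    local job_id="$1"
    # Only works while the controller still remembers the job.
    scontrol show job "$job_id" | job_state_reason
    echo
    # Accounting record: survives after the job has left the controller.
    sacct -j "$job_id" --format=JobID,JobName,State,ExitCode,Reason
    # Per-node state, to spot down or drained nodes.
    sinfo -N -l
}
```

For example, diagnose_job 42 (with 42 standing in for the real job ID) would show whether the job is still pending with Reason=PartitionConfig, which generally means the request does not fit the partition's configured limits, or whether it ran and exited with a non-zero code before writing to the StdOut path.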

@KeitaW KeitaW self-assigned this Apr 17, 2024
@KeitaW KeitaW added the bug Something isn't working label Apr 17, 2024