Documentation for using Kuberay #266

Open
karthik-nexusflow opened this issue Apr 10, 2024 · 4 comments

Comments

@karthik-nexusflow

Hi Team,
It would be great if the KubeRay commands for running OpenRLHF were added to the docs, to make the initial setup easier.

@karthik-nexusflow
Author

You could also just dump the commands you use, and I can help with the docs from a user's perspective once I get it set up.

@wuxibin89
Collaborator

wuxibin89 commented Apr 11, 2024

@karthik-nexusflow Setting up a Ray cluster and submitting an OpenRLHF job to that cluster are two separate stages.

  1. To set up a multi-node Ray cluster, there are plenty of options depending on your infrastructure.
  • If you already run ML workflows on Kubernetes, then KubeRay is the best option for launching a Ray cluster.
  • If you only have a few nodes (e.g. 3 to 5), then manually starting the Ray head and worker nodes is the simplest way:
# start head node first
ray start --head --port=6379 --node-ip-address=10.0.0.1

# start worker node 1
ray start --node-ip-address=10.0.0.2 --address=10.0.0.1:6379

# start worker node 2
ray start --node-ip-address=10.0.0.3 --address=10.0.0.1:6379
  2. After your Ray cluster is set up, submit the OpenRLHF job to the Ray cluster dashboard as shown below:
ray job submit --address="http://127.0.0.1:8265" \
    --runtime-env-json='{"working_dir": "/openrlhf", "pip": "/openrlhf/requirements.txt"}' \
    --no-wait \
    -- python3 examples/train_ppo_ray.py \
    ...

Stage 2 is independent of how you launch the Ray cluster, and you can submit multiple jobs to the same cluster.
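
For the manual option in stage 1, the per-node commands can be generated from a list of addresses. The following is a minimal sketch, not part of OpenRLHF or Ray: the function name `print_ray_start_commands` and the variables `HEAD_IP`, `WORKER_IPS`, and `PORT` are hypothetical, and the addresses are the example values used above. It only prints the `ray start` commands so they can be reviewed and then run on each node; it does not start Ray itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints one `ray start` command per node.
# It only echoes the commands; nothing is started.

HEAD_IP="10.0.0.1"                # head node address (example value)
WORKER_IPS="10.0.0.2 10.0.0.3"    # space-separated worker addresses (example values)
PORT=6379                         # port the head node listens on

print_ray_start_commands() {
    # The head node must be started first.
    echo "ray start --head --port=${PORT} --node-ip-address=${HEAD_IP}"
    # Each worker joins the cluster by pointing at the head node's address.
    for ip in ${WORKER_IPS}; do
        echo "ray start --node-ip-address=${ip} --address=${HEAD_IP}:${PORT}"
    done
}

print_ray_start_commands
```

Once the printed commands have been run on their respective nodes, running `ray status` on the head node should list the nodes that joined the cluster.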

@karthik-nexusflow
Author

Thank you.

For 1 (KubeRay): it would be great if you could share the Dockerfile you are using.

For 2: setting up passwordless SSH has some issues on our cluster. Is it strictly necessary for that approach? When you tried that method, how did you go about it?

@hijkzzz
Collaborator

hijkzzz commented Apr 12, 2024

> Thank you.
>
> For 1 (KubeRay): it would be great if you could share the Dockerfile you are using.
>
> For 2: setting up passwordless SSH has some issues on our cluster. Is it strictly necessary for that approach? When you tried that method, how did you go about it?

We provide a vLLM-based Dockerfile here: https://github.com/OpenLLMAI/OpenRLHF/tree/main/dockerfile
You can use it as a starting point and modify it for your setup.
