The configuration for Llama-7b on 4 RTX4090 #269

Open · LinkyLiu opened this issue Apr 15, 2024 · 5 comments


LinkyLiu commented Apr 15, 2024

Hello, I want to run train_ppo_llama_ray.sh on 4 RTX 4090s. Should I modify actor_num_gpus_per_node/critic_num_gpus_per_node in train_ppo_llama_ray.sh? Since the default script is for 8 GPUs, what else should I pay attention to or modify?

hijkzzz (Collaborator) commented Apr 16, 2024

Set the actor, critic, rm, and init (reference) models to 1 GPU each (nodes = 1,1,1,1), with Adam offload + gradient checkpointing.
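
As a rough sketch, this maps to the following placement flags in the launch script (flag names taken from the full config shared later in this thread; this is not a complete command, only the placement and memory-saving flags):

```shell
# Sketch only: placement flags for a single node with 4x RTX 4090,
# giving the actor, critic, reward model, and reference (init) model
# one GPU each, plus the memory-saving options mentioned above.
python3 examples/train_ppo_ray.py \
  --actor_num_nodes 1  --actor_num_gpus_per_node 1 \
  --critic_num_nodes 1 --critic_num_gpus_per_node 1 \
  --reward_num_nodes 1 --reward_num_gpus_per_node 1 \
  --ref_num_nodes 1    --ref_num_gpus_per_node 1 \
  --adam_offload \
  --gradient_checkpointing
  # ...remaining model/data/training flags as in train_ppo_llama_ray.sh
```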

LinkyLiu (Author) commented

@hijkzzz Thank you for replying! But I ran into this problem, do you know how to solve it?

```
Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/bin/python3.10/dist-packages/ray/_private/worker.py", line 866, in get_objects
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
        class_name: ActorModelRayActor
        actor_id: 53688e714f4881c3b3028ed402000000
        pid: 3752
        namespace: f4c18cbd-bbfb-4d8b-acf3-3aa591111fe9
        ip: 0.0.0.0
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
```
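
The exit detail above lists three possible causes; on a single node with four 24 GB cards, the OOM killer is a common culprit, and it can be checked from the host (assuming a Linux node with dmesg access):

```shell
# Did the kernel OOM killer terminate a Ray worker? (needs root/dmesg permission)
sudo dmesg -T | grep -i -E "out of memory|killed process"
```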

hijkzzz (Collaborator) commented Apr 18, 2024

Do you have more detailed logs, your running environment, and the launch commands?

libowen424 commented

I succeeded with the following configuration:

```shell
set -x
export PATH=$HOME/.local/bin/:$PATH

ray job submit --address="http://127.0.0.1:8265" \
   --runtime-env-json='{"working_dir": "/openrlhf", "pip": "/openrlhf/requirements.txt"}' \
   -- python3 examples/train_ppo_ray.py \
   --ref_num_nodes 1 \
   --ref_num_gpus_per_node 1 \
   --reward_num_nodes 1 \
   --reward_num_gpus_per_node 1 \
   --critic_num_nodes 1 \
   --critic_num_gpus_per_node 1 \
   --actor_num_nodes 1 \
   --actor_num_gpus_per_node 1 \
   --pretrain /root/.cache/huggingface/hub/llama-2-7b-chat-hf \
   --reward_pretrain /root/.cache/huggingface/hub/models--OpenLLMAI--Llama-2-7b-rm-anthropic_hh-lmsys-oasst-webgpt/snapshots/a982afeed00fac9767d53aecde5b88947b1be194 \
   --save_path /openrlhf/examples/test_scripts/ckpt/7b_llama \
   --micro_train_batch_size 2 \
   --train_batch_size 128 \
   --micro_rollout_batch_size 4 \
   --rollout_batch_size 1024 \
   --max_epochs 1 \
   --prompt_max_len 1024 \
   --generate_max_len 1024 \
   --zero_stage 2 \
   --bf16 \
   --actor_learning_rate 5e-7 \
   --critic_learning_rate 9e-6 \
   --init_kl_coef 0.01 \
   --prompt_data Open-Orca/OpenOrca,Dahoas/full-hh-rlhf,tasksource/oasst1_pairwise_rlhf_reward \
   --prompt_data_probs 0.4,0.5,0.1 \
   --max_samples 80000 \
   --normalize_reward \
   --actor_init_on_gpu \
   --adam_offload \
   --flash_attn \
   --gradient_checkpointing \
   --lora_rank 4
```

wuxibin89 (Collaborator) commented

@LinkyLiu The Ray actor has died unexpectedly; please check the Ray logs in /tmp/ray/session_latest/logs/ (raylet.out, raylet.err, job-xxx.log). There should be more information about why the actor died.
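
A minimal sketch of how to inspect those logs (standard Ray default paths; exact file names depend on the session and job submission ID):

```shell
# session_latest is a symlink to the most recent Ray session directory
cd /tmp/ray/session_latest/logs/

# raylet-level errors (worker crashes, OOM kills reported by the raylet)
tail -n 100 raylet.out raylet.err

# driver/job logs -- the file name contains the job submission ID
grep -i -E "error|oom|killed" job-*.log
```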
