The configuration for Llama-7b on 4 RTX4090 #269

Open · LinkyLiu opened this issue Apr 15, 2024 · 5 comments


LinkyLiu commented Apr 15, 2024

Hello, I want to run train_ppo_llama_ray.sh on 4 RTX 4090s. Should I modify actor_num_gpus_per_node/critic_num_gpus_per_node in train_ppo_llama_ray.sh? Since the default script is for 8 GPUs, what else should I pay attention to or modify?

hijkzzz (Collaborator) commented Apr 16, 2024

Set the actor, critic, rm, and init (reference) models to 1 GPU each (nodes = 1,1,1,1), with Adam offload + gradient checkpointing.
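
As a rough sketch, this maps to the following placement flags in the launch script (flag names taken from the full config shared later in this thread; this is not a complete command, only the placement and memory-saving flags):

```shell
# Sketch only: placement flags for a single node with 4x RTX 4090,
# giving the actor, critic, reward model, and reference (init) model
# one GPU each, plus the memory-saving options mentioned above.
python3 examples/train_ppo_ray.py \
  --actor_num_nodes 1  --actor_num_gpus_per_node 1 \
  --critic_num_nodes 1 --critic_num_gpus_per_node 1 \
  --reward_num_nodes 1 --reward_num_gpus_per_node 1 \
  --ref_num_nodes 1    --ref_num_gpus_per_node 1 \
  --adam_offload \
  --gradient_checkpointing
  # ...remaining model/data/training flags as in train_ppo_llama_ray.sh
```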

LinkyLiu (Author) commented

@hijkzzz Thank you for replying! But I ran into this problem, do you know how to solve it?

```
Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/bin/python3.10/dist-packages/ray/_private/worker.py", line 866, in get_objects
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
        class_name: ActorModelRayActor
        actor_id: 53688e714f4881c3b3028ed402000000
        pid: 3752
        namespace: f4c18cbd-bbfb-4d8b-acf3-3aa591111fe9
        ip: 0.0.0.0
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
```
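
The exit detail above lists three possible causes; on a single node with four 24 GB cards, the OOM killer is a common culprit, and it can be checked from the host (assuming a Linux node with dmesg access):

```shell
# Did the kernel OOM killer terminate a Ray worker? (needs root/dmesg permission)
sudo dmesg -T | grep -i -E "out of memory|killed process"
```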

hijkzzz (Collaborator) commented Apr 18, 2024

Do you have more detailed logs, your running environment, and the launch commands?

libowen424 commented

I succeeded with the following configuration:

```shell
set -x
export PATH=$HOME/.local/bin/:$PATH

ray job submit --address="http://127.0.0.1:8265" \
   --runtime-env-json='{"working_dir": "/openrlhf", "pip": "/openrlhf/requirements.txt"}' \
   -- python3 examples/train_ppo_ray.py \
   --ref_num_nodes 1 \
   --ref_num_gpus_per_node 1 \
   --reward_num_nodes 1 \
   --reward_num_gpus_per_node 1 \
   --critic_num_nodes 1 \
   --critic_num_gpus_per_node 1 \
   --actor_num_nodes 1 \
   --actor_num_gpus_per_node 1 \
   --pretrain /root/.cache/huggingface/hub/llama-2-7b-chat-hf \
   --reward_pretrain /root/.cache/huggingface/hub/models--OpenLLMAI--Llama-2-7b-rm-anthropic_hh-lmsys-oasst-webgpt/snapshots/a982afeed00fac9767d53aecde5b88947b1be194 \
   --save_path /openrlhf/examples/test_scripts/ckpt/7b_llama \
   --micro_train_batch_size 2 \
   --train_batch_size 128 \
   --micro_rollout_batch_size 4 \
   --rollout_batch_size 1024 \
   --max_epochs 1 \
   --prompt_max_len 1024 \
   --generate_max_len 1024 \
   --zero_stage 2 \
   --bf16 \
   --actor_learning_rate 5e-7 \
   --critic_learning_rate 9e-6 \
   --init_kl_coef 0.01 \
   --prompt_data Open-Orca/OpenOrca,Dahoas/full-hh-rlhf,tasksource/oasst1_pairwise_rlhf_reward \
   --prompt_data_probs 0.4,0.5,0.1 \
   --max_samples 80000 \
   --normalize_reward \
   --actor_init_on_gpu \
   --adam_offload \
   --flash_attn \
   --gradient_checkpointing \
   --lora_rank 4
```

wuxibin89 (Collaborator) commented

@LinkyLiu The Ray actor has died unexpectedly; please check the Ray logs in /tmp/ray/session_latest/logs/ (raylet.out, raylet.err, job-xxx.log). There should be more information about why the actor died.
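
A minimal sketch of how to inspect those logs (standard Ray default paths; exact file names depend on the session and job submission ID):

```shell
# session_latest is a symlink to the most recent Ray session directory
cd /tmp/ray/session_latest/logs/

# raylet-level errors (worker crashes, OOM kills reported by the raylet)
tail -n 100 raylet.out raylet.err

# driver/job logs -- the file name contains the job submission ID
grep -i -E "error|oom|killed" job-*.log
```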
