ray serve gets stuck when loading two or more applications #3

Open
Dolfik1 opened this issue Jan 11, 2024 · 1 comment

Comments


Dolfik1 commented Jan 11, 2024

This is my .yaml configuration file:

# Serve config file
#
# For documentation see: 
# https://docs.ray.io/en/latest/serve/production-guide/config.html

host: 0.0.0.0
port: 8000

applications:
- name: demo_app
  route_prefix: /a
  import_path: ray_vllm_inference.vllm_serve:deployment

  runtime_env:
    env_vars:
      HUGGING_FACE_HUB_TOKEN: hf_1234
    pip:
    - ray_vllm_inference @ git+https://github.com/asprenger/ray_vllm_inference
  args:
    model: facebook/opt-13b
    tensor_parallel_size: 4
  deployments:
  - name: VLLMInference
    num_replicas: 1
    # Maximum backlog for a single replica
    max_concurrent_queries: 10
    ray_actor_options:
      num_gpus: 4

- name: demo_app2
  route_prefix: /b
  import_path: ray_vllm_inference.vllm_serve:deployment

  runtime_env:
    env_vars:
      HUGGING_FACE_HUB_TOKEN: hf_1234
    pip:
    - ray_vllm_inference @ git+https://github.com/asprenger/ray_vllm_inference
  args:
    model: facebook/opt-13b
    tensor_parallel_size: 4
  deployments:
  - name: VLLMInference
    num_replicas: 1
    # Maximum backlog for a single replica
    max_concurrent_queries: 10
    ray_actor_options:
      num_gpus: 4
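
Each application above requests a single replica with num_gpus: 4, so the file asks for 2 × 4 = 8 GPUs in total. A quick sketch to tally that from the file itself (assuming PyYAML is installed and the config is saved as config2.yaml):

# Sketch: sum the GPUs requested by a multi-application Serve config.
# Assumes PyYAML is installed and the config is saved as config2.yaml.
import yaml

with open("config2.yaml") as f:
    config = yaml.safe_load(f)

total_gpus = 0
for app in config.get("applications", []):
    for deployment in app.get("deployments", []):
        replicas = deployment.get("num_replicas", 1)
        gpus = deployment.get("ray_actor_options", {}).get("num_gpus", 0)
        total_gpus += replicas * gpus

print(f"GPUs requested across all applications: {total_gpus}")  # prints 8 for this config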

I attempt to execute this configuration with the command serve run config2.yaml. However, the deployment process gets stuck and never completes. Here are the logs:

2024-01-11 12:58:28,970 INFO scripts.py:442 -- Running config file: 'config2.yaml'.
2024-01-11 12:58:30,870 INFO worker.py:1664 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
2024-01-11 12:58:33,757 SUCC scripts.py:543 -- Submitted deploy config successfully.
(ServeController pid=1450442) INFO 2024-01-11 12:58:33,752 controller 1450442 application_state.py:386 - Building application 'demo_app'.
(ServeController pid=1450442) INFO 2024-01-11 12:58:33,756 controller 1450442 application_state.py:386 - Building application 'demo_app2'.
(ProxyActor pid=1450530) INFO 2024-01-11 12:58:33,727 proxy 10.10.29.89 proxy.py:1072 - Proxy actor 4b0df404e3c5af4bd834d1ab01000000 starting on node b411128da157f5f64092128c212c0000973bfecd12b3e94b3d648495.
(ProxyActor pid=1450530) INFO 2024-01-11 12:58:33,732 proxy 10.10.29.89 proxy.py:1257 - Starting HTTP server on node: b411128da157f5f64092128c212c0000973bfecd12b3e94b3d648495 listening on port 8000
(ProxyActor pid=1450530) INFO:     Started server process [1450530]
(ServeController pid=1450442) INFO 2024-01-11 12:58:42,180 controller 1450442 application_state.py:477 - Built application 'demo_app' successfully.
(ServeController pid=1450442) INFO 2024-01-11 12:58:42,182 controller 1450442 deployment_state.py:1379 - Deploying new version of deployment VLLMInference in application 'demo_app'.
(ServeController pid=1450442) INFO 2024-01-11 12:58:42,284 controller 1450442 deployment_state.py:1668 - Adding 1 replica to deployment VLLMInference in application 'demo_app'.
(ServeController pid=1450442) INFO 2024-01-11 12:58:42,302 controller 1450442 application_state.py:477 - Built application 'demo_app2' successfully.
(ServeController pid=1450442) INFO 2024-01-11 12:58:42,304 controller 1450442 deployment_state.py:1379 - Deploying new version of deployment VLLMInference in application 'demo_app2'.
(ServeController pid=1450442) INFO 2024-01-11 12:58:42,406 controller 1450442 deployment_state.py:1668 - Adding 1 replica to deployment VLLMInference in application 'demo_app2'.
(ServeReplica:demo_app:VLLMInference pid=1468450) INFO 2024-01-11 12:58:45,015 VLLMInference demo_app#VLLMInference#WArOfC vllm_serve.py:76 - AsyncEngineArgs(model='facebook/opt-13b', tokenizer='facebook/opt-13b', tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', seed=0, max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, block_size=16, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, revision=None, tokenizer_revision=None, quantization=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
(ServeReplica:demo_app2:VLLMInference pid=1468458) INFO 2024-01-11 12:58:45,021 VLLMInference demo_app2#VLLMInference#xOjgzS vllm_serve.py:76 - AsyncEngineArgs(model='facebook/opt-13b', tokenizer='facebook/opt-13b', tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', seed=0, max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, block_size=16, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, revision=None, tokenizer_revision=None, quantization=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
(ServeReplica:demo_app:VLLMInference pid=1468450) SIGTERM handler is not set because current thread is not the main thread.
(ServeReplica:demo_app:VLLMInference pid=1468450) Calling ray.init() again after it has already been called.
(ServeController pid=1450442) WARNING 2024-01-11 12:59:12,292 controller 1450442 deployment_state.py:1996 - Deployment 'VLLMInference' in application 'demo_app' has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method.
(ServeReplica:demo_app2:VLLMInference pid=1468458) SIGTERM handler is not set because current thread is not the main thread.
(ServeReplica:demo_app2:VLLMInference pid=1468458) Calling ray.init() again after it has already been called.
(ServeController pid=1450442) WARNING 2024-01-11 12:59:12,494 controller 1450442 deployment_state.py:1996 - Deployment 'VLLMInference' in application 'demo_app2' has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method.
(ServeController pid=1450442) WARNING 2024-01-11 12:59:42,363 controller 1450442 deployment_state.py:1996 - Deployment 'VLLMInference' in application 'demo_app' has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method.
(ServeController pid=1450442) WARNING 2024-01-11 12:59:42,566 controller 1450442 deployment_state.py:1996 - Deployment 'VLLMInference' in application 'demo_app2' has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method.
(ServeController pid=1450442) WARNING 2024-01-11 13:00:12,441 controller 1450442 deployment_state.py:1996 - Deployment 'VLLMInference' in application 'demo_app' has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method.
(ServeController pid=1450442) WARNING 2024-01-11 13:00:12,645 controller 1450442 deployment_state.py:1996 - Deployment 'VLLMInference' in application 'demo_app2' has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method.

Interestingly, when I disable the demo_app2 application by commenting it out in the configuration, the deployment proceeds without any issues. I have 8 GPUs on my server, which should be enough for the configuration provided above (two applications requesting 4 GPUs each).
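
As a sanity check on the cluster side, a sketch like the following (assuming it is run on the same node while the Serve instance is still up) compares the GPUs Ray sees with the GPUs still unclaimed:

# Sketch: attach to the running Ray instance and report GPU capacity.
# Assumes this runs on the same node as the `serve run` command above.
import ray

ray.init(address="auto")  # attach to the existing cluster, don't start a new one
print("GPUs in cluster:", ray.cluster_resources().get("GPU", 0))
print("GPUs available: ", ray.available_resources().get("GPU", 0))
# With 8 GPUs total, two 4-GPU replicas should both be schedulable.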

I've also attempted to create my own deployment in Python, bypassing the ray_vllm_inference library, but I encountered the same problem. I noticed that the vLLM application seems to be using the wrong GPUs: when I logged the CUDA_VISIBLE_DEVICES variable in the initialization function, it showed 0,1,2,3, but according to nvidia-smi, vLLM is actually using GPUs 4,5,6,7.
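
For illustration, that kind of check boils down to something like this (a minimal sketch with placeholder names, not the actual deployment code):

# Sketch: a 4-GPU Serve deployment that only reports which GPUs Ray assigned to it.
# The class name and attribute names are placeholders for illustration.
import os
import ray
from ray import serve

@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 4})
class GpuEnvCheck:
    def __init__(self):
        # Ray sets CUDA_VISIBLE_DEVICES per replica based on the GPUs it assigned.
        print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
        print("ray.get_gpu_ids()    =", ray.get_gpu_ids())

    async def __call__(self, request) -> str:
        return os.environ.get("CUDA_VISIBLE_DEVICES", "")

deployment = GpuEnvCheck.bind()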

In an attempt to troubleshoot, I created a custom deployment using the SDXL model (also as two applications). This worked perfectly, with the model using exactly the GPUs specified in the CUDA_VISIBLE_DEVICES variable.


Dolfik1 commented Jan 11, 2024

I've found the problem and posted it here.
