
How to speed up Llama3-70B inference? #163

Closed
yuanjunchai opened this issue Apr 28, 2024 · 1 comment
Labels
model-usage Issues related to how models are used/loaded

Comments

@yuanjunchai

Hi Llama3 team,

Could you help me figure out how to speed up inference for the 70B model?
A single generation currently takes more than 50 seconds, and I have tried TensorRT but the speedup is not very noticeable.

@subramen
Contributor

subramen commented May 1, 2024

Hi, you could try using torch.compile(mode='reduce-overhead') to speed up inference with CUDA graphs. We have some examples using vLLM here: https://github.com/meta-llama/llama-recipes
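
A minimal sketch of the torch.compile suggestion above, assuming a Hugging Face Transformers checkpoint; the model id, dtype, prompt, and device placement below are illustrative assumptions, not details from this thread:

```python
# Illustrative sketch only: model id, dtype, and prompt are placeholders,
# not taken from this issue.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision so the 70B weights fit across GPUs
    device_map="auto",           # shard layers over the available GPUs
)
model.eval()

# mode="reduce-overhead" captures CUDA graphs, amortizing per-token
# kernel-launch overhead; the first few calls are slow while compilation
# and graph capture happen.
model.forward = torch.compile(model.forward, mode="reduce-overhead")

prompt = "Explain KV caching in one paragraph."  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

For higher-throughput or batched serving, the vLLM-based examples in the llama-recipes repo linked above are likely a better fit than single-request generation, since vLLM adds continuous batching and paged KV-cache management on top of the model.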

@subramen subramen added the model-usage Issues related to how models are used/loaded label May 1, 2024
@jspisak jspisak closed this as completed May 14, 2024