
How to speed up Llama3-70B inference? #163

Closed
yuanjunchai opened this issue Apr 28, 2024 · 1 comment
Labels
model-usage Issues related to how models are used/loaded

Comments

@yuanjunchai

Hi Llama3 team,

Could you help me figure out how to speed up inference for the 70B model?
A single generation currently takes more than 50 seconds, and I have tried TensorRT but the speedup is not very noticeable.

@subramen
Contributor

subramen commented May 1, 2024

Hi, you could try using torch.compile(mode='reduce-overhead') to speed up inference with CUDA graphs. We have some examples using vLLM here: https://github.com/meta-llama/llama-recipes
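
A minimal sketch of the torch.compile suggestion above, assuming a Hugging Face Transformers checkpoint; the model id, dtype, prompt, and device placement below are illustrative assumptions, not details from this thread:

```python
# Illustrative sketch only: model id, dtype, and prompt are placeholders,
# not taken from this issue.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision so the 70B weights fit across GPUs
    device_map="auto",           # shard layers over the available GPUs
)
model.eval()

# mode="reduce-overhead" captures CUDA graphs, amortizing per-token
# kernel-launch overhead; the first few calls are slow while compilation
# and graph capture happen.
model.forward = torch.compile(model.forward, mode="reduce-overhead")

prompt = "Explain KV caching in one paragraph."  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

For higher-throughput or batched serving, the vLLM-based examples in the llama-recipes repo linked above are likely a better fit than single-request generation, since vLLM adds continuous batching and paged KV-cache management on top of the model.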

@subramen subramen added the model-usage Issues related to how models are used/loaded label May 1, 2024
@jspisak jspisak closed this as completed May 14, 2024