Hi Llama3 team,
Could you help me figure out how to speed up inference for the 70B model? A single request takes more than 50s to complete, and I have tried TensorRT but saw no apparent speedup.
Hi, you could try using torch.compile(mode='reduce-overhead') to speed up inference with CUDA graphs. We also have some examples using vLLM here: https://github.com/meta-llama/llama-recipes
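A minimal sketch of the torch.compile suggestion, assuming a Hugging Face checkpoint of the 70B model (the model ID, dtype, and generation settings below are illustrative, not from this thread):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint; substitute whatever weights you are serving.
model_id = "meta-llama/Meta-Llama-3-70B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # shard across available GPUs
)
model.eval()

# "reduce-overhead" captures the forward pass into CUDA graphs,
# cutting per-token kernel-launch overhead during decoding.
model.forward = torch.compile(model.forward, mode="reduce-overhead")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Note that the first few generations will be slow while compilation and graph capture run; the speedup shows up on subsequent calls. If throughput or single-request latency is still the bottleneck, a serving engine such as vLLM (referenced in the llama-recipes examples) is often the bigger win. A hedged sketch, assuming an 8-GPU node for tensor parallelism:

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size=8 assumes one 8-GPU node; adjust to your hardware.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=8,
)
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```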