What's the optimal parallel strategy using TensorRT-LLM? #8

iteratorlee · 2024-03-28T09:51:00Z

First, thanks for your great work. I read the PR you opened in the TensorRT-LLM repo and noticed that EP + TP, PP + TP, and TP alone are supported during inference. Which of these is optimal? Specifically, for the MoE layer, does EP or TP yield better performance?

hanlint · 2024-03-28T13:41:37Z

cc: @megha95

dskhudia · 2024-03-28T17:09:48Z

TP is the better choice: at lower batch sizes it gives better load balance across GPUs. At higher batch sizes, TP and EP should perform similarly. Note that we haven't benchmarked EP yet.
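
To illustrate the load-balance point, here is a toy simulation, not TensorRT-LLM code and not a benchmark. It assumes uniform random top-k routing and hypothetical sizes (8 experts, top-2, 4 GPUs). Under EP each GPU owns whole experts, so with few tokens the busiest GPU can end up with most of the work. Under TP every expert is sharded across all GPUs, so each GPU always gets an equal share.

```python
import random


def ep_max_load_share(num_tokens, num_experts=8, top_k=2, num_gpus=4, seed=0):
    """Share of expert-token work on the busiest GPU under EP (toy model).

    Assumes uniform random routing and experts split evenly across GPUs;
    all sizes here are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)
    experts_per_gpu = num_experts // num_gpus
    load = [0] * num_gpus
    for _ in range(num_tokens):
        # Each token is routed to top_k distinct experts.
        for expert in rng.sample(range(num_experts), top_k):
            load[expert // experts_per_gpu] += 1
    return max(load) / sum(load)


def tp_max_load_share(num_gpus=4):
    """Under TP every expert is sharded across all GPUs, so the split is even."""
    return 1.0 / num_gpus


if __name__ == "__main__":
    for batch in (1, 4, 16, 256, 4096):
        print(batch, round(ep_max_load_share(batch), 3), tp_max_load_share())
```

In this model the busiest-GPU share under EP is never below the TP share and approaches it as the batch grows, which matches the expectation that the two converge at higher batch sizes. It ignores communication cost and kernel efficiency, so it says nothing about absolute throughput.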

hanlint assigned megha95 Mar 28, 2024
