Currently, we have a "bucketing scheduler" which works based on one condition: KV cache compatibility. We bucket sequences by KV cache length and run the shortest ones first. This can starve older (and longer) sequences of resources while the others catch up, and in the worst case can stall them entirely.
A multiplexing scheduler would ensure that older sequences still get to run while bucketing continues. We can multiplex sequences by defining a scheduling period $p$, the number of scheduling passes during which the lower-priority sequences are not run. For example, if $p=2$, the bucketed sequences run for 2 scheduling cycles, after which the bucketing criterion is temporarily altered and all older sequences are run.
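A minimal sketch of how the period $p$ could drive batch selection. The names `Seq`, `select_batch`, `pass_index`, and `period` are hypothetical and not taken from the codebase. It also assumes that "running the older sequences" means selecting the KV-length bucket that contains the oldest sequence, which is only one possible reading of the proposal.

```python
from dataclasses import dataclass


@dataclass
class Seq:
    seq_id: int
    kv_len: int   # current KV cache length
    arrival: int  # arrival order; smaller means older (assumed field)


def select_batch(seqs, pass_index, period):
    """Pick the sequences to run on this scheduling pass.

    For `period` consecutive passes the shortest-KV bucket runs (the
    current bucketing behaviour). On the following pass the criterion is
    temporarily switched to the bucket that holds the oldest sequence.
    """
    if not seqs:
        return []
    if pass_index % (period + 1) == period:
        # Multiplexed pass: anchor the bucket on the oldest sequence.
        oldest = min(seqs, key=lambda s: s.arrival)
        target_len = oldest.kv_len
    else:
        # Bucketed pass: anchor the bucket on the shortest KV cache.
        target_len = min(s.kv_len for s in seqs)
    # Only sequences with a matching KV length are batch-compatible.
    return [s for s in seqs if s.kv_len == target_len]


seqs = [Seq(0, kv_len=10, arrival=0), Seq(1, kv_len=3, arrival=1), Seq(2, kv_len=3, arrival=2)]
for i in range(3):
    print(i, [s.seq_id for s in select_batch(seqs, i, period=2)])
```

With `period=2`, passes 0 and 1 pick the two short sequences and pass 2 picks the old, long one. A real scheduler would also advance `kv_len` after each pass, which is left out here.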
#262 introduces dynamic LoRA swapping, which will allow per-sequence adapter activation. We can account for this in the multiplexing scheme by grouping sequences with the same adapter activations: sequences would be bucketed first by KV cache compatibility and then by LoRA adapter compatibility. The multiplexing case would also need to be altered accordingly.
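The two-level bucketing could be sketched roughly as follows. This assumes each sequence carries a single adapter identifier (`adapter`, with `None` meaning the base model); that field, `bucket_sequences`, and `next_bucket` are hypothetical names, and the tie-break between adapters with equal KV length is an arbitrary choice for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Seq:
    seq_id: int
    kv_len: int             # current KV cache length
    adapter: Optional[str]  # hypothetical adapter id; None = no adapter


def bucket_sequences(seqs):
    """Group sequences by (KV cache length, adapter id)."""
    buckets = defaultdict(list)
    for s in seqs:
        buckets[(s.kv_len, s.adapter)].append(s)
    return dict(buckets)


def next_bucket(seqs):
    """Return the bucket with the shortest KV length.

    Ties on KV length are broken by adapter id, base model first
    (an assumption made only to keep the sketch deterministic).
    """
    buckets = bucket_sequences(seqs)
    if not buckets:
        return []
    key = min(buckets, key=lambda k: (k[0], k[1] is not None, k[1] or ""))
    return buckets[key]


seqs = [Seq(0, 5, "a"), Seq(1, 5, "b"), Seq(2, 5, "a"), Seq(3, 9, None)]
print([s.seq_id for s in next_bucket(seqs)])
```

Keying on the pair keeps each batch compatible on both criteria, at the cost of smaller batches when many adapters are active at once.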
The goal is to reduce sequence starvation and also support per-request LoRA adapter activation.