Batched Prefill #219
Conversation
Code Metrics Report
───────────────────────────────────────────────────────────────────
Language   Files   Lines   Blanks   Comments   Code    Complexity
Rust       70      23339   1550     508        21281   1281
───────────────────────────────────────────────────────────────────
Total      70      23339   1550     508        21281   1281
───────────────────────────────────────────────────────────────────
Estimated Cost to Develop: 69,864
Estimated Schedule Effort: 11.811066 months
Estimated People Required: 5.038645
───────────────────────────────────────────────────────────────────
Processed 768517 bytes, 0.769 megabytes (SI)
I think this looks good as an initial implementation! One improvement I see is to allow a batch size greater than one. To do this, we should modify the scheduler so that it schedules prompt sequences with token length > 512 together. Alternatively, we could leave the scheduler alone and instead modify get_prompt_input to return some sort of iterator over the chunks, in which some sequences would not be present. I think the latter is definitely more complicated, so it is probably the worse option, although I am not sure about the performance cost.
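A rough sketch of what that chunked get_prompt_input could look like, assuming a fixed 512-token chunk size and a hypothetical PromptChunk type; none of these names are the actual mistral.rs API:

```rust
// Hypothetical sketch only: chunk each prompt into 512-token slices so prefill
// can run one chunk at a time across the batch. A sequence with no tokens left
// at a given chunk index is simply absent from that chunk.
const CHUNK_SIZE: usize = 512;

struct PromptChunk {
    /// (sequence index in the batch, tokens belonging to this chunk)
    entries: Vec<(usize, Vec<u32>)>,
}

fn chunked_prompt_input(prompts: &[Vec<u32>]) -> impl Iterator<Item = PromptChunk> + '_ {
    let max_chunks = prompts
        .iter()
        .map(|p| (p.len() + CHUNK_SIZE - 1) / CHUNK_SIZE)
        .max()
        .unwrap_or(0);

    (0..max_chunks).map(move |chunk_idx| PromptChunk {
        entries: prompts
            .iter()
            .enumerate()
            .filter_map(|(seq_idx, toks)| {
                let start = chunk_idx * CHUNK_SIZE;
                if start >= toks.len() {
                    return None;
                }
                let end = (start + CHUNK_SIZE).min(toks.len());
                Some((seq_idx, toks[start..end].to_vec()))
            })
            .collect(),
    })
}
```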
If we are to continue with that approach, we must change the scheduler to forbid it from scheduling more than 512 tokens for completion.
Is this necessary to avoid the memory spikes? If so, we should make this feature a new SchedulerMethod.
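If that cap does turn out to be needed, a new SchedulerMethod variant could express it roughly as below; this is a sketch only, and the enum and function names are made up rather than the existing mistral.rs scheduler types:

```rust
// Hypothetical sketch only: a SchedulerMethod variant that also caps how many
// prompt tokens are admitted per scheduling step, so a prefill batch cannot
// exceed a fixed token budget.
enum SchedulerMethod {
    /// Existing style of policy: schedule up to `max_seqs` sequences.
    FixedSeqs { max_seqs: usize },
    /// Additionally cap the total prompt tokens scheduled in one step.
    CappedPrefill { max_seqs: usize, max_prompt_tokens: usize },
}

/// Returns the indices of waiting sequences to schedule this step.
fn admit_waiting(method: &SchedulerMethod, waiting_prompt_lens: &[usize]) -> Vec<usize> {
    let (max_seqs, token_budget) = match *method {
        SchedulerMethod::FixedSeqs { max_seqs } => (max_seqs, usize::MAX),
        SchedulerMethod::CappedPrefill { max_seqs, max_prompt_tokens } => {
            (max_seqs, max_prompt_tokens)
        }
    };

    let mut scheduled = Vec::new();
    let mut used_tokens: usize = 0;
    for (idx, &len) in waiting_prompt_lens.iter().enumerate() {
        if scheduled.len() == max_seqs || used_tokens.saturating_add(len) > token_budget {
            break;
        }
        used_tokens += len;
        scheduled.push(idx);
    }
    scheduled
}
```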
I ran the benchmark and it is not required. The spike only happens with pp=2048, not with pp=512, c=4.
Is it worth supporting parallel prefill? For me, using GGUF, it only helps when prefilling fewer than 128 tokens. With more than 128 tokens, padding makes it slower than non-parallel prefill. So it is only worth running parallel prefill on a batch of very small sequences. See the benchmarks below.
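For reference, the padding trade-off described above could be expressed as a small heuristic like the following; the 128-token cutoff reflects the benchmark result mentioned here, while the padding-waste threshold is purely illustrative:

```rust
// Hypothetical heuristic only: batched (parallel) prefill pads every prompt in
// the batch up to the longest one, so it only pays off for short prompts of
// similar length. The 128-token cutoff mirrors the benchmark above; the 25%
// padding-waste limit is an arbitrary illustrative choice.
fn use_batched_prefill(prompt_lens: &[usize]) -> bool {
    let Some(&max_len) = prompt_lens.iter().max() else {
        return false;
    };
    let real_tokens: usize = prompt_lens.iter().sum();
    let padded_tokens = max_len * prompt_lens.len();
    max_len < 128 && real_tokens * 4 >= padded_tokens * 3
}
```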
This could work... In the end we need to decide whether we support parallel prefill. It makes batching more complicated because we need to be aware of the cache and set it up per batch, and it is only faster if we're not padding anything.
@EricLBuehler if we remove parallel prefill, then I think the approach of a165b7d might work?
So my current reasoning is that we should move forward with the current approach. It is only worse if we're running the prompt phase of many tiny requests in parallel.
I like this approach. Can you please implement it for the rest of the models?
I cherry-picked the proper commits to #234 and I'll close this.