Batched & chunked prefill #234
Conversation
Code Metrics Report

| Language | Files | Lines | Blanks | Comments | Code  | Complexity |
|----------|-------|-------|--------|----------|-------|------------|
| Rust     | 70    | 23339 | 1550   | 508      | 21281 | 1281       |
| Total    | 70    | 23339 | 1550   | 508      | 21281 | 1281       |

Estimated Cost to Develop: 69,864
Estimated Schedule Effort: 11.811066 months
Estimated People Required: 5.038645
Processed 768517 bytes, 0.769 megabytes (SI)
```rust
f32::NEG_INFINITY
} else {
    (0..u).map(move |j| {
        if j + t + self.sliding_window.unwrap_or(tgt_len + 1) > i + u {
```
I'm not sure the sliding window part is right
```rust
f32::NEG_INFINITY
} else {
    (0..u).map(move |j| {
        if j + t + self.sliding_window > i + u {
```
I'm not sure the sliding window part is right
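Regarding the sliding-window concern above: one way to sanity-check the mask logic is a small, self-contained sketch. This is not the PR's code; `chunk_mask`, its index convention (the chunk's queries sit at absolute positions `t..t+u`, and a key at absolute position `k` is visible when it is causal and within the window), and all names here are assumptions for illustration only.

```rust
// Hypothetical sketch of a chunked causal mask with an optional sliding window.
// Assumptions (not taken from the PR): `t` = tokens already processed before
// this chunk, `u` = chunk length. Query row i has absolute position q = t + i;
// key k is visible iff k <= q (causality) and q - k < window (when set).
fn chunk_mask(t: usize, u: usize, sliding_window: Option<usize>) -> Vec<Vec<f32>> {
    (0..u)
        .map(|i| {
            let q = t + i; // absolute position of this query row
            (0..t + u)
                .map(|k| {
                    let causal = k <= q;
                    // q - k < w, rearranged as q < k + w to avoid usize underflow
                    let in_window = sliding_window.map_or(true, |w| q < k + w);
                    if causal && in_window { 0.0 } else { f32::NEG_INFINITY }
                })
                .collect()
        })
        .collect()
}

fn main() {
    // Chunk of 2 queries starting at offset 3, window of 2: each row attends
    // only to itself and the immediately preceding token.
    let m = chunk_mask(3, 2, Some(2));
    let inf = f32::NEG_INFINITY;
    assert_eq!(m[0], vec![inf, inf, 0.0, 0.0, inf]);
    assert_eq!(m[1], vec![inf, inf, inf, 0.0, 0.0]);
}
```

Printing such a small mask for a few `(t, u, window)` combinations makes it easy to check the boundary conditions the review is questioning.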
```rust
pub struct InputMetadata {
    pub input: Tensor,
    pub positions: Vec<usize>,
    pub positions_kernel: Tensor, // [bs, seq len]
    pub context_lens: Vec<usize>,
}
```
```rust
fn calculate_inputs_prompt_batched(
    seq: &mut Sequence,
    device: &Device,
    chunk_size: usize,
) -> Result<Vec<InputMetadata>> {
```
Is it worth making `InputMetadata` public? Or should this function return `ModelInputs` instead?
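For illustration, here is a simplified sketch of what chunked prefill input preparation can look like, with `Tensor` replaced by plain vectors so the example runs standalone. `ChunkInputs` and its fields are hypothetical stand-ins mirroring the spirit of the PR's `InputMetadata`, not the actual implementation.

```rust
// Hypothetical, simplified sketch of chunked prefill input preparation.
// `Tensor` is replaced by Vec<u32> so the example is self-contained; the
// struct and field names are illustrative, not the PR's real types.
struct ChunkInputs {
    input: Vec<u32>,       // token ids for this chunk
    positions: Vec<usize>, // absolute positions of those tokens
    context_len: usize,    // tokens already in the KV cache before this chunk
}

fn calculate_inputs_prompt_batched(tokens: &[u32], chunk_size: usize) -> Vec<ChunkInputs> {
    tokens
        .chunks(chunk_size)
        .enumerate()
        .map(|(n, chunk)| {
            let start = n * chunk_size;
            ChunkInputs {
                input: chunk.to_vec(),
                positions: (start..start + chunk.len()).collect(),
                context_len: start,
            }
        })
        .collect()
}

fn main() {
    // A 5-token prompt split into chunks of 2 yields 3 prefill steps.
    let chunks = calculate_inputs_prompt_batched(&[10, 11, 12, 13, 14], 2);
    assert_eq!(chunks.len(), 3);
    assert_eq!(chunks[2].positions, vec![4]);
    assert_eq!(chunks[2].context_len, 4);
}
```

The key invariant is that each chunk carries its absolute positions and the length of context already cached, so the model can build the correct attention mask per chunk.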
```rust
fn get_prefill_chunk_size(&self) -> usize {
    512
}
```
Should this be configurable?
This value should be as large as possible for performance, under the constraint that it doesn't OOM the system.
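One hedged sketch of making it configurable, keeping the hard-coded 512 as the default. The `PipelineConfig` field and the `PREFILL_CHUNK_SIZE` environment variable are invented names for illustration, not part of the PR.

```rust
// Sketch of a configurable prefill chunk size with a fallback default.
// All names here (PipelineConfig, PREFILL_CHUNK_SIZE) are hypothetical.
struct PipelineConfig {
    prefill_chunk_size: Option<usize>,
}

const DEFAULT_PREFILL_CHUNK_SIZE: usize = 512;

impl PipelineConfig {
    fn get_prefill_chunk_size(&self) -> usize {
        // Precedence: explicit config > environment variable > default.
        self.prefill_chunk_size
            .or_else(|| {
                std::env::var("PREFILL_CHUNK_SIZE")
                    .ok()
                    .and_then(|v| v.parse().ok())
            })
            .unwrap_or(DEFAULT_PREFILL_CHUNK_SIZE)
    }
}

fn main() {
    let cfg = PipelineConfig { prefill_chunk_size: Some(1024) };
    assert_eq!(cfg.get_prefill_chunk_size(), 1024);
}
```

This keeps the current behavior when nothing is set, while letting users with more (or less) VRAM tune the performance/OOM trade-off mentioned above.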
```diff
@@ -460,16 +460,12 @@ impl Model {
     tgt_len: usize,
     seqlen_offset: usize,
```
This function is already passed the `seqlen_offset`. I think it was already being correctly calculated for the current use-case, but I'm not sure.
```diff
@@ -561,16 +561,12 @@ impl XLoraModel {
     tgt_len: usize,
     seqlen_offset: usize,
```
This function is already passed the `seqlen_offset`. I think it was already being correctly calculated for the current use-case, but I'm not sure.
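To make the offset bookkeeping concrete: during chunked prefill, the `seqlen_offset` for each chunk is simply the number of prompt tokens already processed, and `tgt_len` is that chunk's length. A minimal sketch (the function and its tuple return are illustrative, not taken from the models in question):

```rust
// Illustrative sketch: for each prefill step over a prompt, compute the
// (seqlen_offset, tgt_len) pair the model's mask construction would see.
fn chunk_offsets(prompt_len: usize, chunk_size: usize) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut offset = 0;
    while offset < prompt_len {
        // The final chunk may be shorter than chunk_size.
        let tgt_len = chunk_size.min(prompt_len - offset);
        out.push((offset, tgt_len));
        offset += tgt_len;
    }
    out
}

fn main() {
    // A 5-token prompt with chunk size 2 prefills in three steps.
    assert_eq!(chunk_offsets(5, 2), vec![(0, 2), (2, 2), (4, 1)]);
}
```

If the existing `seqlen_offset` already advances like this per forward pass, the mask construction should not need additional changes, which matches the reviewer's tentative reading.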
@EricLBuehler I'd appreciate help on testing this. It changed a lot of models, some of which I have never used. I did test the quantized llama I usually run a lot, though. Both on
@lucasavila00, absolutely, I'll test the models out.
Continuing #219
Closes #216