Batched & chunked prefill #234
Conversation
Code Metrics Report

| Language | Files | Lines | Blanks | Comments | Code  | Complexity |
|----------|-------|-------|--------|----------|-------|------------|
| Rust     | 70    | 23339 | 1550   | 508      | 21281 | 1281       |
| Total    | 70    | 23339 | 1550   | 508      | 21281 | 1281       |

Estimated Cost to Develop: 69,864
Estimated Schedule Effort: 11.811066 months
Estimated People Required: 5.038645
Processed 768517 bytes, 0.769 megabytes (SI)
```rust
f32::NEG_INFINITY
} else {
    (0..u).map(move |j| {
        if j + t + self.sliding_window.unwrap_or(tgt_len + 1) > i + u {
```
I'm not sure the sliding window part is right
```rust
f32::NEG_INFINITY
} else {
    (0..u).map(move |j| {
        if j + t + self.sliding_window > i + u {
```
I'm not sure the sliding window part is right
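Regarding the sliding-window concern above: one way to sanity-check the mask logic is a small, self-contained sketch. This is not the PR's code; `chunk_mask`, its index convention (the chunk's queries sit at absolute positions `t..t+u`, and a key at absolute position `k` is visible when it is causal and within the window), and all names here are assumptions for illustration only.

```rust
// Hypothetical sketch of a chunked causal mask with an optional sliding window.
// Assumptions (not taken from the PR): `t` = tokens already processed before
// this chunk, `u` = chunk length. Query row i has absolute position q = t + i;
// key k is visible iff k <= q (causality) and q - k < window (when set).
fn chunk_mask(t: usize, u: usize, sliding_window: Option<usize>) -> Vec<Vec<f32>> {
    (0..u)
        .map(|i| {
            let q = t + i; // absolute position of this query row
            (0..t + u)
                .map(|k| {
                    let causal = k <= q;
                    // q - k < w, rearranged as q < k + w to avoid usize underflow
                    let in_window = sliding_window.map_or(true, |w| q < k + w);
                    if causal && in_window { 0.0 } else { f32::NEG_INFINITY }
                })
                .collect()
        })
        .collect()
}

fn main() {
    // Chunk of 2 queries starting at offset 3, window of 2: each row attends
    // only to itself and the immediately preceding token.
    let m = chunk_mask(3, 2, Some(2));
    let inf = f32::NEG_INFINITY;
    assert_eq!(m[0], vec![inf, inf, 0.0, 0.0, inf]);
    assert_eq!(m[1], vec![inf, inf, inf, 0.0, 0.0]);
}
```

Printing such a small mask for a few `(t, u, window)` combinations makes it easy to check the boundary conditions the review is questioning.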
```rust
pub struct InputMetadata {
    pub input: Tensor,
    pub positions: Vec<usize>,
    pub positions_kernel: Tensor, // [bs, seq len]
    pub context_lens: Vec<usize>,
}
```
```rust
fn calculate_inputs_prompt_batched(
    seq: &mut Sequence,
    device: &Device,
    chunk_size: usize,
) -> Result<Vec<InputMetadata>> {
```
Is it worth making `InputMetadata` public? Or should this function return `ModelInputs` instead?
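For illustration, here is a simplified sketch of what chunked prefill input preparation can look like, with `Tensor` replaced by plain vectors so the example runs standalone. `ChunkInputs` and its fields are hypothetical stand-ins mirroring the spirit of the PR's `InputMetadata`, not the actual implementation.

```rust
// Hypothetical, simplified sketch of chunked prefill input preparation.
// `Tensor` is replaced by Vec<u32> so the example is self-contained; the
// struct and field names are illustrative, not the PR's real types.
struct ChunkInputs {
    input: Vec<u32>,       // token ids for this chunk
    positions: Vec<usize>, // absolute positions of those tokens
    context_len: usize,    // tokens already in the KV cache before this chunk
}

fn calculate_inputs_prompt_batched(tokens: &[u32], chunk_size: usize) -> Vec<ChunkInputs> {
    tokens
        .chunks(chunk_size)
        .enumerate()
        .map(|(n, chunk)| {
            let start = n * chunk_size;
            ChunkInputs {
                input: chunk.to_vec(),
                positions: (start..start + chunk.len()).collect(),
                context_len: start,
            }
        })
        .collect()
}

fn main() {
    // A 5-token prompt split into chunks of 2 yields 3 prefill steps.
    let chunks = calculate_inputs_prompt_batched(&[10, 11, 12, 13, 14], 2);
    assert_eq!(chunks.len(), 3);
    assert_eq!(chunks[2].positions, vec![4]);
    assert_eq!(chunks[2].context_len, 4);
}
```

The key invariant is that each chunk carries its absolute positions and the length of context already cached, so the model can build the correct attention mask per chunk.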
```rust
fn get_prefill_chunk_size(&self) -> usize {
    512
}
```
Should this be configurable?
This value should be as large as possible for performance, under the constraint that it doesn't OOM the system.
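One hedged sketch of making it configurable, keeping the hard-coded 512 as the default. The `PipelineConfig` field and the `PREFILL_CHUNK_SIZE` environment variable are invented names for illustration, not part of the PR.

```rust
// Sketch of a configurable prefill chunk size with a fallback default.
// All names here (PipelineConfig, PREFILL_CHUNK_SIZE) are hypothetical.
struct PipelineConfig {
    prefill_chunk_size: Option<usize>,
}

const DEFAULT_PREFILL_CHUNK_SIZE: usize = 512;

impl PipelineConfig {
    fn get_prefill_chunk_size(&self) -> usize {
        // Precedence: explicit config > environment variable > default.
        self.prefill_chunk_size
            .or_else(|| {
                std::env::var("PREFILL_CHUNK_SIZE")
                    .ok()
                    .and_then(|v| v.parse().ok())
            })
            .unwrap_or(DEFAULT_PREFILL_CHUNK_SIZE)
    }
}

fn main() {
    let cfg = PipelineConfig { prefill_chunk_size: Some(1024) };
    assert_eq!(cfg.get_prefill_chunk_size(), 1024);
}
```

This keeps the current behavior when nothing is set, while letting users with more (or less) VRAM tune the performance/OOM trade-off mentioned above.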
```diff
@@ -460,16 +460,12 @@ impl Model {
     tgt_len: usize,
     seqlen_offset: usize,
```
This function is already passed the `seqlen_offset`. I think it was already being correctly calculated for the current use-case, but I'm not sure.
```diff
@@ -561,16 +561,12 @@ impl XLoraModel {
     tgt_len: usize,
     seqlen_offset: usize,
```
This function is already passed the `seqlen_offset`. I think it was already being correctly calculated for the current use-case, but I'm not sure.
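To make the offset bookkeeping concrete: during chunked prefill, the `seqlen_offset` for each chunk is simply the number of prompt tokens already processed, and `tgt_len` is that chunk's length. A minimal sketch (the function and its tuple return are illustrative, not taken from the models in question):

```rust
// Illustrative sketch: for each prefill step over a prompt, compute the
// (seqlen_offset, tgt_len) pair the model's mask construction would see.
fn chunk_offsets(prompt_len: usize, chunk_size: usize) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut offset = 0;
    while offset < prompt_len {
        // The final chunk may be shorter than chunk_size.
        let tgt_len = chunk_size.min(prompt_len - offset);
        out.push((offset, tgt_len));
        offset += tgt_len;
    }
    out
}

fn main() {
    // A 5-token prompt with chunk size 2 prefills in three steps.
    assert_eq!(chunk_offsets(5, 2), vec![(0, 2), (2, 2), (4, 1)]);
}
```

If the existing `seqlen_offset` already advances like this per forward pass, the mask construction should not need additional changes, which matches the reviewer's tentative reading.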
@EricLBuehler I'd appreciate help on testing this. It changed a lot of models, some of which I have never used. I did test the quantized llama I usually run a lot, though. Both on
@lucasavila00, absolutely, I'll test the models out.
Continuing #219
Closes #216