async/parallel speculative execution #6853
Replies: 3 comments 9 replies
-
Interesting experiment. I always thought that the memory bandwidth available on the M chips is shared between the CPU and the GPU, so I figured there would be contention if we try to compute in parallel. Can you describe in more detail the reconciliation process between the draft and target sequences? How does it differ from standard speculative sampling?

You can run tests with:

```shell
make -j speculative && time ./speculative \
  -m ./models/llama-70b-v3-instruct/ggml-model-q8_0.gguf \
  -md ./models/llama-8b-v3-instruct/ggml-model-q5_k.gguf \
  -f in.txt -e -ngl 99 -n 4096 -c 4096 -s 20 -np 1 --draft 12 --color -s 1 --temp 0.0 -n 1024
```

You can vary the parameters. The main issue with speculative approaches on Mac currently is the inefficient quantized batched matrix multiplications, as we discussed in #6777.
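For reference, at `--temp 0.0` the standard speculative-sampling reconciliation degenerates to exact-match checking: accept the longest prefix of the draft that agrees with the target model's greedy tokens, then take one corrected token from the target. A minimal sketch (toy token IDs, not a real model API):

```python
# Toy sketch of greedy (temp 0.0) speculative-decoding reconciliation.
# draft_tokens / target_tokens are hypothetical stand-ins for the tokens
# produced by the small draft model and the large target model.
def reconcile_greedy(draft_tokens, target_tokens):
    """Accept the longest matching prefix of the draft, then take one
    corrected token from the target model."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)   # target's token replaces the first mismatch
            break
    else:
        # every draft token matched; the target contributes one extra token
        accepted.append(target_tokens[len(draft_tokens)])
    return accepted

draft  = [5, 9, 2, 7]
target = [5, 9, 3, 1, 4]   # target disagrees at position 2
print(reconcile_greedy(draft, target))  # -> [5, 9, 3]
```

At nonzero temperature the accept/reject step instead compares draft and target probabilities per token, but the prefix-acceptance structure is the same.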
-
This is awesome, thank you!
-
Ok, cool, let me try and see if I can make it work with some good tree-based speculation + the current llama-70B. I don't have a second powerful machine to distribute the main model though, only one M2 Ultra and an M2 laptop. Once I figure that out, I will update here.
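For anyone following along: tree-based speculation generalizes a linear draft to a tree of candidate continuations, so the target model can verify several competing branches in one batch. A minimal sketch of the data structure (hypothetical, not llama.cpp's actual API; parent indices stand in for the attention-mask bookkeeping):

```python
# Minimal token-tree sketch for tree-based speculation: each node is a
# candidate token; flattening the tree yields one batch the target model
# could verify at once, with parent indices recording the tree structure.
class Node:
    def __init__(self, token):
        self.token = token
        self.children = []

def flatten(root):
    """BFS-flatten the draft tree into (token, parent_index) pairs."""
    out, queue = [], [(root, -1)]
    while queue:
        node, parent = queue.pop(0)
        idx = len(out)
        out.append((node.token, parent))
        for child in node.children:
            queue.append((child, idx))
    return out

root = Node(10)
a, b = Node(11), Node(12)     # two competing continuations of token 10
root.children = [a, b]
a.children = [Node(13)]
print(flatten(root))  # -> [(10, -1), (11, 0), (12, 0), (13, 1)]
```

After verification, the accepted path is the deepest root-to-leaf chain whose tokens the target model agrees with.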
-
Was there an attempt to run the draft model in parallel with the main model on a different compute device of the same machine?
Here's a small illustration I made: c8d446d
Experiment setup/observations:
- Observe ~83 s to process the prompt and produce the output.
- Observe ~64 s to process the same prompt and produce the same output. Not dramatic, but fairly noticeable.
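The idea behind the speedup is to overlap the draft and target forward passes on two devices instead of serializing them. A toy sketch (the sleeps are hypothetical stand-ins for real forward passes, e.g. draft on CPU, target on GPU): wall time approaches max(draft, target) rather than their sum.

```python
import concurrent.futures
import time

# Hypothetical illustration of overlapping draft and target computation
# on two devices of the same machine. The sleep durations are made up;
# the point is that the two steps run concurrently.
def draft_step():
    time.sleep(0.05)   # pretend: small model drafts a few tokens
    return [1, 2, 3]

def target_step():
    time.sleep(0.10)   # pretend: big model verifies the previous draft
    return [1, 2, 4]

start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    f_draft = pool.submit(draft_step)
    f_target = pool.submit(target_step)
    draft, target = f_draft.result(), f_target.result()
elapsed = time.time() - start
print(f"overlapped in ~{elapsed:.2f}s")  # ~0.10s rather than 0.15s
```

On Apple Silicon the caveat from the first comment still applies: CPU and GPU share memory bandwidth, so the overlap is not free.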