Add some "senseful" fallbacks for isq
#272
Conversation
Thanks for adding this, I think it looks great. My only concern is when we skip a tensor: it looks like we upcast to F32. This could increase memory usage, even though we already have the unquantized tensor that we could return. Do you think it would be a good idea to just return that and avoid the cast?
```rust
QuantizationBehaviour::Skip => {
    let shape = t.shape();
    warn!("Skipping quantization of tensor with shape {shape:?} as it is not quantizable.");
    QMatMul::QTensor(Arc::new(QTensor::quantize(&t, GgmlDType::F32).unwrap()))
}
```
In the case of skipping the tensor, can we avoid converting it to F32 by just returning `tensor`? For example, if you are loading on CUDA, the tensors are loaded into (CPU) memory in BF16, so converting to F32 may incur a performance and memory cost.
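Something along these lines, perhaps (an untested sketch of the suggestion; it assumes `QMatMul::Tensor` is an acceptable return here, which is discussed below):

```rust
QuantizationBehaviour::Skip => {
    let shape = t.shape();
    warn!("Skipping quantization of tensor with shape {shape:?} as it is not quantizable.");
    // Keep the tensor in its original dtype (e.g. BF16) instead of
    // round-tripping through QTensor::quantize with GgmlDType::F32.
    QMatMul::Tensor(t)
}
```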
Yeah, if we can keep the tensors in their original dtype, that would be great. Can we just return a normal `Tensor` here, or does it need to be a `QMatMul::QTensor`? And is it safe to keep the original dtype, or do we risk mismatched dtypes later on? (I thought the ggml matmul always returns F32/F16 now) 🤔
Ah, it looks like `QMatMul` also simply accepts a `Tensor` or `TensorF16` variant. Should we return the `TensorF16` if we run on a GPU, or should that be handled elsewhere?
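For illustration, the device-based choice could look something like this (hypothetical sketch using candle's `Device::is_cuda` and `Tensor::to_dtype`; whether the F16 cast belongs in this match arm is exactly the open question):

```rust
QuantizationBehaviour::Skip => {
    let shape = t.shape();
    warn!("Skipping quantization of tensor with shape {shape:?} as it is not quantizable.");
    if t.device().is_cuda() {
        // On GPU, prefer the half-precision matmul path.
        QMatMul::TensorF16(t.to_dtype(DType::F16).unwrap())
    } else {
        QMatMul::Tensor(t)
    }
}
```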
Oh, you're right, I overlooked that. The ggml matmul always returns F32/F16 now, so we'd probably get a dtype error if some other tensors had ISQ applied. I think in that case it's better to pay the cost upfront in total memory usage rather than cast at runtime. What do you think?
Note: when we support matmul via F16, we should update this to match.
Thank you!
Since the `k-quants` can be quite strict with their block size of 256, we can simply fall back to a similar "normal" quantization level, and if that doesn't work we should just skip the tensor and print a warning.
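A sketch of what that fallback chain could look like (the helper name, the `Quantize`/`Skip` variants, and the exact k-quant → non-k mappings are illustrative, not necessarily the final implementation):

```rust
use candle_core::{quantized::GgmlDType, Tensor};

/// Illustrative fallback: k-quants need the last dim to be a multiple of
/// their 256-element block size; if it isn't, try a "normal" quant with a
/// comparable bit width (block size 32); if even that fails, skip.
fn choose_quantization(t: &Tensor, dtype: GgmlDType) -> QuantizationBehaviour {
    let last_dim = *t.dims().last().unwrap();
    if last_dim % dtype.block_size() == 0 {
        return QuantizationBehaviour::Quantize(dtype);
    }
    let fallback = match dtype {
        GgmlDType::Q2K | GgmlDType::Q3K | GgmlDType::Q4K => GgmlDType::Q4_0,
        GgmlDType::Q5K => GgmlDType::Q5_0,
        GgmlDType::Q6K | GgmlDType::Q8K => GgmlDType::Q8_0,
        // Non-k dtypes have no smaller fallback: skip and warn at the call site.
        _ => return QuantizationBehaviour::Skip,
    };
    if last_dim % fallback.block_size() == 0 {
        QuantizationBehaviour::Quantize(fallback)
    } else {
        QuantizationBehaviour::Skip
    }
}
```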