Add some "senseful" fallbacks for isq
#272
Conversation
Thanks for adding this, I think it looks great. My only concern is when we skip a tensor: it looks like we upcast to F32. This could increase memory usage, even though we already have the unquantized tensor that we could return. Do you think it would be a good idea to just return that and avoid the cast?
```rust
QuantizationBehaviour::Skip => {
    let shape = t.shape();
    warn!("Skipping quantization of tensor with shape {shape:?} as it is not quantizable.");
    QMatMul::QTensor(Arc::new(QTensor::quantize(&t, GgmlDType::F32).unwrap()))
}
```
In the case of skipping the tensor, can we avoid converting it to F32 by just returning `tensor`? For example, if you are loading on CUDA, the tensors are loaded into (CPU) memory in BF16, so converting to F32 may incur a performance and memory cost.
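Something along these lines, perhaps (an untested sketch of the suggestion; it assumes `QMatMul::Tensor` is an acceptable return here, which is discussed below):

```rust
QuantizationBehaviour::Skip => {
    let shape = t.shape();
    warn!("Skipping quantization of tensor with shape {shape:?} as it is not quantizable.");
    // Keep the tensor in its original dtype (e.g. BF16) instead of
    // round-tripping through QTensor::quantize with GgmlDType::F32.
    QMatMul::Tensor(t)
}
```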
Yeah, if we can keep the tensors in their original dtype, that would be great. Can we just return a normal `Tensor` here, or does it need to be a `QMatMul::QTensor`? And is it safe to keep the original dtype, or do we risk mismatched dtypes later on? (I thought the ggml matmul always returns F32/F16 now) 🤔
Ah, it looks like `QMatMul` also simply accepts a `Tensor` or `TensorF16` variant. Should we return the `TensorF16` if we run on a GPU, or should that be handled elsewhere?
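For illustration, the device-based choice could look something like this (hypothetical sketch using candle's `Device::is_cuda` and `Tensor::to_dtype`; whether the F16 cast belongs in this match arm is exactly the open question):

```rust
QuantizationBehaviour::Skip => {
    let shape = t.shape();
    warn!("Skipping quantization of tensor with shape {shape:?} as it is not quantizable.");
    if t.device().is_cuda() {
        // On GPU, prefer the half-precision matmul path.
        QMatMul::TensorF16(t.to_dtype(DType::F16).unwrap())
    } else {
        QMatMul::Tensor(t)
    }
}
```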
Oh, you're right, I overlooked that. The ggml matmul always returns F32/F16 now, so we'd probably get a dtype error if some other tensors had ISQ applied. I think in that case it's better to pay the cost upfront in total memory usage rather than cast at runtime. What do you think?
Note: when we support matmul via F16, we should update this to match.
Thank you!
Since the `k-quants` can be quite strict with their block size of 256, we can simply fall back to a similar "normal" quantization level, and if that doesn't work we should just skip the tensor and print a warning.
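A sketch of what that fallback chain could look like (the helper name, the `Quantize`/`Skip` variants, and the exact k-quant → non-k mappings are illustrative, not necessarily the final implementation):

```rust
use candle_core::{quantized::GgmlDType, Tensor};

/// Illustrative fallback: k-quants need the last dim to be a multiple of
/// their 256-element block size; if it isn't, try a "normal" quant with a
/// comparable bit width (block size 32); if even that fails, skip.
fn choose_quantization(t: &Tensor, dtype: GgmlDType) -> QuantizationBehaviour {
    let last_dim = *t.dims().last().unwrap();
    if last_dim % dtype.block_size() == 0 {
        return QuantizationBehaviour::Quantize(dtype);
    }
    let fallback = match dtype {
        GgmlDType::Q2K | GgmlDType::Q3K | GgmlDType::Q4K => GgmlDType::Q4_0,
        GgmlDType::Q5K => GgmlDType::Q5_0,
        GgmlDType::Q6K | GgmlDType::Q8K => GgmlDType::Q8_0,
        // Non-k dtypes have no smaller fallback: skip and warn at the call site.
        _ => return QuantizationBehaviour::Skip,
    };
    if last_dim % fallback.block_size() == 0 {
        QuantizationBehaviour::Quantize(fallback)
    } else {
        QuantizationBehaviour::Skip
    }
}
```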