Add some "senseful" fallbacks for isq #272

Merged · 1 commit merged into EricLBuehler:master on May 11, 2024

Conversation

@LLukas22 (Contributor) commented on May 8, 2024

Since the k-quants can be quite strict with their block size of 256, we can simply fall back to a similar "normal" quantization level, and if that doesn't work, we should just skip the tensor and print a warning.
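For illustration, here is a rough sketch of that fallback idea. The `QuantizationBehaviour` name follows the snippet quoted in the review below; the helper name, the specific fallback pairs, and the use of candle's `GgmlDType::block_size()` are assumptions made for the sketch, not necessarily what this PR implements.

```rust
use candle_core::quantized::GgmlDType;

/// What to do with a tensor during in-situ quantization (sketch).
enum QuantizationBehaviour {
    /// Quantize with this (possibly fallback) dtype.
    Quantize(GgmlDType),
    /// Nothing fits; leave the tensor unquantized and warn.
    Skip,
}

/// Pick a behaviour for a tensor whose last dimension is `last_dim`:
/// use the requested dtype if the dimension is a multiple of its block size
/// (256 for the k-quants), otherwise try a "similar" non-k quantization
/// (block size 32), and skip if even that does not divide evenly.
fn get_quantization_behaviour(last_dim: usize, requested: GgmlDType) -> QuantizationBehaviour {
    if last_dim % requested.block_size() == 0 {
        return QuantizationBehaviour::Quantize(requested);
    }
    // Illustrative k-quant -> non-k fallbacks; the real mapping may differ.
    let fallback = match requested {
        GgmlDType::Q4K => Some(GgmlDType::Q4_0),
        GgmlDType::Q5K => Some(GgmlDType::Q5_0),
        GgmlDType::Q6K | GgmlDType::Q8K => Some(GgmlDType::Q8_0),
        _ => None,
    };
    match fallback {
        Some(dtype) if last_dim % dtype.block_size() == 0 => {
            QuantizationBehaviour::Quantize(dtype)
        }
        _ => QuantizationBehaviour::Skip,
    }
}
```

The `Skip` case is then handled as in the snippet quoted in the review below.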

@EricLBuehler added the new feature (New feature or request) and backend (Backend work) labels on May 8, 2024
@EricLBuehler (Owner) left a comment

Thanks for adding this, I think it looks great. My only concern is when we skip a tensor: it looks like we upcast to F32. This could increase memory usage, even though we already have the unquantized tensor that we could return. Do you think it would be a good idea to just return that and avoid the cast?

```rust
QuantizationBehaviour::Skip => {
    let shape = t.shape();
    warn!("Skipping quantization of tensor with shape {shape:?} as it is not quantizable.");
    // Upcast the skipped tensor to F32 by "quantizing" it to the F32 ggml dtype.
    QMatMul::QTensor(Arc::new(QTensor::quantize(&t, GgmlDType::F32).unwrap()))
}
```
@EricLBuehler (Owner) commented:

In the case of skipping the tensor, can we avoid converting it to F32 by just returning the tensor? For example, if you are loading on CUDA, the tensors are loaded into (CPU) memory in BF16, so converting to F32 may incur a performance and memory cost.

@LLukas22 (Contributor, Author) commented:

Yeah, if we can keep the tensors in their original dtype, that would be great. Can we just return a normal Tensor here, or does it need to be a QMatMul::QTensor?

And is it safe to keep the original dtype, or do we risk mismatched dtypes later on? (I thought the ggml matmul always returns an f32/f16 now) 🤔

@LLukas22 (Contributor, Author) commented:

Ah, it looks like QMatMul also simply accepts a Tensor or TensorF16 variant. Should we return the TensorF16 if we run on a GPU, or should that be handled elsewhere?
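For context, the alternative being discussed (returning the unquantized tensor instead of casting) could look roughly like this. It is only a sketch: it assumes candle's `QMatMul::Tensor` / `QMatMul::TensorF16` variants mentioned above, and the `skip_without_cast` name is invented for illustration.

```rust
use candle_core::{DType, Tensor};
use candle_core::quantized::QMatMul;

/// Sketch: wrap the original, unquantized tensor directly rather than
/// round-tripping it through QTensor::quantize(.., GgmlDType::F32).
/// As discussed in the following comment, this risks dtype mismatches with
/// layers whose quantized matmul outputs f32/f16, which is why the PR keeps
/// the F32 upcast instead.
fn skip_without_cast(t: Tensor) -> QMatMul {
    match t.dtype() {
        DType::F16 => QMatMul::TensorF16(t),
        _ => QMatMul::Tensor(t),
    }
}
```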

@EricLBuehler (Owner) commented:

Oh, you're right, I overlooked that. The ggml matmul always returns f32/f16 now, so we'd probably get a dtype error if some other tensors had ISQ applied. I think in that case it's better to pay the cost upfront in total memory usage rather than cast at runtime. What do you think?

@EricLBuehler (Owner) left a comment


Note: when we support matmul via F16, we should update this to match.

@EricLBuehler merged commit 62e4402 into EricLBuehler:master on May 11, 2024 (7 of 11 checks passed).
@EricLBuehler (Owner) commented:
Thank you!
