Understanding the Quantization Support in llama.cpp #6997
Asked by lalith1403 in Q&A · Unanswered
While running FP32 and INT8-quantized models through the oneAPI stack on CPUs, we observe SGEMM kernels being called from MKL. The number of SGEMM calls and the per-kernel timings are comparable in both cases, with the quantized model doing only slightly better than the FP32 model. In such a scenario, where do the advantages of quantized models actually kick in?
PS: Logs attached for reference!
fp32_mkl_logs.txt
q8_mkl_logs.txt
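For context on what an INT8 quantized model actually stores, here is a minimal sketch of block-wise INT8 quantization in the spirit of ggml's Q8_0 format (32 values per block, one scale). The block size of 32 matches ggml, but the struct and function names are illustrative, not llama.cpp's actual API, and the scale is kept as FP32 here for simplicity where ggml uses FP16:

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define QK 32  /* block size used by ggml's Q8_0 format */

/* Illustrative block layout: one scale + 32 signed 8-bit quants. */
typedef struct {
    float  d;        /* scale: d = max(|x|) / 127 */
    int8_t qs[QK];   /* quantized values: x[i] ~ d * qs[i] */
} block_q8;

/* Quantize one block of 32 floats. */
static void quantize_block(const float *x, block_q8 *b) {
    float amax = 0.0f;
    for (int i = 0; i < QK; i++) {
        float a = fabsf(x[i]);
        if (a > amax) amax = a;
    }
    b->d = amax / 127.0f;
    const float id = b->d != 0.0f ? 1.0f / b->d : 0.0f;
    for (int i = 0; i < QK; i++) {
        b->qs[i] = (int8_t) roundf(x[i] * id);
    }
}

/* Dequantize back to FP32 -- the kind of expansion a BLAS-enabled
 * build performs before handing the data to an FP32 SGEMM. */
static void dequantize_block(const block_q8 *b, float *x) {
    for (int i = 0; i < QK; i++) {
        x[i] = b->d * (float) b->qs[i];
    }
}

int main(void) {
    float x[QK], y[QK];
    for (int i = 0; i < QK; i++) x[i] = sinf((float) i);  /* sample data */

    block_q8 b;
    quantize_block(x, &b);
    dequantize_block(&b, y);

    /* Storage per block: 32 floats (128 B) vs. 4 B scale + 32 B quants
     * (36 B), roughly a 3.5x reduction -- the main win is memory size
     * and bandwidth, not arithmetic. */
    printf("x[1]=%f  roundtrip=%f  scale=%f\n", x[1], y[1], b.d);
    return 0;
}
```

Under this scheme, if the compute path dequantizes weights to FP32 before calling SGEMM, the GEMM work itself is essentially unchanged, which would be consistent with the comparable kernel timings in the logs above; the quantized model's advantage would then lie mainly in model size and memory bandwidth rather than in the GEMM kernels.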