PoC implementation of SmoothQuant #855
Conversation
We need to merge that one first before this. Also, can the TVM-side change be sent to the unity branch?
Got it! I remember that ~3 months ago the Unity branch did not work properly with mlc-llm, as opposed to mlc-relax. I will switch to Unity.
# Run Decoder and update statistics for activations/weights
for _ in range(config["decoder_invoke_num"]):
    # TODO: support softmax with temperature.
Does temperature affect calibration?
The motivation for this "TODO" comment was to align this pipeline with what we have in the mlc_chat app.
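For context, a minimal sketch of what "softmax with temperature" means here (standard technique; the function name is hypothetical, not the PR's code):

```python
import numpy as np

# Logits are divided by the temperature before the softmax, so T > 1
# flattens the distribution and T < 1 sharpens it.
def softmax_with_temperature(logits: np.ndarray, temperature: float = 0.7) -> np.ndarray:
    scaled = logits / temperature
    scaled = scaled - scaled.max()  # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()
```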
mlc_llm/core.py
Outdated
    mod = smoothquant(args, mod, model_names)
    utils.debug_dump_script(mod, "mod_smoothquant.py", args)
else:
    mod = param_manager.transform_dequantize(mod)
Might need a discussion if we are adding a separate quantization path that doesn't go through the param manager cc @MasterJH5574
Thanks, @masahi, for pointing this out. This is a rather important question. It will be a little tricky, but I am waiting for other team members' opinions.
mlc_llm/transform/smoothquant.py
Outdated
max_value = tvm.tir.max_value(out_dtype)
dq_scale = R.multiply(R.astype(scale1, "float32"), R.astype(scale2, "float32"))
out = R.multiply(R.astype(call, dtype="float32"), dq_scale)
return R.astype(R.clip(out, min_value, max_value), dtype=out_dtype)
This manual quantize and dequantize should be replaced by Relax "QNN dialect" in the near future.
Strongly agree! In my opinion, a QNN dialect is very desirable in Relax.
Switched to the TVM/unity branch. PR#15686
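For reference, a rough numpy mirror of what the manual dequantize quoted above computes (an illustrative sketch assuming per-tensor float scales; names are hypothetical):

```python
import numpy as np

# Multiply the matmul result by the product of the two scales, then clip
# to the output dtype's representable range, mirroring the Relax snippet.
def dequantize(call_out, scale1, scale2, out_dtype="float16"):
    dq_scale = np.float32(scale1) * np.float32(scale2)
    out = call_out.astype("float32") * dq_scale
    info = np.finfo(out_dtype)
    return np.clip(out, info.min, info.max).astype(out_dtype)
```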
mlc_llm/transform/smoothquant.py
Outdated
        return call

def make_scale_param(shape: relax.ShapeExpr, dtype: str) -> tvm.relax.Var:
    n = 1 if self.mode == "quantize" else shape[-1]
Document what "mode" is and what shape[-1] means here.
stat_mod = mlc_llm.transform.SmoothQuantStatCollector()(mod)
stat_mod = mlc_llm.transform.FuseTransposeMatmul()(stat_mod)

prefill, decode, kvc, _, _ = get_runtime_func(funcs, stat_mod)
Is it possible to make the pipeline operate on an arbitrary module, rather than requiring the whole prefill / decode modules as input?
For example, for testing purposes I might want to quantize only one matmul op.
Very good question! I've been thinking about it too...
At first glance, each task (LLM, CV task, etc.) has its own pipeline. For example, the pipeline for an LLM includes one run of the encoder plus several runs of the decoder, while the pipeline for a computer vision task (ResNet, Inception, etc.) includes a single run of the "main" function. Unit tests can have a different set of Relax functions. (See the sketch of the LLM case below.)
I thought about providing a "simple" API so the user could configure their own pipeline and run it, but I don't have such a solution yet. So for me it is still an open question how to make the pipeline generic.
As for your matmul-only test example: earlier in this place I had code that checked for the presence of a "main" function in the IRModule, and if it was present I called a separate pipeline (for unit tests). But this approach is ugly and not generic.
The absence of such a "generic" pipeline is the main reason why I decided to make this PR against mlc-llm, not TVM/unity.
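To make the pipeline shape concrete, a minimal sketch of the LLM case described above (all names hypothetical; `prefill` and `decode` stand for the compiled entry functions):

```python
import numpy as np

# One prefill run over the prompt, then several decode steps; activation
# statistics are collected as a side effect of running the instrumented module.
def run_llm_pipeline(prefill, decode, prompt_tokens, decoder_invoke_num):
    logits = prefill(prompt_tokens)          # single prefill (encoder) run
    for _ in range(decoder_invoke_num):      # several decoder runs
        next_token = int(np.argmax(logits))  # greedy pick; no temperature yet
        logits = decode(next_token)
```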
assert args.build_model_only is False, "build_model_only=True is not supported in SMQ"
params = load_params(args.artifact_path, device=smq_device)

dataset, stop_tokens = _get_dummy_dataset(args.artifact_path, device=smq_device)
Related to the earlier comment, I think it is important to decouple this function from operating on the whole model. Meaning params can be a list with a single param in the minimal case, and dataset can be any list of vectors (not necessarily tokenized IDs).
    dataset: List[tvm.nd.NDArray],
    config: Dict[str, Any],
):
    mod = mlc_llm.transform.SmoothQuantAnnotator("quantize")(mod)
The input mod already has the smooth op, right? Why do we need to annotate again?
Yes, you are right, we do have smooth ops here. But these smooth ops operate like a multiply op for activation smoothing. Here we annotate the IRModule with new smooth ops that will be converted into an analogue of the qnn.quantize op (it should be qnn.quantize, but this op is currently absent, which is why I use the smooth op with the attribute mode="quantize").
I see. Rather than having a chain of smooth ops that do different things, I hope we can replace the first smooth with multiply and the second one with quantize.
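A toy numpy view of the proposed chain (hypothetical names; per-tensor symmetric int8 assumed):

```python
import numpy as np

# Smoothing becomes a plain element-wise multiply, followed by an explicit
# quantize, instead of two `smooth` ops with different modes.
def smooth_then_quantize(act, smooth_scale, quant_scale):
    smoothed = act * smooth_scale                # the "multiply" step
    q = np.round(smoothed / quant_scale)         # the "quantize" step
    return np.clip(q, -128, 127).astype("int8")
```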
scale_params = _calculate_quant_scale_params(fname, stat, config, tvm.cpu(0))
mod = relax.transform.BindParams(fname, scale_params)(mod)

mod = mlc_llm.transform.SmoothQuantOpConverter("quantize")(mod)
Is this line necessary, given that L177 already annotates with mode = "quantize"?
On L177 we annotate with the mode = "identity" attribute. This pass just changes the attribute from "identity" --> "quantize". By design, it should convert the smooth op --> qnn.quantize.
Oh I see, L177 is applying a different pass, SmoothQuantAnnotator...
w_scale = make_scale_param(weights.struct_info.shape, weights.struct_info.dtype)
lhs = R.smooth(act, a_scale, kind=1, mode="identity")
rhs = R.smooth(weights, w_scale, kind=2, mode="identity")
return R.linear(lhs, rhs)
Since this pass is always followed by SmoothQuantStatCollector, which replaces this dummy smooth with another dummy op, why not add absmax here instead of adding the first dummy smooth?
I think that will let us get rid of the need for the identity mode, which in turn might also remove the need for R.smooth altogether (instead we can just use R.multiply).
Hmm... I need to think about it, but I got your point and I like your idea to remove R.smooth!
Also, I use the R.smooth op so I can lower it to qnn.quantize during calibration. Instead, I can add a qnn.quantize op directly; then we can remove R.smooth, if I am not mistaken. What do you think about it?
Yeah, rather than adding R.smooth, I think we should just go ahead and add quantize / dequantize. We don't need a full-blown "QNN dialect", we just need these two ops.
mlc_llm/transform/smoothquant.py
Outdated
    R.smooth(m_smq2.args[0], relax.Constant(w_scale), kind=2, mode="identity")
)
a_out = self.builder_.emit(R.absmax(a_smq, kind=1), "a_out")
w_out = self.builder_.emit(R.absmax(w_smq, kind=2), "w_out")
Do we need R.absmax at all, given that it is only used here and it is just legalized anyway?
I will answer this question in the following comment.
mlc_llm/transform/smoothquant.py
Outdated
    R.smooth(m_smq2.args[0], relax.Constant(w_scale), kind=2, mode="identity")
)
a_out = self.builder_.emit(R.absmax(a_smq, kind=1), "a_out")
w_out = self.builder_.emit(R.absmax(w_smq, kind=2), "w_out")
I'm seeing many max applied on the Python side in smoothquant_utils.py as well.
- Can you document the purpose of each max where it is used?
- If we want to support per-channel or per-token quantization, which max needs to be modified?
A few words about R.absmax:
- It was added for optimization purposes only. I need to dump tensors for statistics collection, but for large models this memory footprint is huge! That's why I decided to introduce this op and dump already-preprocessed tensors.
- Currently this op operates like R.abs() --> R.max() --> R.squeeze(). I could use this sequence of ops instead of the new one.
- About per-channel or per-token quantization: yes, you are right, in that case it does not fit. We would need to modify this op or add a new one.

For the current quantization scheme (per-tensor) we can avoid using R.absmax and use R.abs() --> R.max() --> R.squeeze() instead. For per-channel or per-token quantization we need extra ops for preprocessing.
In my opinion, dumping whole tensors is not a very good idea and we need to preprocess them before dumping (to optimize the memory footprint).
If we are dumping max from cpp, why do we need np.max on the Python side?
And can you tell me why having R.absmax helps memory for per-channel or per-token cases, compared to a chain of plain ops?
About np.max:
I cannot get rid of it, because it is used to merge calibration statistics from different elements of the dataset.
For example: we have 3 prompts in our dataset - prompt1, prompt2, prompt3. For each prompt we have its own statistics: stat1, stat2, stat3. To calculate the final smoothing/quantization parameters I need to merge these statistics (stat1, stat2, stat3 --> final_stat) and work with this joined information. np.max is used during this "merge".
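A minimal sketch of this merge, assuming each prompt yields an abs-max vector of the same shape (toy data, hypothetical names):

```python
import numpy as np

# Each stat is one abs-max vector per prompt (e.g. produced by R.absmax);
# an element-wise np.max reduces them into the final statistic.
stat1 = np.array([10.0, 20.0, 30.0])
stat2 = np.array([12.0, 18.0, 25.0])
stat3 = np.array([9.0, 22.0, 28.0])

final_stat = np.max(np.stack([stat1, stat2, stat3]), axis=0)
print(final_stat)  # [12. 22. 30.]
```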
About R.absmax:
R.absmax should be identical to the chain of plain ops; it does not reduce the memory footprint compared to that chain (R.abs, R.max, R.squeeze).
But R.absmax itself helps to reduce memory in the following way. Example: suppose the R.linear op has the following input data (activations, a 2D tensor):
[[10, 1, 1],
 [1, 20, 1],
 [1, 1, 30]]
I do not save the whole tensor; after R.absmax I get a 1D tensor with 3 elements: [10, 20, 30], and I dump only this 1D vector. As a result, the memory reduction is 3x (9 vs. 3 elements).
This toy example demonstrates how R.absmax can help reduce memory consumption.
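In numpy terms, a sketch of this reduction (assuming the last axis is the one reduced over):

```python
import numpy as np

# Equivalent of the R.abs -> R.max -> R.squeeze chain that R.absmax fuses:
# the 2D activation tensor is reduced to a small abs-max vector before dumping.
act = np.array([[10, 1, 1],
                [1, 20, 1],
                [1, 1, 30]], dtype="float32")

absmax = np.abs(act).max(axis=-1)  # 3 elements dumped instead of 9
print(absmax)  # [10. 20. 30.]
```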
    funcs: List[str],
    dataset: List[tvm.nd.NDArray],
    config: Dict[str, Any],
):
This function seems to implement a fairly standard calibration process. Is it possible to apply this without _smooth?
The goal is to demonstrate the accuracy improvement brought by smoothing, by running _calibration with or without smoothing.
Yes, it is possible. For quick testing I usually run only _smooth without _calibrate.
I need to double-check some passes because I use pattern matching there, but by design it should work (_calibration without _smooth).
print("[SmoothQuant] Run smoothing...") | ||
mod = _smooth(mod, params, model_names, dataset, smq_config) | ||
print("[SmoothQuant] Run calibration and quantization...") | ||
mod = _calibrate(mod, params, model_names, dataset, smq_config) |
Do we need to collect stats twice? Can't we use the first stats + smoothing scale to calculate smoothed max?
Short answer: for the current quantization scheme (per-tensor) we can call it once.
In the previous comment I mentioned that I dump preprocessed tensors (via R.absmax, for memory footprint optimization). For per-channel or per-token quantization we would need to call it twice, or dump the whole tensor, which I do not think is a good idea.
"per-channel or per-token quantization we need to call it twice"
Why is that?
To demonstrate this, here is the same example from above.
We have an R.linear op and input data (activations, a 2D tensor):
[[10, 1, 1],
 [1, 20, 1],
 [1, 1, 30]]
In the current implementation I use R.absmax and dump a 1D tensor: [10, 20, 30].
But for the groupwise, per-token (or per-channel) quantization schemes, it looks like this 1D tensor is not enough to calculate all the scales; as I understand it, we need the whole 2D tensor.
So, why do I like this op so much... :)
In my real-world experiments with an LLM on an A10g GPU, I hit cases where I did not fit into 22 GB of memory and the program crashed. So I decided to focus on memory optimization instead of calibration-time optimization, and this is how R.absmax was born... :)
But as I said, for the per-tensor quantization scheme this 1D vector is enough to calculate the scale, and we can collect the stats once.
I would like to say the following: if this issue (collecting stats twice) is critical, I can invest part of my time and try to fix it (single run + memory optimization). What do you think?
"If this issue is critical (collecting stats twice) I can invest part of my time and try to fix it (single run + memory optimization)"
Given that you are only supporting per-tensor for now and that the memory optimization is already done, I assume that single run + memory optimization doesn't require any non-trivial engineering. So I'd say why not.
That said, I still don't understand why we would ever need to collect twice, even for per-channel / per-token. We can keep track of a "running max" as we send new data to the model. After all the data is consumed, we are left with a per-channel or per-token "max vector", obtained in one pass. We don't have to dump anything else.
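A minimal sketch of the running-max idea (toy data; not the PR's code):

```python
import numpy as np

# Keep one accumulator and update it element-wise as each batch of
# activations is observed, so per-channel maxima are obtained in a single
# pass without dumping whole tensors.
activation_batches = [
    np.random.randn(8, 4096).astype("float32"),  # e.g. one batch per prompt
    np.random.randn(8, 4096).astype("float32"),
]

running_max = None
for act in activation_batches:
    batch_max = np.abs(act).max(axis=0)  # per-channel abs-max of this batch
    running_max = batch_max if running_max is None else np.maximum(running_max, batch_max)
# `running_max` now holds the per-channel max over the whole dataset.
```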
@masahi Thank you for the review! All the questions are reasonable and require clarification. I will address them one by one soon.
mlc_llm/transform/smoothquant.py
Outdated
act_scale = wildcard()
w_scale = wildcard()
lhs_sm = is_op("relax.annotate.smooth")(wildcard(), act_scale).has_attr(attrs)
rhs_sm = is_op("relax.annotate.smooth")(wildcard(), w_scale).has_attr(attrs)
We should replace them with a proper quantize op.
mlc_llm/utils.py
Outdated
mod_transform = relax.transform.DeadCodeElimination(transform_func_names)(mod_transform)
mod_deploy = relax.transform.DeadCodeElimination(model_names)(mod_deploy)
Need to add

mod_deploy = mod_deploy.with_attrs(
    {
        "external_mods": mod.get_attr("external_mods"),
        "const_name_to_constant": mod.get_attr("const_name_to_constant"),
    }
)

to support BYOC.
Main follow-ups after discussion:
Force-pushed from 24949b0 to 58be070.
This commit removed the usage of the R.smooth op with "multiply" attrs; R.divide and R.multiply are used instead.
This commit renames the old smq_a8q8f16 to smq_q8i8f16_0 (per-tensor quantization scheme) and adds a new one, smq_q8i8f16_1 (per-channel-for-weights quantization scheme).
This commit adds the asymmetric quantization scheme "smq_q8i8f16_2".
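A toy numpy sketch of what divide/multiply smoothing preserves (the scale formula here is illustrative, not the PR's exact one):

```python
import numpy as np

# Activations are divided by a per-channel scale s and the matching weight
# columns are multiplied by s, leaving the matmul result mathematically
# unchanged while shrinking activation outliers.
act = np.random.randn(4, 8).astype("float32")
w = np.random.randn(16, 8).astype("float32")
s = np.sqrt(np.maximum(np.abs(act).max(axis=0), 1e-5))  # toy smoothing scale

smoothed_act = act / s
smoothed_w = w * s
assert np.allclose(act @ w.T, smoothed_act @ smoothed_w.T, atol=1e-3)
```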
Force-pushed from 246eabd to 29501d3.
This commit fixes an issue with adding new outputs to the Dataflow block in the SmoothQuantStatCollector pass.
This commit adds support for the gsm8k dataset for calibration. Co-authored-by: Ailurus1 <kozulin02@yandex.com>
This is a PoC implementation of SmoothQuant.
Some important notes:
- NoQuantizationSpec was used in the QuantizationScheme description.
- smq_q8i8f16_0 - int8, per-tensor, symmetric for activations and int8, per-tensor, symmetric for weights.
- smq_q8i8f16_1 - int8, per-tensor, symmetric for activations and int8, per-channel, symmetric for weights.
- smq_q8i8f16_2 - int8, per-tensor, asymmetric for activations and int8, per-channel, asymmetric for weights.

How to run:
python build.py --model=models/Llama-2-7b-chat-hf --use-cache=0 --quantization=smq_q8i8f16_2 --max-seq-len=2048 --dataset=dummy
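For illustration, toy sketches of the two int8 scale styles these schemes combine (hypothetical helper names, not the PR's code):

```python
import numpy as np

# Per-tensor symmetric: one scale for the whole tensor, zero point fixed at 0.
def quantize_symmetric_per_tensor(x):
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype("int8")
    return q, scale

# Per-channel asymmetric: one (scale, zero_point) pair per channel, so the
# full [-128, 127] range is used even for skewed weight distributions.
def quantize_asymmetric_per_channel(w, axis=0):
    w_min = w.min(axis=axis, keepdims=True)
    w_max = w.max(axis=axis, keepdims=True)
    scale = np.maximum((w_max - w_min) / 255.0, 1e-8)
    zero_point = np.round(-128.0 - w_min / scale)
    q = np.clip(np.round(w / scale) + zero_point, -128, 127).astype("int8")
    return q, scale, zero_point
```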
Performance:
By default, this implementation tries to offload linear ops to the cuBLAS codegen. That's why it was decided to compare against q8f16_ft and q0f16 (cuBLAS). It was tested with Llama-2-7b-chat-hf. Example performance numbers were obtained with examples/python/benchmark.py on an A10g GPU.
For a small number of input tokens, q8f16_ft outperforms smq_q8i8f16_2:
But for a larger number of input tokens the situation is the opposite, and smq_q8i8f16_2 outperforms q8f16_ft (1.5x for prefill). Example with ~300 input tokens:
FYI cc @masahi