PoC implementation of SmoothQuant #855
Conversation
We need to merge that one first before this. Also, can the TVM-side change be sent to the unity branch?
Got it! I remember that ~3 months ago the Unity branch did not work properly with mlc-llm, as opposed to mlc-relax. I will switch to Unity.
# Run Decoder and update statistics for activations/weights
for _ in range(config["decoder_invoke_num"]):
    # TODO: support softmax with temperature.
Does temperature affect calibration?
The motivation for this "TODO" comment was to align this pipeline with what we have in the mlc_chat app.
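For context, a minimal sketch of what "softmax with temperature" means here (standard technique; the function name is hypothetical, not the PR's code):

```python
import numpy as np

# Logits are divided by the temperature before the softmax, so T > 1
# flattens the distribution and T < 1 sharpens it.
def softmax_with_temperature(logits: np.ndarray, temperature: float = 0.7) -> np.ndarray:
    scaled = logits / temperature
    scaled = scaled - scaled.max()  # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()
```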
mlc_llm/core.py
Outdated
    mod = smoothquant(args, mod, model_names)
    utils.debug_dump_script(mod, "mod_smoothquant.py", args)
else:
    mod = param_manager.transform_dequantize(mod)
Might need a discussion if we are adding a separate quantization path that doesn't go through the param manager cc @MasterJH5574
Thanks, @masahi, for pointing this out. This is a rather important question. It will be a little tricky, but I am waiting for other team members' opinions.
mlc_llm/transform/smoothquant.py
Outdated
max_value = tvm.tir.max_value(out_dtype)
dq_scale = R.multiply(R.astype(scale1, "float32"), R.astype(scale2, "float32"))
out = R.multiply(R.astype(call, dtype="float32"), dq_scale)
return R.astype(R.clip(out, min_value, max_value), dtype=out_dtype)
This manual quantize and dequantize should be replaced by Relax "QNN dialect" in the near future.
Strongly agree! In my opinion, a QNN dialect is very desirable in Relax.
Switched to the TVM/unity branch. PR#15686
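For reference, a rough numpy mirror of what the manual dequantize quoted above computes (an illustrative sketch assuming per-tensor float scales; names are hypothetical):

```python
import numpy as np

# Multiply the matmul result by the product of the two scales, then clip
# to the output dtype's representable range, mirroring the Relax snippet.
def dequantize(call_out, scale1, scale2, out_dtype="float16"):
    dq_scale = np.float32(scale1) * np.float32(scale2)
    out = call_out.astype("float32") * dq_scale
    info = np.finfo(out_dtype)
    return np.clip(out, info.min, info.max).astype(out_dtype)
```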
mlc_llm/transform/smoothquant.py
Outdated
        return call

def make_scale_param(shape: relax.ShapeExpr, dtype: str) -> tvm.relax.Var:
    n = 1 if self.mode == "quantize" else shape[-1]
Document what "mode" is and what shape[-1] means here.
stat_mod = mlc_llm.transform.SmoothQuantStatCollector()(mod)
stat_mod = mlc_llm.transform.FuseTransposeMatmul()(stat_mod)

prefill, decode, kvc, _, _ = get_runtime_func(funcs, stat_mod)
Is it possible to make the pipeline operate on an arbitrary module, rather than requiring the whole prefill / decode modules as input?
For example, for testing purposes I might want to quantize only one matmul op.
Very good question! I've been thinking about it too...
At first glance, each task (LLM, CV task, etc.) has its own pipeline. For example, the pipeline for an LLM includes one run of the encoder plus several runs of the decoder, while the pipeline for a computer vision task (ResNet, Inception, etc.) includes a single run of the "main" function. Unit tests can have a different set of Relax functions. (See the sketch of the LLM case below.)
I thought about providing a "simple" API so the user could configure their own pipeline and run it, but I don't have such a solution yet. So for me it is still an open question how to make the pipeline generic.
As for your matmul-only test example: earlier in this place I had code that checked for the presence of a "main" function in the IRModule, and if it was present I called a separate pipeline (for unit tests). But this approach is ugly and not generic.
The absence of such a "generic" pipeline is the main reason why I decided to make this PR against mlc-llm, not TVM/unity.
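To make the pipeline shape concrete, a minimal sketch of the LLM case described above (all names hypothetical; `prefill` and `decode` stand for the compiled entry functions):

```python
import numpy as np

# One prefill run over the prompt, then several decode steps; activation
# statistics are collected as a side effect of running the instrumented module.
def run_llm_pipeline(prefill, decode, prompt_tokens, decoder_invoke_num):
    logits = prefill(prompt_tokens)          # single prefill (encoder) run
    for _ in range(decoder_invoke_num):      # several decoder runs
        next_token = int(np.argmax(logits))  # greedy pick; no temperature yet
        logits = decode(next_token)
```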
assert args.build_model_only is False, "build_model_only=True is not supported in SMQ"
params = load_params(args.artifact_path, device=smq_device)

dataset, stop_tokens = _get_dummy_dataset(args.artifact_path, device=smq_device)
Related to the earlier comment, I think it is important to decouple this function from operating on the whole model. Meaning params can be a list with a single param in the minimal case, and dataset can be any list of vectors (not necessarily tokenized IDs).
    dataset: List[tvm.nd.NDArray],
    config: Dict[str, Any],
):
    mod = mlc_llm.transform.SmoothQuantAnnotator("quantize")(mod)
The input mod already has the smooth op, right? Why do we need to annotate again?
Yes, you are right, we do have smooth ops here. But these smooth ops operate like a multiply op for activation smoothing. Here we annotate the IRModule with new smooth ops that will be converted into an analogue of the qnn.quantize op (it should be qnn.quantize, but this op is currently absent, which is why I use the smooth op with the attribute mode="quantize").
I see. Rather than having a chain of smooth ops that do different things, I hope we can replace the first smooth with multiply and the second one with quantize.
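A toy numpy view of the proposed chain (hypothetical names; per-tensor symmetric int8 assumed):

```python
import numpy as np

# Smoothing becomes a plain element-wise multiply, followed by an explicit
# quantize, instead of two `smooth` ops with different modes.
def smooth_then_quantize(act, smooth_scale, quant_scale):
    smoothed = act * smooth_scale                # the "multiply" step
    q = np.round(smoothed / quant_scale)         # the "quantize" step
    return np.clip(q, -128, 127).astype("int8")
```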
scale_params = _calculate_quant_scale_params(fname, stat, config, tvm.cpu(0))
mod = relax.transform.BindParams(fname, scale_params)(mod)

mod = mlc_llm.transform.SmoothQuantOpConverter("quantize")(mod)
Is this line necessary, given that L177 already annotates with mode = "quantize"?
On L177 we annotate with the mode = "identity" attribute. This pass just changes the attribute from "identity" --> "quantize". By design, it should convert the smooth op --> qnn.quantize.
Oh I see, L177 is applying a different pass, SmoothQuantAnnotator...
w_scale = make_scale_param(weights.struct_info.shape, weights.struct_info.dtype)
lhs = R.smooth(act, a_scale, kind=1, mode="identity")
rhs = R.smooth(weights, w_scale, kind=2, mode="identity")
return R.linear(lhs, rhs)
Since this pass is always followed by SmoothQuantStatCollector, which replaces this dummy smooth with another dummy op, why not add absmax here instead of adding the first dummy smooth?
I think that will let us get rid of the need for the identity mode, which in turn might also remove the need for R.smooth altogether (instead we can just use R.multiply).
Hmm... I need to think about it, but I got your point and I like your idea to remove R.smooth!
Also, I use the R.smooth op so I can lower it to qnn.quantize during calibration. Instead, I can add a qnn.quantize op directly; then we can remove R.smooth, if I am not mistaken. What do you think about it?
Yeah, rather than adding R.smooth, I think we should just go ahead and add quantize / dequantize. We don't need a full-blown "QNN dialect", we just need these two ops.
mlc_llm/transform/smoothquant.py
Outdated
    R.smooth(m_smq2.args[0], relax.Constant(w_scale), kind=2, mode="identity")
)
a_out = self.builder_.emit(R.absmax(a_smq, kind=1), "a_out")
w_out = self.builder_.emit(R.absmax(w_smq, kind=2), "w_out")
Do we need R.absmax at all, given that it is only used here and it is just legalized anyway?
I will answer this question in the following comment.
mlc_llm/transform/smoothquant.py
Outdated
    R.smooth(m_smq2.args[0], relax.Constant(w_scale), kind=2, mode="identity")
)
a_out = self.builder_.emit(R.absmax(a_smq, kind=1), "a_out")
w_out = self.builder_.emit(R.absmax(w_smq, kind=2), "w_out")
I'm seeing many max applied on the Python side in smoothquant_utils.py as well.
- Can you document the purpose of each max where it is used?
- If we want to support per-channel or per-token quantization, which max needs to be modified?
A few words about R.absmax:
- It was added for optimization purposes only. I need to dump tensors for statistics collection, but for large models this memory footprint is huge! That's why I decided to introduce this op and dump already-preprocessed tensors.
- Currently this op operates like R.abs() --> R.max() --> R.squeeze(). I could use this sequence of ops instead of the new one.
- About per-channel or per-token quantization: yes, you are right, in that case it does not fit. We would need to modify this op or add a new one.

For the current quantization scheme (per-tensor) we can avoid using R.absmax and use R.abs() --> R.max() --> R.squeeze() instead. For per-channel or per-token quantization we need extra ops for preprocessing.
In my opinion, dumping whole tensors is not a very good idea and we need to preprocess them before dumping (to optimize the memory footprint).
If we are dumping max from cpp, why do we need np.max on the Python side?
And can you tell me why having R.absmax helps memory for per-channel or per-token cases, compared to a chain of plain ops?
About np.max:
I cannot get rid of it, because it is used to merge calibration statistics from different elements of the dataset.
For example: we have 3 prompts in our dataset - prompt1, prompt2, prompt3. For each prompt we have its own statistics: stat1, stat2, stat3. To calculate the final smoothing/quantization parameters I need to merge these statistics (stat1, stat2, stat3 --> final_stat) and work with this joined information. np.max is used during this "merge".
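A minimal sketch of this merge, assuming each prompt yields an abs-max vector of the same shape (toy data, hypothetical names):

```python
import numpy as np

# Each stat is one abs-max vector per prompt (e.g. produced by R.absmax);
# an element-wise np.max reduces them into the final statistic.
stat1 = np.array([10.0, 20.0, 30.0])
stat2 = np.array([12.0, 18.0, 25.0])
stat3 = np.array([9.0, 22.0, 28.0])

final_stat = np.max(np.stack([stat1, stat2, stat3]), axis=0)
print(final_stat)  # [12. 22. 30.]
```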
About R.absmax:
R.absmax should be identical to the chain of plain ops; it does not reduce the memory footprint compared to that chain (R.abs, R.max, R.squeeze).
But R.absmax itself helps to reduce memory in the following way. Example: suppose the R.linear op has the following input data (activations, a 2D tensor):
[[10, 1, 1],
 [1, 20, 1],
 [1, 1, 30]]
I do not save the whole tensor; after R.absmax I get a 1D tensor with 3 elements: [10, 20, 30], and I dump only this 1D vector. As a result, the memory reduction is 3x (9 vs. 3 elements).
This toy example demonstrates how R.absmax can help reduce memory consumption.
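In numpy terms, a sketch of this reduction (assuming the last axis is the one reduced over):

```python
import numpy as np

# Equivalent of the R.abs -> R.max -> R.squeeze chain that R.absmax fuses:
# the 2D activation tensor is reduced to a small abs-max vector before dumping.
act = np.array([[10, 1, 1],
                [1, 20, 1],
                [1, 1, 30]], dtype="float32")

absmax = np.abs(act).max(axis=-1)  # 3 elements dumped instead of 9
print(absmax)  # [10. 20. 30.]
```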
    funcs: List[str],
    dataset: List[tvm.nd.NDArray],
    config: Dict[str, Any],
):
This function seems to implement a fairly standard calibration process. Is it possible to apply this without _smooth?
The goal is to demonstrate the accuracy improvement brought by smoothing, by running _calibration with or without smoothing.
Yes, it is possible. For quick testing I usually run only _smooth without _calibrate.
I need to double-check some passes because I use pattern matching there, but by design it should work (_calibration without _smooth).
print("[SmoothQuant] Run smoothing...") | ||
mod = _smooth(mod, params, model_names, dataset, smq_config) | ||
print("[SmoothQuant] Run calibration and quantization...") | ||
mod = _calibrate(mod, params, model_names, dataset, smq_config) |
Do we need to collect stats twice? Can't we use the first stats + smoothing scale to calculate smoothed max?
Short answer: for the current quantization scheme (per-tensor) we can call it once.
In the previous comment I mentioned that I dump preprocessed tensors (via R.absmax, for memory footprint optimization). For per-channel or per-token quantization we would need to call it twice, or dump the whole tensor, which I do not think is a good idea.
"per-channel or per-token quantization we need to call it twice"
Why is that?
To demonstrate this, here is the same example from above.
We have an R.linear op and input data (activations, a 2D tensor):
[[10, 1, 1],
 [1, 20, 1],
 [1, 1, 30]]
In the current implementation I use R.absmax and dump a 1D tensor: [10, 20, 30].
But for the groupwise, per-token (or per-channel) quantization schemes, it looks like this 1D tensor is not enough to calculate all the scales; as I understand it, we need the whole 2D tensor.
So, why do I like this op so much... :)
In my real-world experiments with an LLM on an A10g GPU, I hit cases where I did not fit into 22 GB of memory and the program crashed. So I decided to focus on memory optimization instead of calibration-time optimization, and this is how R.absmax was born... :)
But as I said, for the per-tensor quantization scheme this 1D vector is enough to calculate the scale, and we can collect the stats once.
I would like to say the following: if this issue (collecting stats twice) is critical, I can invest part of my time and try to fix it (single run + memory optimization). What do you think?
"If this issue is critical (collecting stats twice) I can invest part of my time and try to fix it (single run + memory optimization)"
Given that you are only supporting per-tensor for now and that the memory optimization is already done, I assume that single run + memory optimization doesn't require any non-trivial engineering. So I'd say why not.
That said, I still don't understand why we would ever need to collect twice, even for per-channel / per-token. We can keep track of a "running max" as we send new data to the model. After all the data is consumed, we are left with a per-channel or per-token "max vector", obtained in one pass. We don't have to dump anything else.
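A minimal sketch of the running-max idea (toy data; not the PR's code):

```python
import numpy as np

# Keep one accumulator and update it element-wise as each batch of
# activations is observed, so per-channel maxima are obtained in a single
# pass without dumping whole tensors.
activation_batches = [
    np.random.randn(8, 4096).astype("float32"),  # e.g. one batch per prompt
    np.random.randn(8, 4096).astype("float32"),
]

running_max = None
for act in activation_batches:
    batch_max = np.abs(act).max(axis=0)  # per-channel abs-max of this batch
    running_max = batch_max if running_max is None else np.maximum(running_max, batch_max)
# `running_max` now holds the per-channel max over the whole dataset.
```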
@masahi Thank you for the review! All the questions are reasonable and require clarification. I will address them one by one soon.
mlc_llm/transform/smoothquant.py
Outdated
act_scale = wildcard()
w_scale = wildcard()
lhs_sm = is_op("relax.annotate.smooth")(wildcard(), act_scale).has_attr(attrs)
rhs_sm = is_op("relax.annotate.smooth")(wildcard(), w_scale).has_attr(attrs)
We should replace them with a proper quantize op.
mlc_llm/utils.py
Outdated
mod_transform = relax.transform.DeadCodeElimination(transform_func_names)(mod_transform)
mod_deploy = relax.transform.DeadCodeElimination(model_names)(mod_deploy)
Need to add

mod_deploy = mod_deploy.with_attrs(
    {
        "external_mods": mod.get_attr("external_mods"),
        "const_name_to_constant": mod.get_attr("const_name_to_constant"),
    }
)

to support BYOC.
Main follow-ups after discussion:
Force-pushed from 24949b0 to 58be070.
This commit removed the usage of the R.smooth op with "multiply" attrs; R.divide and R.multiply are used instead.
This commit renames the old smq_a8q8f16 to smq_q8i8f16_0 (per-tensor quantization scheme) and adds a new one, smq_q8i8f16_1 (per-channel-for-weights quantization scheme).
This commit adds the asymmetric quantization scheme "smq_q8i8f16_2".
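A toy numpy sketch of what divide/multiply smoothing preserves (the scale formula here is illustrative, not the PR's exact one):

```python
import numpy as np

# Activations are divided by a per-channel scale s and the matching weight
# columns are multiplied by s, leaving the matmul result mathematically
# unchanged while shrinking activation outliers.
act = np.random.randn(4, 8).astype("float32")
w = np.random.randn(16, 8).astype("float32")
s = np.sqrt(np.maximum(np.abs(act).max(axis=0), 1e-5))  # toy smoothing scale

smoothed_act = act / s
smoothed_w = w * s
assert np.allclose(act @ w.T, smoothed_act @ smoothed_w.T, atol=1e-3)
```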
Force-pushed from 246eabd to 29501d3.
This commit fixes an issue with adding new outputs to the Dataflow block in the SmoothQuantStatCollector pass.
This commit adds support for the gsm8k dataset for calibration. Co-authored-by: Ailurus1 <kozulin02@yandex.com>
This is a PoC implementation of SmoothQuant.
Some important notes:
- NoQuantizationSpec was used in the QuantizationScheme description.
- smq_q8i8f16_0 - int8, per-tensor, symmetric for activations and int8, per-tensor, symmetric for weights.
- smq_q8i8f16_1 - int8, per-tensor, symmetric for activations and int8, per-channel, symmetric for weights.
- smq_q8i8f16_2 - int8, per-tensor, asymmetric for activations and int8, per-channel, asymmetric for weights.

How to run:
python build.py --model=models/Llama-2-7b-chat-hf --use-cache=0 --quantization=smq_q8i8f16_2 --max-seq-len=2048 --dataset=dummy
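For illustration, toy sketches of the two int8 scale styles these schemes combine (hypothetical helper names, not the PR's code):

```python
import numpy as np

# Per-tensor symmetric: one scale for the whole tensor, zero point fixed at 0.
def quantize_symmetric_per_tensor(x):
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype("int8")
    return q, scale

# Per-channel asymmetric: one (scale, zero_point) pair per channel, so the
# full [-128, 127] range is used even for skewed weight distributions.
def quantize_asymmetric_per_channel(w, axis=0):
    w_min = w.min(axis=axis, keepdims=True)
    w_max = w.max(axis=axis, keepdims=True)
    scale = np.maximum((w_max - w_min) / 255.0, 1e-8)
    zero_point = np.round(-128.0 - w_min / scale)
    q = np.clip(np.round(w / scale) + zero_point, -128, 127).astype("int8")
    return q, scale, zero_point
```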
Performance:
By default, this implementation tries to offload linear ops to the cuBLAS codegen. That's why it was decided to compare against q8f16_ft and q0f16 (cuBLAS). It was tested with Llama-2-7b-chat-hf. Example performance numbers were obtained with examples/python/benchmark.py on an A10g GPU.
For a small number of input tokens, q8f16_ft outperforms smq_q8i8f16_2:
But for a larger number of input tokens the situation is the opposite, and smq_q8i8f16_2 outperforms q8f16_ft (1.5x for prefill). Example with ~300 input tokens:
FYI cc @masahi