
PoC implementation of SmoothQuant #855

Open · wants to merge 19 commits into main
Conversation

ibsidorenko commented Sep 1, 2023

This is a PoC implementation of SmoothQuant.

Some important notes:

  • It quantizes not only weights but also activations. This quantization is not integrated into ParamManager and lives as a separate pipeline; that is why NoQuantizationSpec is used in the QuantizationScheme description.
  • 3 new quantization schemes are implemented in this PR (the sketch after this list illustrates the per-tensor symmetric and asymmetric formulas):
    • smq_q8i8f16_0 - int8, per-tensor, symmetric for activations and int8, per-tensor, symmetric for weights.
    • smq_q8i8f16_1 - int8, per-tensor, symmetric for activations and int8, per-channel, symmetric for weights.
    • smq_q8i8f16_2 - int8, per-tensor, asymmetric for activations and int8, per-channel, asymmetric for weights.
  • It quantizes only R.linear ops (not matmul ops).
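For reference, here is a minimal NumPy sketch of the standard per-tensor int8 formulas behind the "symmetric" and "asymmetric" variants above. This is not code from the PR; the per-channel variants apply the same formulas with the reductions taken along one axis.

```python
import numpy as np

def quantize_symmetric(x):
    # Symmetric per-tensor int8: zero point fixed at 0, scale from the abs-max.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def quantize_asymmetric(x):
    # Asymmetric per-tensor int8: an explicit zero point maps [min, max] onto [-128, 127].
    # Assumes x has a non-degenerate range (max > min).
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(np.round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point=0):
    return (q.astype(np.float32) - zero_point) * scale
```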

How to run:
python build.py --model=models/Llama-2-7b-chat-hf --use-cache=0 --quantization=smq_q8i8f16_2 --max-seq-len=2048 --dataset=dummy

Performance:

By default, this implementation tries to offload linear ops to the cuBLAS codegen, which is why it is compared against q8f16_ft and q0f16 (cuBLAS). It was tested with Llama-2-7b-chat-hf. Below are example performance numbers obtained with examples/python/benchmark.py on an A10g GPU.

For a small number of input tokens, q8f16_ft outperforms smq_q8i8f16_2:

| Q scheme | Prefill, tok/s | Decoder, tok/s |
| --- | --- | --- |
| smq_q8i8f16_2 | 414.5 | 56.5 |
| q8f16_ft | 466.2 | 58.6 |
| q0f16 | 259.7 | 35.4 |

But for a larger number of input tokens the situation is reversed, and smq_q8i8f16_2 outperforms q8f16_ft (~1.5x for prefill). Example with ~300 input tokens:

| Q scheme | Prefill, tok/s | Decoder, tok/s |
| --- | --- | --- |
| smq_q8i8f16_2 | 6050.1 | 54.7 |
| q8f16_ft | 4259.2 | 56.6 |

FYI cc @masahi

masahi (Contributor) commented Sep 5, 2023

> To make it work, it is required to apply patch to mlc-relax

We need to merge that one first before this. Also can the TVM-side change be sent to apache/unity? Personally I don't use the mlc-relax branch, mlc-llm has been working with the upstream unity branch for a while.

ibsidorenko (Author):

> To make it work, it is required to apply patch to mlc-relax
>
> We need to merge that one first before this. Also can the TVM-side change be sent to apache/unity? Personally I don't use the mlc-relax branch, mlc-llm has been working with the upstream unity branch for a while.

Got it! I remember that ~3 months ago the unity branch did not work properly with mlc-llm, as opposed to mlc-relax. I will switch to unity.


# Run Decoder and update statistics for activations/weights
for _ in range(config["decoder_invoke_num"]):
    # TODO: support softmax with temperature.

masahi (Contributor):
Does temperature affect calibration?

ibsidorenko (Author):

The motivation for this "TODO" comment was to align this pipeline with what we have in the mlc_chat app.
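For context, a rough sketch of what "softmax with temperature" refers to here, as used for sampling in chat apps. This is a hypothetical helper, not code from this PR or from mlc_chat.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    # temperature == 0 is usually treated as greedy (argmax) sampling.
    if temperature == 0.0:
        probs = np.zeros_like(logits, dtype=np.float32)
        probs[np.argmax(logits)] = 1.0
        return probs
    scaled = logits / temperature
    scaled = scaled - scaled.max()   # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()
```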

mlc_llm/core.py Outdated
    mod = smoothquant(args, mod, model_names)
    utils.debug_dump_script(mod, "mod_smoothquant.py", args)
else:
    mod = param_manager.transform_dequantize(mod)

masahi (Contributor):

This might need a discussion if we are adding a separate quantization path that doesn't go through the param manager. cc @MasterJH5574

ibsidorenko (Author):

Thanks @masahi for pointing this out. This is a rather important question. It will be a little tricky, but I am waiting for other team members' opinions.

max_value = tvm.tir.max_value(out_dtype)
dq_scale = R.multiply(R.astype(scale1, "float32"), R.astype(scale2, "float32"))
out = R.multiply(R.astype(call, dtype="float32"), dq_scale)
return R.astype(R.clip(out, min_value, max_value), dtype=out_dtype)

masahi (Contributor):

This manual quantize and dequantize should be replaced by a Relax "QNN dialect" in the near future.

ibsidorenko (Author):

Strongly agree! In my opinion, a QNN dialect is very desirable in Relax.

ibsidorenko (Author):

> To make it work, it is required to apply patch to mlc-relax
>
> We need to merge that one first before this. Also can the TVM-side change be sent to apache/unity? Personally I don't use the mlc-relax branch, mlc-llm has been working with the upstream unity branch for a while.

Switched to the TVM unity branch: PR#15686.

return call

def make_scale_param(shape: relax.ShapeExpr, dtype: str) -> tvm.relax.Var:
    n = 1 if self.mode == "quantize" else shape[-1]

masahi (Contributor):

Document what "mode" is and what shape[-1] means here.

stat_mod = mlc_llm.transform.SmoothQuantStatCollector()(mod)
stat_mod = mlc_llm.transform.FuseTransposeMatmul()(stat_mod)

prefill, decode, kvc, _, _ = get_runtime_func(funcs, stat_mod)

masahi (Contributor):

Is it possible to make the pipeline operate on an arbitrary module, rather than requiring the whole prefill / decode modules as input?

For example, for testing purposes I might want to quantize only one matmul op.

ibsidorenko (Author):

Very good question! I've been thinking about it too...

At first glance, each task (LLM, CV task, etc.) has its own pipeline.
For example, the pipeline for an LLM includes a run of the encoder plus several runs of the decoder, while the pipeline for a computer vision task (ResNet, Inception, etc.) includes a single run of the "main" function. Unit tests can have yet another set of Relax functions.

I thought about providing a "simple" API so that the user could configure their own pipeline and run it, but I don't have such a solution yet. So for me it is still an open question how to make the pipeline generic.

As for your example with a matmul-only test: earlier, in this place, I had code that checked for the presence of a "main" function in the IRModule and, if present, called a separate pipeline (for unit tests). But that is an ugly, non-generic approach.

So the absence of such a "generic" pipeline is the main reason why I decided to open this PR against mlc-llm rather than TVM/unity.

assert args.build_model_only is False, "build_model_only in True is not supported in SMQ"
params = load_params(args.artifact_path, device=smq_device)

dataset, stop_tokens = _get_dummy_dataset(args.artifact_path, device=smq_device)

masahi (Contributor):

Related to the earlier comment, I think it is important to decouple this function from operating on the whole model: params can be a list with a single param in the minimal case, and dataset can be any list of vectors (not necessarily tokenized IDs).

    dataset: List[tvm.nd.NDArray],
    config: Dict[str, Any],
):
    mod = mlc_llm.transform.SmoothQuantAnnotator("quantize")(mod)

masahi (Contributor):

The input mod already has smooth ops, right? Why do we need to annotate again?

ibsidorenko (Author):

Yes, you are right, we do have smooth ops here. But those smooth ops act like a multiply op for activation smoothing. Here we annotate the IRModule with new smooth ops that will be converted into an analogue of the qnn.quantize op (it should be qnn.quantize, but that op does not exist yet, which is why I use the smooth op with the attribute mode="quantize").
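To illustrate what "act like a multiply op for activation smoothing" means, here is a minimal NumPy sketch of the smoothing transform, following the SmoothQuant paper's formulation rather than the PR's R.smooth implementation: activations are divided by a per-channel factor s, the weight absorbs it, and the linear output is mathematically unchanged.

```python
import numpy as np

def smooth(x, w, alpha=0.5):
    # x: [tokens, in_features], w: [out_features, in_features]
    act_max = np.abs(x).max(axis=0)              # per-channel activation abs-max
    w_max = np.abs(w).max(axis=0)                # per-channel weight abs-max
    s = act_max ** alpha / w_max ** (1 - alpha)  # SmoothQuant smoothing factor
    return x / s, w * s                          # (x / s) @ (w * s).T == x @ w.T

x = np.random.randn(4, 8).astype("float32")
w = np.random.randn(16, 8).astype("float32")
x_s, w_s = smooth(x, w)
assert np.allclose(x_s @ w_s.T, x @ w.T, atol=1e-4)
```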

masahi (Contributor):

I see. Rather than having a chain of smooth ops that do different things, I hope we can replace the first smooth with multiply and the second one with quantize.

scale_params = _calculate_quant_scale_params(fname, stat, config, tvm.cpu(0))
mod = relax.transform.BindParams(fname, scale_params)(mod)

mod = mlc_llm.transform.SmoothQuantOpConverter("quantize")(mod)

masahi (Contributor):

Is this line necessary, given that L177 already annotates with mode = "quantize"?

ibsidorenko (Author):

On L177 we annotate with the mode = "identity" attribute. This pass just changes the attribute from "identity" to "quantize". By design, it should convert the smooth op into qnn.quantize.

masahi (Contributor):

Oh I see, L177 is applying a different pass, SmoothQuantAnnotator...

w_scale = make_scale_param(weights.struct_info.shape, weights.struct_info.dtype)
lhs = R.smooth(act, a_scale, kind=1, mode="identity")
rhs = R.smooth(weights, w_scale, kind=2, mode="identity")
return R.linear(lhs, rhs)

masahi (Contributor):

Since this pass is always followed by SmoothQuantStatCollector, which replaces this dummy smooth with another dummy op, why not add absmax here instead of adding the first dummy smooth?

I think that will let us get rid of the need for the identity mode, which in turn might also remove the need for R.smooth altogether (instead we can just use R.multiply).

ibsidorenko (Author):

Hmm... I need to think about it, but I got your point and I like your idea to remove R.smooth!

Also, I use the R.smooth op so it can be lowered to qnn.quantize during calibration. Instead, I can add a qnn.quantize op directly; then we can remove R.smooth, if I am not mistaken. What do you think?

masahi (Contributor):

Yeah, rather than adding R.smooth, I think we should just go ahead and add quantize / dequantize. We don't need a full-blown "QNN dialect"; we just need these two ops.

R.smooth(m_smq2.args[0], relax.Constant(w_scale), kind=2, mode="identity")
)
a_out = self.builder_.emit(R.absmax(a_smq, kind=1), "a_out")
w_out = self.builder_.emit(R.absmax(w_smq, kind=2), "w_out")

masahi (Contributor):

Do we need R.absmax at all, given that it is only used here and it is just legalized anyway?

ibsidorenko (Author):

I will answer this question in the following comment.

R.smooth(m_smq2.args[0], relax.Constant(w_scale), kind=2, mode="identity")
)
a_out = self.builder_.emit(R.absmax(a_smq, kind=1), "a_out")
w_out = self.builder_.emit(R.absmax(w_smq, kind=2), "w_out")

masahi (Contributor):

I'm seeing many max calls applied on the Python side in smoothquant_utils.py as well.

  1. Can you document the purpose of each max where it is used?
  2. If we want to support per-channel or per-token quantization, which max needs to be modified?

ibsidorenko (Author):

A few words about R.absmax:

  1. It was added for optimization purposes only. I need to dump tensors for statistics collection, but for large models the memory footprint is huge. That is why I decided to introduce this op and dump already preprocessed tensors.
  2. Right now this op works like R.abs() --> R.max() --> R.squeeze(). I can use that sequence of ops instead of the new one.
  3. About per-channel or per-token quantization: yes, you are right, in that case it does not fit. We need to modify this op or add a new one.

For the current quantization scheme (per-tensor) we can avoid R.absmax and use R.abs() --> R.max() --> R.squeeze() instead. For per-channel or per-token quantization we need extra ops for preprocessing.
In my opinion, dumping whole tensors is not a good idea; we need to preprocess them before dumping (to optimize the memory footprint).

masahi (Contributor):

If we are dumping the max from the C++ side, why do we need np.max on the Python side?

masahi (Contributor):

And can you tell me why having R.absmax helps memory for per-channel or per-token cases, compared to a chain of plain ops?

ibsidorenko (Author):

About np.max:
I cannot get rid of it, because it is used to merge calibration statistics from the different elements of the dataset.
For example:
We have 3 prompts in our dataset - prompt1, prompt2, prompt3. For each prompt we have its own statistics: stat1, stat2, stat3. To calculate the final smoothing/quantization parameters I need to merge these stats (stat1, stat2, stat3 --> final_stat) and work with this joined information. np.max is used during this "merge".
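A sketch of the merge described above, assuming each prompt produces one abs-max vector per tensor (this layout is an assumption for illustration, not the PR's actual data structure):

```python
import numpy as np

stat1 = np.array([10.0, 2.0, 1.0])   # abs-max vector collected for prompt1
stat2 = np.array([3.0, 20.0, 1.0])   # ... for prompt2
stat3 = np.array([1.0, 2.0, 30.0])   # ... for prompt3

# Element-wise max across prompts gives the final statistic used for the
# smoothing/quantization parameters.
final_stat = np.max(np.stack([stat1, stat2, stat3]), axis=0)
# final_stat == [10., 20., 30.]
```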

ibsidorenko (Author):

About R.absmax:
R.absmax should be identical to the chain of plain ops; this op does not reduce the memory footprint compared to that chain (R.abs, R.max, R.squeeze).

But R.absmax itself helps to reduce memory in the following way. Suppose an R.linear op has the following input data (activations, a 2D tensor):
[[10, 1, 1],
[1, 20, 1],
[1, 1, 30]]
I do not save the whole tensor; after R.absmax I get a 1D tensor with 3 elements, [10, 20, 30], and I dump only this 1D vector. As a result, the memory reduction is 3x (9 vs. 3 elements).

This toy example demonstrates how R.absmax can help reduce memory consumption.
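The same toy example, reproduced with the plain-op chain (abs -> max -> squeeze) that R.absmax is described as being equivalent to; the reduction axis is an assumption for illustration.

```python
import numpy as np

acts = np.array([[10, 1, 1],
                 [1, 20, 1],
                 [1, 1, 30]], dtype="float32")

# abs -> max -> squeeze: keep only the per-row abs-max statistic.
stat = np.squeeze(np.abs(acts).max(axis=1, keepdims=True))
# stat == [10., 20., 30.]; only 3 values are dumped instead of 9.
```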

    funcs: List[str],
    dataset: List[tvm.nd.NDArray],
    config: Dict[str, Any],
):

masahi (Contributor):

This function seems to implement a fairly standard calibration process. Is it possible to apply this without _smooth?

The goal is to demonstrate the accuracy improvement brought by smoothing, by running _calibration with or without smoothing.

ibsidorenko (Author):

Yes, it is possible. For quick testing I usually run only _smooth without _calibrate.
I need to double-check some passes because I use pattern matching there, but by design it should work (_calibration without _smooth).

print("[SmoothQuant] Run smoothing...")
mod = _smooth(mod, params, model_names, dataset, smq_config)
print("[SmoothQuant] Run calibration and quantization...")
mod = _calibrate(mod, params, model_names, dataset, smq_config)

masahi (Contributor):

Do we need to collect stats twice? Can't we use the first stats + smoothing scale to calculate smoothed max?

ibsidorenko (Author):

Short answer: for the current quantization scheme (per-tensor) we can call it once.

In the previous comment I mentioned that I dump preprocessed tensors (R.absmax, for memory-footprint optimization). For per-channel or per-token quantization we would need to call it twice, or dump the whole tensor, and I do not think that is a good idea.

masahi (Contributor):

> per-channel or per-token quantization we need call it twice

Why is that?

ibsidorenko (Author):

To demonstrate this, here is the same example from above:
We have an R.linear op and input data (activations, a 2D tensor):
[[10, 1, 1],
[1, 20, 1],
[1, 1, 30]]
In the current implementation I use R.absmax and dump the 1D tensor [10, 20, 30].
But for a groupwise, per-token (or per-channel) quantization scheme, it looks like this 1D tensor is not enough to calculate all the scales. As I understand it, we need the whole 2D tensor.

So, why do I like this op so much... :)
In my real-life experiments with LLMs on an A10g GPU I hit cases where I did not fit into 22 GB of memory, and as a result the program crashed.
So I decided to focus on memory optimization instead of calibration-time optimization. That is how R.absmax was born... :)
But as I said, for the per-tensor quantization scheme this 1D vector is enough to calculate the scale, and we can collect stats once.

I would like to say the following:
If this issue is critical (collecting stats twice), I can invest part of my time and try to fix it (single run + memory optimization).
What do you think?

masahi (Contributor) commented Sep 13, 2023:

> If this issue is critical (collect stats twice) I can invest part of my time and try to fix it (single run + memory optimization)

Given that you are only supporting per-tensor for now and that the memory optimization is already done, I assume that a single run + memory optimization doesn't require any non-trivial engineering. So I'd say why not.

That said, I still don't understand why we would ever need to collect twice, even for per-channel / per-token. We can keep track of a "running max" as we send new data to the model. After all the data is consumed, we are left with a per-channel or per-token "max vector", obtained in one pass. We don't have to dump anything else.
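A sketch of the "running max" idea with a hypothetical helper (not PR code): the per-channel abs-max is updated in place as calibration batches stream through, so a single pass over the data is enough and nothing else has to be dumped.

```python
import numpy as np

class RunningAbsMax:
    def __init__(self, num_channels):
        self.value = np.zeros(num_channels, dtype="float32")

    def update(self, batch):
        # batch: [tokens, channels]; keep the element-wise running maximum.
        self.value = np.maximum(self.value, np.abs(batch).max(axis=0))

tracker = RunningAbsMax(num_channels=8)
for batch in (np.random.randn(4, 8).astype("float32") for _ in range(3)):
    tracker.update(batch)
# tracker.value now holds the per-channel abs-max over all batches.
```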

ibsidorenko (Author):

@masahi Thank you for the review! All your questions are reasonable and deserve clarification. I will address them one by one soon.

act_scale = wildcard()
w_scale = wildcard()
lhs_sm = is_op("relax.annotate.smooth")(wildcard(), act_scale).has_attr(attrs)
rhs_sm = is_op("relax.annotate.smooth")(wildcard(), w_scale).has_attr(attrs)

masahi (Contributor):

We should replace them with a proper quantize op.

mlc_llm/utils.py Outdated

mod_transform = relax.transform.DeadCodeElimination(transform_func_names)(mod_transform)
mod_deploy = relax.transform.DeadCodeElimination(model_names)(mod_deploy)

masahi (Contributor):

Need to add

mod_deploy = mod_deploy.with_attrs(
    {
        "external_mods": mod.get_attr("external_mods"),
        "const_name_to_constant": mod.get_attr("const_name_to_constant"),
    }
)

to support BYOC.

ibsidorenko (Author) commented Sep 19, 2023

Main follow-ups after the discussion:

  • Remove usage of R.absmax op (use chain of simple ops: abs, max)
  • Remove R.smooth op: use multiply or qnn.quantize (depends on PR#15772)
  • Use single run for statistics collection

ibsidorenko and others added 3 commits December 7, 2023 18:04
This commit fixes an issue with adding new outputs in the Dataflow block
in the SmoothQuantStatCollector pass.
This commit adds support for the gsm8k dataset for calibration.

Co-authored-by: Ailurus1 <kozulin02@yandex.com>