Skip to content

Latest commit

 

History

History
34 lines (28 loc) · 1.5 KB

NON_GRANULAR.md

File metadata and controls

34 lines (28 loc) · 1.5 KB

X-LoRA non-granular scalings

A key limitation of the X-LoRA architecture is the need for 2 forward passes of the model per generation step. To trade off model performance for speed, mistral.rs allows the user to reduce the granularity of the scalings by caching them in a technique we call Non Granular Scalings.

How it works

For the first $k$ generation steps, the scalings are calculated normally for each token. However, for the rest of the tokens, it is cached and re-used. In this way, we are able to avoid the second forward pass and the performance is increased significantly. To maintain correctness, enabling non-granular scalings will restrict the engine to processing one sequence at a time.

How to use it

Command line

This can be enabled by passing --tgt-non-granular-index followed by $k$:

./mistralrs_server --port 1234 x-lora-plain -o orderings/xlora-paper-ordering.json -x lamm-mit/x-lora --tgt-non-granular-index 5

Python

Set the tgt_non_granular_index attribute to a non-None value in the Which selection:

from mistralrs import Runner, Which

runner = Runner(
    which=Which.XLoraGGUF(
        tok_model_id=None,  # Automatically determine from ordering file
        quantized_model_id="TheBloke/zephyr-7B-beta-GGUF",
        quantized_filename="zephyr-7b-beta.Q4_0.gguf",
        tokenizer_json=None,
        repeat_last_n=64,
        xlora_model_id="lamm-mit/x-lora",
        order="orderings/xlora-paper-ordering.json",
        tgt_non_granular_index=5,
    )
)

...