Feature request

Currently, every LoRA layer is moved from CPU to the target device of the base model, which adds an extra ~20 ms per layer and roughly 500 ms to over 1 s of latency overall.
First, the adapter weights are loaded into CPU memory in `load_module_map`:
```python
def load_module_map(
    ...
    # every safetensors shard is fully materialized in CPU memory
    for filename in adapter_filenames:
        adapter_weights.update(load_file(filename))
    ...
```
Then the weights are moved to the GPU device inside `load_batched_adapter_weights`, one layer at a time.
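For illustration, here is a minimal sketch of that two-step pattern (the function name `load_adapter_two_step` and the exact structure are assumptions for illustration, not the project's actual code):

```python
from safetensors.torch import load_file

def load_adapter_two_step(adapter_filenames, device="cuda:0"):
    # Step 1: load_file() materializes every tensor in CPU RAM first.
    adapter_weights = {}
    for filename in adapter_filenames:
        adapter_weights.update(load_file(filename))
    # Step 2: each layer's tensor is then copied to the GPU separately;
    # at ~20 ms per copy, hundreds of LoRA layers add 500 ms to 1+ s.
    return {name: t.to(device) for name, t in adapter_weights.items()}
```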
Motivation

Improve the adapter loading performance.

Your contribution

Yes, I will prepare a PR for review.

Thanks for working on this @thincal! We could probably work around this by keeping the weights in the safetensors file rather than loading them to CPU as an intermediate step.
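One way to realize that suggestion (a sketch under the assumption that adapters are stored as safetensors files; `load_adapter_direct` is a hypothetical name): `safe_open` deserializes each tensor directly onto the target device, so the intermediate CPU copy and the per-layer transfers disappear.

```python
from safetensors import safe_open

def load_adapter_direct(adapter_filenames, device="cuda:0"):
    adapter_weights = {}
    for filename in adapter_filenames:
        # safe_open lazily maps the file; get_tensor() deserializes each
        # tensor straight onto `device`, skipping the CPU staging copy.
        with safe_open(filename, framework="pt", device=device) as f:
            for name in f.keys():
                adapter_weights[name] = f.get_tensor(name)
    return adapter_weights
```

Equivalently, `safetensors.torch.load_file(filename, device=device)` loads a whole file onto the target device in one call.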