Feature request

Currently, every LoRA layer is moved from CPU to the target device of the base model, which adds an extra ~20 ms per layer and roughly 500 ms to over 1 s of latency overall.
First, the adapter weights are loaded into CPU memory in `load_module_map`:
```python
def load_module_map(
    ...
    # every safetensors shard is fully materialized in CPU memory
    for filename in adapter_filenames:
        adapter_weights.update(load_file(filename))
    ...
```
Then the weights are moved to the GPU device inside `load_batched_adapter_weights`, one layer at a time.
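For illustration, here is a minimal sketch of that two-step pattern (the function name `load_adapter_two_step` and the exact structure are assumptions for illustration, not the project's actual code):

```python
from safetensors.torch import load_file

def load_adapter_two_step(adapter_filenames, device="cuda:0"):
    # Step 1: load_file() materializes every tensor in CPU RAM first.
    adapter_weights = {}
    for filename in adapter_filenames:
        adapter_weights.update(load_file(filename))
    # Step 2: each layer's tensor is then copied to the GPU separately;
    # at ~20 ms per copy, hundreds of LoRA layers add 500 ms to 1+ s.
    return {name: t.to(device) for name, t in adapter_weights.items()}
```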
Motivation

Improve the adapter loading performance.

Your contribution

Yes, I will prepare a PR for review.

Thanks for working on this @thincal! We could probably work around this by keeping the weights in the safetensors file rather than loading them to CPU as an intermediate step.
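One way to realize that suggestion (a sketch under the assumption that adapters are stored as safetensors files; `load_adapter_direct` is a hypothetical name): `safe_open` deserializes each tensor directly onto the target device, so the intermediate CPU copy and the per-layer transfers disappear.

```python
from safetensors import safe_open

def load_adapter_direct(adapter_filenames, device="cuda:0"):
    adapter_weights = {}
    for filename in adapter_filenames:
        # safe_open lazily maps the file; get_tensor() deserializes each
        # tensor straight onto `device`, skipping the CPU staging copy.
        with safe_open(filename, framework="pt", device=device) as f:
            for name in f.keys():
                adapter_weights[name] = f.get_tensor(name)
    return adapter_weights
```

Equivalently, `safetensors.torch.load_file(filename, device=device)` loads a whole file onto the target device in one call.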