LoRA swapping at runtime #259

Open
BHX2 opened this issue May 1, 2024 · 10 comments
Labels: models (Additions to model or architectures), new feature (New feature or request)

Comments

BHX2 commented May 1, 2024

A feature allowing LoRA adapters to be swapped at runtime could reduce the overhead of running multiple specialized model adapters. This could facilitate either serving different models to individual users (akin to predibase/lorax) or building a Composition of Experts with programmatic control over routing to adapters.

EricLBuehler added the new feature and models labels on May 1, 2024
EricLBuehler (Owner)

Hi @BHX2, thank you for raising this.

I am considering two options for implementing this and wanted your opinion:

  1. At startup time, tell mistral.rs which adapters can be used and which will be used. Then, at runtime, adapters can be dynamically scaled with the LoRA alpha value (see the sketch after this list).
  • The benefit is that it avoids the large temporal cost of downloading the adapters and loading them from disk, which makes applications such as serving different models and CoE feasible.
  • However, it increases memory usage because all possible adapters must be kept in memory.
  2. At startup time, only tell mistral.rs which adapters are currently being used. Then, at runtime, adapters can be loaded from disk.
  • The advantage here is increased flexibility and a reduced latent memory footprint, although I suspect most users will know the set of adapters in advance.
  • The key disadvantage is the latency of loading the adapters from disk.
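For reference, the alpha scaling in option 1 is just the standard LoRA arithmetic (nothing mistral.rs-specific): for a layer with frozen weight $W_0$ and adapter matrices $A$ and $B$, the effective weight is

$$W_{\text{eff}} = W_0 + \frac{\alpha}{r} B A$$

so setting $\alpha = 0$ disables the adapter while keeping $A$ and $B$ resident in memory, and restoring $\alpha$ re-enables it. Merging bakes the $\frac{\alpha}{r} B A$ term into $W_0$; unmerging subtracts it back out.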

Both methods require a merge + unmerge cycle for the weights. This is because we currently merge the LoRA weights into the frozen weights to remove the overhead of LoRA. Therefore, I propose a third option:

  3. If the user tells mistral.rs to allow dynamically loading adapters, start by not merging the weights. The user can even provide a set of adapters to pre-load into device memory. Upon activation of a set of adapters, they are simply enabled (see the sketch after this list). If an adapter which is not pre-loaded is requested to be activated, then it is downloaded and loaded.
  • The advantage is high flexibility, because it combines options 1 and 2 (the user chooses the tradeoff between memory footprint and swapping speed).
  • The disadvantage is that weight merging is disabled in this case, although that should be a small cost.
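To make the unmerged path concrete, here is a minimal sketch of a toggleable LoRA layer (plain Rust with nested `Vec<f32>` matrices purely for illustration; none of these types or names come from the mistral.rs codebase):

```rust
// Illustrative only: a linear layer that keeps its base weight frozen and
// applies whichever pre-loaded LoRA adapters are currently active, instead
// of merging them into the base weight.
use std::collections::HashMap;

struct LoraAdapter {
    a: Vec<Vec<f32>>, // r x in_dim
    b: Vec<Vec<f32>>, // out_dim x r
    scale: f32,       // alpha / r
}

struct LoraLinear {
    weight: Vec<Vec<f32>>,                  // out_dim x in_dim, frozen base weight
    adapters: HashMap<String, LoraAdapter>, // pre-loaded adapters, by name
    active: Vec<String>,                    // names of the currently active adapters
}

fn matvec(m: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    m.iter()
        .map(|row| row.iter().zip(x).map(|(w, xi)| w * xi).sum())
        .collect()
}

impl LoraLinear {
    /// Swapping adapters at runtime is just changing `active`:
    /// no merge/unmerge of `weight` is required.
    fn activate(&mut self, names: &[String]) {
        self.active = names.to_vec();
    }

    fn forward(&self, x: &[f32]) -> Vec<f32> {
        // Base path: y = W x
        let mut y = matvec(&self.weight, x);
        // Add scale * B (A x) for every active adapter.
        for name in &self.active {
            if let Some(ad) = self.adapters.get(name) {
                let ax = matvec(&ad.a, x);
                let bax = matvec(&ad.b, &ax);
                for (yi, di) in y.iter_mut().zip(bax) {
                    *yi += ad.scale * di;
                }
            }
        }
        y
    }
}
```

The small cost mentioned above is the extra `B (A x)` work on every forward pass for the active adapters, which is exactly what merging normally avoids.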

What are your thoughts on the best method for implementing this? I think that 3 is the best option all-around, but if you have any other ideas please let me know!

BHX2 (Author) commented May 2, 2024

I agree, option 3 seems the most flexible, although if option 1 is simpler to implement in the current codebase then it also seems good.

If the merging & unmerging needs to happen on every request to a different LoRA, that seems taxing. I'm not sure how X-LoRA dynamically scales the LoRAs per token, but I was hoping there'd be some way to scale one LoRA at a time to full use for an entire response, and then change it to another adapter on the next request-response cycle with the least friction. The use I have in mind involves a relatively small number of models that would be known at startup. Rather than having a trained router and token-level granularity, which X-LoRA seems particularly suited for, I'm interested to see if I could use programmatic logic and a huggingface/setfit model (trainable with few examples) to route messages to different LoRAs.
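For what it's worth, the routing side of that could be as simple as the sketch below (Rust purely for illustration; the `classify` function stands in for a SetFit-style classifier, and the adapter names are made up):

```rust
// Illustrative per-request routing: pick one adapter for the whole response
// based on a cheap classification of the incoming message.
fn classify(message: &str) -> &'static str {
    // Stand-in for a trained few-shot classifier (e.g. a SetFit model).
    if message.to_lowercase().contains("select") {
        "sql"
    } else {
        "general"
    }
}

fn adapter_for(label: &str) -> Option<&'static str> {
    match label {
        "sql" => Some("text2sql-lora"),
        "planning" => Some("agent-planner-lora"),
        _ => None, // fall back to the base model
    }
}

fn main() {
    let message = "Select the top customers by revenue, please";
    let adapter = adapter_for(classify(message));
    println!("route this request with adapter: {:?}", adapter);
    // The chosen adapter would then stay active for the full
    // request-response cycle, and could change on the next request.
}
```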

Training LoRAs individually to do specific tasks that can be tested and iterated on separately seems useful. I'm surprised that LoRAs haven't taken off for Llama and Mistral the way they have for Stable Diffusion. Maybe instruction-tuned models are good enough for most uses right now, but there are burgeoning specialization niches: natural-language-to-SQL translation, agentic planning, creative writing, RAG, etc. I appreciate your research and engineering work, which is furthering the potential for a LoRA bloom with LLMs!

LLukas22 (Contributor) commented May 2, 2024

In my opinion, 3 would also be the best option, especially if it were possible to define the adapters to use per request to the server. (This could be problematic when multiple requests with different adapter selections come in at the same time 🤔)

This way I could finetune some "expert" adapters in different domains and simply swap them in / let users decide which one they want to use for their request.

EricLBuehler (Owner)

Great, thanks for your feedback. I think I will add the preloading to the ordering file and then expose the activation API in the HTTP request or the Request objects for the Rust/Python APIs.

EricLBuehler (Owner)

@BHX2, @LLukas22: I have a working implementation in #262 (on the lora_swapping branch) of LoRA swapping at runtime. Currently, the only missing feature is that there is no way to run different adapters for different requests in the same batch, but I think this is a good start. Swapping is currently controlled by an API on MistralRs, so it is not per-request yet. I plan on adding that soon, but the current implementation is good enough for testing already.

Would your expected use case be made easier by allowing the per-request activation?

Here are some examples of the new API introduced in the PR.

LLukas22 (Contributor) commented May 8, 2024

@EricLBuehler

Thanks for adding this. My main use case for mistral.rs is using it as an async server alternative to ollama, and I can only provide my opinions on the server implementation.

> Would your expected use case be made easier by allowing the per-request activation?

Well, since I deal with multi-user scenarios, it would make things a lot easier if the adapters could be defined per request, as user A shouldn't have to care about which adapter user B recently loaded, and vice versa.

> Currently, the only missing feature is that there is no way to run different adapters for different requests in the same batch, but I think this is a good start

I think processing multiple different adapters in a single batch is a bit overkill (but it would be nice if the implementation isn't too complicated). For now we could simply fall back to queuing the different adapters and then processing them one after another, if that would be simpler (roughly as sketched below).
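A rough sketch of that queuing fallback (illustrative Rust only, not the mistral.rs scheduler; `Request` here is a stand-in type):

```rust
// Illustrative only: queue incoming requests by requested adapter set and
// drain one adapter group at a time, activating adapters once per group.
use std::collections::{HashMap, VecDeque};

#[derive(Debug)]
struct Request {
    prompt: String,
}

#[derive(Default)]
struct AdapterScheduler {
    // Key: requested adapter names (sorted so equal sets map to the same queue).
    queues: HashMap<Vec<String>, VecDeque<Request>>,
}

impl AdapterScheduler {
    fn enqueue(&mut self, mut adapters: Vec<String>, req: Request) {
        adapters.sort();
        self.queues.entry(adapters).or_default().push_back(req);
    }

    /// Pop the next adapter group: activate its adapters once, then process
    /// every queued request that asked for that same set.
    fn next_group(&mut self) -> Option<(Vec<String>, Vec<Request>)> {
        let key = self.queues.keys().next()?.clone();
        let queue = self.queues.remove(&key)?;
        Some((key, queue.into_iter().collect()))
    }
}

fn main() {
    let mut sched = AdapterScheduler::default();
    sched.enqueue(vec!["sql".into()], Request { prompt: "list users".into() });
    sched.enqueue(vec!["general".into()], Request { prompt: "hi".into() });
    sched.enqueue(vec!["sql".into()], Request { prompt: "top orders".into() });

    while let Some((adapters, batch)) = sched.next_group() {
        println!("activate {:?}, then run {} request(s)", adapters, batch.len());
    }
}
```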

EricLBuehler (Owner)

> I think processing multiple different adapters in a single batch is a bit overkill (but it would be nice if the implementation isn't too complicated). For now we could simply fall back to queuing the different adapters and then processing them one after another, if that would be simpler.

Processing multiple different adapters in a single batch is hard, I think. #273 will allow scheduling requests with different adapters more fairly, so this should be possible.

BHX2 (Author) commented May 8, 2024

This is looking like a great start! I saw in the documentation example that it's basically using the X-LoRA ordering file; I'm assuming the trained X-LoRA gating head isn't actually needed. For my use case, per-request activation would also be ideal, and batching seems like a fancy bonus but not necessary, at least at this point.

EricLBuehler (Owner)

We use the adapter model ordering file here. It can be used for both LoRA and X-LoRA, so when using LoRA it does not load the classifier. I'll be adding the per-request activation tonight, but the basic API is already done and I expect to merge soon.

EricLBuehler (Owner)

@BHX2, @LLukas22 I just merged #262! You can now use per-request LoRA activation in all APIs. After setting up your adapter model ordering file, you can try it out: examples and docs.
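For anyone skimming this thread later, per-request activation conceptually amounts to attaching an adapter selection to an individual request, roughly like the sketch below (Rust with serde_json purely for illustration; the `adapters` field name and the overall schema here are assumptions, so consult the linked examples and docs for the real request shape):

```rust
// Illustrative only: an OpenAI-style chat completion body carrying a
// per-request adapter selection. The `adapters` field name is an assumption;
// check the mistral.rs examples/docs for the actual schema.
use serde_json::json;

fn main() {
    let body = json!({
        "model": "mistral",
        "messages": [
            { "role": "user", "content": "Translate into SQL: total sales last month?" }
        ],
        // Hypothetical: adapters activated only for this request/response.
        "adapters": ["text2sql-lora"]
    });
    println!("{}", serde_json::to_string_pretty(&body).unwrap());
    // This body would be POSTed to the running mistral.rs server's chat
    // completions endpoint by whatever HTTP client you prefer.
}
```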
