LoRA swapping at runtime #259
Hi @BHX2, thank you for raising this. I am considering 2 options for implementing this and wanted your opinion:
Both methods require a merge + unmerge cycle for the weights. This is because we currently merge the LoRA weights into the frozen weights to remove the overhead of LoRA. Therefore, I propose a third option:
What are your thoughts on the best method for implementing this? I think that 3 is the best option all-around, but if you have any other ideas please let me know!
I agree, option 3 seems the most flexible, although if option 1 is simpler to implement into the current codebase then it also seems good. If the merging and unmerging needs to happen on every request to a different LoRA, that seems taxing.

I'm not sure how X-LoRA dynamically scales the LoRAs per token, but I was hoping there'd be some way to scale one LoRA at a time to full use for an entire response, and then change it to another adapter on the next request-response cycle with the least friction. The use I have in mind involves a relatively small number of adapters that would be known at startup. Rather than having a trained router and token-level granularity, which X-LoRA seems particularly suited for, I'm interested to see if I could use programmatic logic and a huggingface/setfit model (trainable with few examples) to route messages to different LoRAs. Training LoRAs individually to do specific tasks that can be tested and iterated on separately seems useful.

I'm surprised that LoRAs haven't taken off with Llama and Mistral the way they have with Stable Diffusion. Maybe instruction-tuned models are good enough for most uses right now, but there are burgeoning specialization niches: natural-language-to-SQL translation, agentic planning, creative writing, RAG, etc. I appreciate your research and engineering work, which is furthering the potential for a LoRA bloom with LLMs!
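A minimal sketch of the routing idea described in this comment, assuming a small SetFit classifier has already been fine-tuned on a few examples per label; the model id, labels, and adapter names below are all hypothetical:

```python
from setfit import SetFitModel

# Hypothetical SetFit model fine-tuned on a handful of routing examples.
router = SetFitModel.from_pretrained("my-org/message-router")

# Hypothetical mapping from classifier labels to LoRA adapter names.
LABEL_TO_ADAPTER = {
    "sql": "nl2sql-lora",
    "planning": "agent-planner-lora",
    "creative": "creative-writing-lora",
}

def pick_adapter(message: str) -> str:
    """Classify one incoming message and return the adapter to activate."""
    label = router.predict([message])[0]
    return LABEL_TO_ADAPTER.get(label, "default-lora")
```

The classifier runs once per request, so the chosen adapter stays fixed for the whole response, matching the request-response granularity described above.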
In my opinion, 3 would also be the best option, especially if it would be possible to define the adapters to use per request to the server. (This could be problematic when multiple requests with different adapter selections come in at the same time 🤔) This way I could fine-tune some "expert" adapters in different domains and simply swap them in, or let the user decide which one they want to use for their request.
Great, thanks for your feedback. I think I will add the preloading to the ordering file and then expose the activation API in the HTTP request or the Request objects for the Rust/Python APIs.
@BHX2, @LLukas22: I have a working implementation in #262. Would your expected use case be made easier by allowing the per-request activation? Here are some examples of the new API introduced in the PR.
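A hypothetical sketch of the shape such an activation API could take (preload adapters at startup, then activate a subset at runtime); none of these names are taken from the PR itself:

```python
# Illustrative only: models the preload-then-activate flow discussed above.
class AdapterEngine:
    def __init__(self, preloaded: list[str]) -> None:
        self.preloaded = set(preloaded)  # adapters named in the ordering file
        self.active: list[str] = []

    def activate_adapters(self, names: list[str]) -> None:
        # Only preloaded adapters can be activated without a reload.
        unknown = set(names) - self.preloaded
        if unknown:
            raise ValueError(f"not preloaded: {unknown}")
        self.active = names  # this swap is where the merge/unmerge cycle runs

engine = AdapterEngine(preloaded=["nl2sql-lora", "creative-writing-lora"])
engine.activate_adapters(["nl2sql-lora"])
```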
Thanks for adding this. My main use case of
Well, since I deal with multi-user scenarios, it would make things a lot easier if the adapters could be defined per request.
I think processing multiple different adapters in a single batch is a bit overkill (but it would be nice if the implementation isn't too complicated). For now we could simply fall back to queuing the different adapters and then processing them one after another, if that would be simpler.
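A minimal sketch of this queuing fallback: group pending requests by adapter so each merge/unmerge cycle is amortized over a whole batch of same-adapter requests instead of happening once per request. `activate` and `process` are placeholders for the engine's real operations:

```python
from collections import defaultdict

def activate(adapter: str) -> None:
    print(f"merge weights for {adapter}")    # placeholder for the real swap

def process(request: str) -> None:
    print(f"run inference for {request!r}")  # placeholder for generation

def drain_by_adapter(pending: list[tuple[str, str]]) -> None:
    """pending: list of (adapter_name, request) pairs in arrival order."""
    buckets: dict[str, list[str]] = defaultdict(list)
    for adapter, request in pending:
        buckets[adapter].append(request)
    for adapter, requests in buckets.items():
        activate(adapter)          # one merge/unmerge cycle per adapter...
        for request in requests:   # ...instead of one per request
            process(request)
```

The trade-off is fairness: strict grouping can starve requests for rarely-used adapters, which is the scheduling concern the next comment addresses.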
Processing multiple different adapters in a single batch is hard, I think. #273 will allow scheduling requests with different adapters more fairly, so this should be possible. |
This is looking like a great start! I saw in the documentation example that it's basically using the X-LoRA ordering file. I'm assuming the trained X-LoRA gating head isn't actually needed. For my use case, per-request activation would also be ideal, and batching seems like a fancy bonus but not necessary, at least at this point.
We use the adapter model ordering file here. It can be used for both LoRA and X-LoRA, so when using LoRA it does not load the classifier. I'll be adding the per-request activation tonight, but the basic API is already done and I expect to merge soon. |
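An illustrative sketch of writing such an ordering file; the exact schema comes from the project's docs and tooling, and the field name and adapter names below are assumptions:

```python
import json

# Assumed shape: a JSON file listing the adapter names known at startup,
# in order. The real schema may carry additional fields.
ordering = {
    "order": ["nl2sql-lora", "agent-planner-lora", "creative-writing-lora"],
}

with open("ordering.json", "w") as f:
    json.dump(ordering, f, indent=2)
```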
@BHX2, @LLukas22 I just merged #262! You can use per-request LoRA activation in all APIs. After setting up your adapter model ordering file, you can try it out: examples and docs.
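A sketch of exercising per-request activation over the HTTP server, assuming an OpenAI-style chat completions endpoint on localhost; the `"adapters"` request key and port are assumptions, so check the linked docs for the exact names:

```python
import requests

resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "mistral",
        "messages": [
            {"role": "user", "content": "Translate to SQL: top 5 customers by spend"}
        ],
        "adapters": ["nl2sql-lora"],  # assumed per-request activation key
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```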
A feature allowing swapping of LoRA adapters at runtime could reduce the overhead of running multiple specialized model adapters. It could either facilitate serving different models to individual users (akin to predibase/lorax) or building a Composition of Experts that allows programmatic control of routing to adapters.