Serve multiple models with llamacpp server #10431
- Or just start multiple instances lol
- I start each model in a separate server instance, using a different port for each model. Then I make an API call to whichever running server holds the model the array of messages is intended for. I can also use the very nice web UI that @ngxson made on any running server!
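A minimal sketch of that setup, assuming the stock `llama-server` binary with its `-m`/`--port` flags, its `/health` endpoint, and its OpenAI-compatible `/v1/chat/completions` endpoint; the model aliases, GGUF paths, and port numbers below are made-up examples, not anything from this thread:

```python
# Sketch: one llama-server process per model, each on its own port, plus a
# small helper that sends an OpenAI-style chat request to the right one.
# Model aliases, file paths, and ports are placeholder examples.
import json
import subprocess
import time
import urllib.error
import urllib.request

SERVERS = {
    # alias -> (GGUF file, port) -- adjust to your own models
    "llama-3-8b": ("models/llama-3-8b-instruct.Q4_K_M.gguf", 8080),
    "qwen-2.5-7b": ("models/qwen2.5-7b-instruct.Q4_K_M.gguf", 8081),
}

def start_servers():
    """Launch one llama-server instance per model."""
    return [
        subprocess.Popen(["llama-server", "-m", path, "--port", str(port)])
        for path, port in SERVERS.values()
    ]

def wait_ready(port, timeout=300):
    """Poll the server's /health endpoint until the model has finished loading."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"http://127.0.0.1:{port}/health") as r:
                if r.status == 200:
                    return
        except urllib.error.URLError:
            pass
        time.sleep(1)
    raise TimeoutError(f"server on port {port} never became ready")

def chat(alias, messages):
    """Send an OpenAI-compatible chat request to the server holding `alias`."""
    _, port = SERVERS[alias]
    req = urllib.request.Request(
        f"http://127.0.0.1:{port}/v1/chat/completions",
        data=json.dumps({"messages": messages}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    procs = start_servers()
    try:
        for _, port in SERVERS.values():
            wait_ready(port)
        print(chat("llama-3-8b", [{"role": "user", "content": "Hello!"}]))
    finally:
        for p in procs:
            p.terminate()
```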
- I saw this project on r/LocalLLaMA as a workaround: https://github.com/mostlygeek/llama-swap
- I would say either add a proxy server in front of it, or just use a project like llama-swap; then each server keeps its own config options. The proxy would only need some simple routing, some simple per-server queuing, and options such as starting a new model as a second server (if you have the memory available), and then you can just reuse all the existing server options.
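To make the routing idea concrete, here is a rough sketch of such a proxy (this is not llama-swap itself): it reads the `"model"` field of an OpenAI-style request and forwards the body to the llama-server instance registered for that model. The alias-to-port map and the listening port are assumptions, and a real version would also need the per-server queuing and on-demand startup mentioned above:

```python
# Minimal routing-proxy sketch: map the "model" field of an OpenAI-style
# request to a backend llama-server port and forward the request there.
# Ports and aliases are placeholders; queuing and error handling omitted.
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BACKENDS = {              # model alias -> backend llama-server port (assumed)
    "llama-3-8b": 8080,
    "qwen-2.5-7b": 8081,
}

class RoutingProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        model = json.loads(body).get("model", "")
        port = BACKENDS.get(model)
        if port is None:
            self.send_error(404, f"unknown model: {model}")
            return
        # Forward the request body unchanged to the chosen backend.
        req = urllib.request.Request(
            f"http://127.0.0.1:{port}{self.path}",
            data=body, headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            payload = resp.read()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    # Clients point at the proxy's port and select a backend via "model".
    HTTPServer(("127.0.0.1", 9000), RoutingProxy).serve_forever()
```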
- I haven't tried it, but there is also this: https://github.com/perk11/large-model-proxy. It looks like you can even run multiple backends and have them unload automatically on a least-recently-used basis, etc.
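For illustration only (this is not how large-model-proxy is implemented internally), the least-recently-used idea could look roughly like this: keep a bounded set of llama-server processes, start one on demand, and terminate the instance that was used longest ago when the budget is exceeded. Aliases, paths, and ports are placeholders:

```python
# LRU sketch: at most MAX_LOADED llama-server processes stay running;
# requesting a model not currently loaded evicts the least recently used one.
import subprocess
from collections import OrderedDict

MAX_LOADED = 2
MODELS = {  # alias -> (GGUF file, port); hypothetical examples
    "llama-3-8b": ("models/llama-3-8b-instruct.Q4_K_M.gguf", 8080),
    "qwen-2.5-7b": ("models/qwen2.5-7b-instruct.Q4_K_M.gguf", 8081),
    "mistral-7b": ("models/mistral-7b-instruct.Q4_K_M.gguf", 8082),
}

loaded: "OrderedDict[str, subprocess.Popen]" = OrderedDict()

def acquire(alias: str) -> int:
    """Ensure a server for `alias` is running and return its port."""
    model_path, port = MODELS[alias]
    if alias in loaded:
        loaded.move_to_end(alias)                    # mark as most recently used
        return port
    if len(loaded) >= MAX_LOADED:
        _lru_alias, proc = loaded.popitem(last=False)  # evict the LRU instance
        proc.terminate()
        proc.wait()
    loaded[alias] = subprocess.Popen(
        ["llama-server", "-m", model_path, "--port", str(port)])
    # In practice you would also wait for the new server to finish loading here.
    return port
```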
- Still seems like a nice feature, if the will to implement it is there.
- Is there a way to serve more than one model at the same time? (I don't think so.) Are there any plans to add this feature?