The case for making /slots and other monitoring endpoints stable #11040
mcharytoniuk started this conversation in Ideas
Hello!
I am coming from the perspective of Paddler.
Previously, I nagged you about a notification mechanism for changes in the server API, and you created a tracking issue for it, for which I am extremely grateful: #9291 😃
Now I would like to bring up another topic for discussion: slots.
We have been running llama.cpp behind a load balancer for some time now and it works well. I think the setup is stabilizing overall and can actually be used in production, even though the llama.cpp API is still a bit unstable. The slot-based balancing mechanism was also a good idea overall.
My case is this: I think llama.cpp should expose some stable endpoints for its server internals, so that tools like Paddler and other monitoring infrastructure can plug into it without embedding the entirety of llama.cpp into the project. If that is acceptable, maybe we can start with /slots? So far, what we need is the number of free/occupied slots out of those configured with --parallel.
We are also working on a tool to deploy and manage entire fleets of llama.cpp instances, so having that kind of monitoring endpoint is really important to us. If not REST, maybe we could add some kind of IPC? (just throwing ideas around)
If you are fine with adding that, I will be happy to help with and maintain such endpoints; let us first figure out something that works for both llama.cpp and the ecosystem.
Thank you for llama.cpp!