Replies: 4 comments 23 replies
-
Bump, trying to understand this as well...
-
The For example, if we specify
Simply put, if we want to be handling Since
No. Each sequence has its own context. The tokens from each sequence "see" only the tokens from that same sequence. This is achieved with the Lines 6328 to 6368 in dd5ae06 Each Another great benefit is that different sequences can share a common prompt without any extra compute. All it takes is to assign multiple sequence ids to the common tokens in the KV cache. A basic example is a system prompt of Together with the simplicity and advantages of this implementation, there are a few disadvantages:
In order to resolve these, I think we should add a standard attention implementation where each sequence has its own KV cache buffer and the attention is computed separately. This way, users would be able to choose which implementation to use based on their specific use case.
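To make the shared-cache idea concrete, here is a minimal Python sketch of a single KV cache whose cells carry sequence ids. It is not llama.cpp's code: the `Cell` structure and `build_mask` helper are hypothetical names, and it assumes isolation is enforced by an attention mask derived from those ids plus causal ordering.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    """One cell of the shared KV cache (hypothetical structure for illustration)."""
    pos: int                                   # position of the token within its sequence
    seq_ids: set = field(default_factory=set)  # sequences allowed to attend to this cell


def build_mask(cells, batch):
    """Return mask[i][j] = True when batch token i may attend to cache cell j.

    `batch` is a list of (seq_id, pos) pairs for the tokens being processed.
    A token sees a cell only if the cell carries its sequence id and the cell
    is not in the token's future (causal attention).
    """
    return [
        [seq_id in cell.seq_ids and cell.pos <= pos for cell in cells]
        for seq_id, pos in batch
    ]


# A 2-token prompt shared by sequences 0 and 1: stored once, tagged with both ids.
cells = [Cell(0, {0, 1}), Cell(1, {0, 1})]
# Each sequence then appends its own token at position 2.
cells.append(Cell(2, {0}))
cells.append(Cell(2, {1}))

# Mask for the two newest tokens, one per sequence.
mask = build_mask(cells, [(0, 2), (1, 2)])
for row in mask:
    print(row)
```

Both rows are True for the two shared prompt cells, but each sequence is blind to the other's private token. The shared prompt occupies two cells rather than four, which is where the "no extra compute" benefit comes from in this toy model.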
-
If I understand correctly, I'm told that setting -c beyond the model's context window size degrades output quality. However, does it make a difference in parallel mode?
-
Is there a standard implementation of KV cache that allows each request to compute its own KV cache independently, rather than sharing one for all requests? I am currently trying to observe the performance of multi-request inference while maintaining a constant KV cache size for each sequence (i.e., actively adjusting n_ctx so that n_ctx / np remains constant). When increasing the value of np, the performance drops sharply. I suspect this is due to the design of the KV cache, which causes the computation of KQV to grow quadratically, rather than linearly.
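A rough way to check whether that suspicion is plausible is to count how many cache cells get touched per decoding step. The sketch below is only a back-of-the-envelope model, not a measurement: it assumes the worst case where, with a shared cache, every token's attention is computed over the whole n_ctx buffer (other sequences' cells masked out but still scanned). The function name is made up for illustration.

```python
def attended_cells_per_step(n_parallel, ctx_per_seq, unified):
    """Cache cells touched when decoding one token for each of n_parallel sequences.

    Assumption: a unified cache scans all n_ctx cells per token, while
    per-sequence caches scan only that sequence's ctx_per_seq cells.
    """
    n_ctx = n_parallel * ctx_per_seq  # n_ctx / np is held constant
    cells_per_token = n_ctx if unified else ctx_per_seq
    return n_parallel * cells_per_token


for n_parallel in (1, 2, 4, 8):
    shared = attended_cells_per_step(n_parallel, 256, unified=True)
    separate = attended_cells_per_step(n_parallel, 256, unified=False)
    print(n_parallel, shared, separate)
```

Under that assumption the shared-cache count grows with np squared while the per-sequence count grows linearly, which would match the observed drop. Real throughput also depends on batching and memory bandwidth, so this shows only the trend.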
-
Hi All,
I'm seeking clarity on the functionality of the --parallel option in /app/server, especially how it interacts with the --cont-batching parameter. My specific observation involves setting --ctx-size to 8192 and --parallel to 32. From the logs, it appears there are 32 slots, each handling a context segment of 256. My question is: Does this configuration imply that each slot processes a distinct segment of the context?
For instance, if I input 32 instances of an identical prompt with a length of 4096, would the first half of the slots remain idle due to the prompt already existing in the KV cache? This leads to confusion, as it seems the total number of submittable jobs is limited by the slot count. This is puzzling because different prompts might rely on the same slot for varied tasks. If the initial segment of the context is identical for multiple prompts, that segment might not require processing, as it's already in the KV cache.
I'm trying to understand the rationale behind dividing the context into segments when batching. Could you provide an explanation of how the --parallel and --cont-batching options function?
References: server.cpp dividing the n_ctx and calling llama_batch_init(n_ctx, 0, params.n_parallel);
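For reference, the numbers in the logs line up with a plain even split of the context across slots. The snippet below is just that arithmetic in Python, assuming integer division. The helper name is made up, and how the server actually reacts to a prompt longer than a slot is not something this sketch tries to model.

```python
def ctx_per_slot(ctx_size, n_parallel):
    """Context cells available to each slot, assuming an even integer split."""
    return ctx_size // n_parallel


n_slot_ctx = ctx_per_slot(8192, 32)      # --ctx-size 8192, --parallel 32
prompt_len = 4096
prompt_fits = prompt_len <= n_slot_ctx   # compare prompt length with the slot budget

print(n_slot_ctx, prompt_fits)
```

With these settings each slot gets 256 cells, so a 4096-token prompt is far larger than a single slot's share. To give every slot room for 4096 tokens with 32 slots, --ctx-size would need to be at least 32 * 4096 = 131072 under the same even-split assumption.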