Modify small-batched weight only quantization #2213
base: main
Update MLP branch with upstream
* Update TensorRT-LLM --------- Co-authored-by: Shixiaowei02 <[email protected]>
Found failing unit test cases with small load chunks.
TODO: The increased instruction count hides the shared-memory advantage.
Great work! Which quantization type did you benchmark the GEMV with: channelwise or groupwise, and 4-bit or 8-bit?
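For readers unfamiliar with the distinction asked about here, the following is a minimal CPU reference sketch of a small-batch weight-only GEMV. It is not the TRT-LLM kernel. The function name `weight_only_gemv`, the row-major `[K][N]` int8 weight layout, and the `[K / group_size][N]` scale layout are all assumptions made for illustration. Setting `group_size` equal to `K` gives channelwise scaling (one scale per output column), while a smaller `group_size` gives groupwise scaling.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference weight-only GEMV: y[m][n] = sum_k x[m][k] * (w[k][n] * scale[k / group_size][n]).
// Hypothetical layouts (assumptions, not the TRT-LLM layouts):
//   x      : [M][K] float activations, row-major
//   w      : [K][N] int8 quantized weights, row-major
//   scales : [K / group_size][N] float scales, row-major
// group_size == K is channelwise; group_size < K is groupwise.
std::vector<float> weight_only_gemv(const std::vector<float>& x,
                                    const std::vector<std::int8_t>& w,
                                    const std::vector<float>& scales,
                                    std::size_t M, std::size_t K, std::size_t N,
                                    std::size_t group_size) {
    std::vector<float> y(M * N, 0.0f);
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t k = 0; k < K; ++k) {
            const std::size_t g = k / group_size;  // scale group for this k
            const float xv = x[m * K + k];
            for (std::size_t n = 0; n < N; ++n) {
                // Dequantize on the fly, then accumulate.
                const float wv = static_cast<float>(w[k * N + n]) * scales[g * N + n];
                y[m * N + n] += xv * wv;
            }
        }
    }
    return y;
}
```

A 4-bit variant would differ only in how `w` is unpacked (two values per byte) before the multiply by the scale; the scale indexing is the same.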
These are the results from the
Thank you for your excellent work. I am the author of the batched GEMV kernel in TRT-LLM. My colleagues and I have reviewed and benchmarked your modifications in this PR. We had previously tried a similar approach, but it didn't yield significant benefits at that time. We validated the kernel latency with your modifications on different shapes on the H100 but found that there was a performance regression in some shapes. Considering that we have other optimization work for this part of the code in progress, we are unable to merge your changes at this time. Could you please provide benchmark data comparing the kernel latency before and after your changes for different shapes (for example, m=1, 2, 3, 4 and n/k=2048, 4096, 8192, 12288, 16384) under the GPTQ/AWQ case on both A100 and H100?
Hi author, what do you think of the idea of using async copy in GEMV? GEMV is a memory-bound operation; will async copy boost its performance?
Yes, in my previous experiments, I came to a similar conclusion. If the tileMNK is not large enough, there might not be sufficient computation and LDS to hide the latency of copy_async. Furthermore, in GEMV cases with small batch sizes, the data often fits within the registers.
I've found that the small-batch weight-only GEMV suffers from global memory load stalls in some inefficient cases.
This PR uses the shared memory in this case to
It had little or no effect on the small GEMV, but had some effect on the large GEMV.
Below is the percentage reduction in GEMV computation time, summed over the 5 types of GEMV kernel in the first decoding stage.
Tested on single A100 40GB and H100 80GB, batch size 4, context length <= 512.