vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl #11166

Open
wants to merge 2 commits into master

Conversation

@jeffbolznv (Collaborator) commented Jan 9, 2025

Shaders are based on cpy.cu.

For #11127.

This supports converting from f32 to the same set of quants that CUDA supports. It looks like CUDA also supports OP_CPY for Q8_0 to F32, and for any quant to itself; I don't know if those are required, but they wouldn't be hard to add if so.
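For reference, the per-block logic these shaders port from cpy.cu is straightforward; here is a sketch of the f32 -> q8_0 case (simplified: ggml's real block_q8_0 stores the scale as fp16, and the shaders express this in GLSL rather than C++):

```cpp
// Sketch of the per-block f32 -> q8_0 conversion, mirroring cpy.cu's logic.
#include <math.h>
#include <stdint.h>

#define QK8_0 32

struct block_q8_0_sketch {
    float  d;           // per-block scale (fp16 in the real struct)
    int8_t qs[QK8_0];   // 32 quantized values
};

static void cpy_blck_f32_q8_0(const float * x, block_q8_0_sketch * dst) {
    float amax = 0.0f;                        // absolute max of the block
    for (int j = 0; j < QK8_0; j++) {
        amax = fmaxf(amax, fabsf(x[j]));
    }

    const float d  = amax / 127.0f;           // largest magnitude maps to +/-127
    const float id = d != 0.0f ? 1.0f/d : 0.0f;

    dst->d = d;
    for (int j = 0; j < QK8_0; j++) {
        dst->qs[j] = (int8_t) roundf(x[j]*id);
    }
}
```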

I haven't done any perf testing of these. CUDA is also using one thread per CTA, which sounds kind of slow but maybe it's not a perf-critical operation. In fact, the only testing I've done is test-backend-ops. I'll try to pull this into stable-diffusion.cpp to test.

CC @stduhpf

@jeffbolznv requested a review from 0cc4m January 9, 2025 20:52
@github-actions bot added the Vulkan and ggml labels Jan 9, 2025
@jeffbolznv (Collaborator, Author) commented

stable-diffusion.cpp "works" with this change, but loading the models is way slower and seems to be doing the quantization on the CPU. @stduhpf, any idea?

@stduhpf (Contributor) commented Jan 9, 2025

Thanks a lot! It seems to work very well for supported types.

> loading the models is way slower and seems to be doing the quantization on the CPU

Are you talking about quantization of the LoRA or of the base model? If it's the base model, I think this is the expected behaviour even without these changes, although it would be nice to quantize on the GPU now that the shaders exist for it. That path is even single-threaded, so it takes forever with larger models.

I didn't notice any slowness loading the LoRA, and it looks like it was using the GPU as expected.

@jeffbolznv (Collaborator, Author) commented

Yeah, I think it was the base model. Thanks for clarifying.

@slaren (Collaborator) commented Jan 10, 2025

In llama.cpp, F32 -> quant copies are needed for KV cache quantization, and quant -> F32 copies are used for context shifts when the K cache is quantized.
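Concretely, writing new K values into a quantized cache is a CPY from an f32 tensor into a view of the cache tensor; a rough sketch of the pattern (names, types, and the offset arithmetic here are simplified, not the exact llama.cpp code):

```cpp
// Rough sketch of landing f32 K values in a quantized cache via GGML_OP_CPY;
// the f32 -> quant kernels in this PR are what execute this copy on Vulkan.
#include "ggml.h"

static struct ggml_tensor * append_k(struct ggml_context * ctx,
                                     struct ggml_tensor  * k_cache, // e.g. GGML_TYPE_Q8_0
                                     struct ggml_tensor  * k_cur,   // GGML_TYPE_F32, new tokens
                                     int64_t               n_past) {
    // View of the cache region the new rows should land in
    // (byte offset = rows already present * bytes per quantized row).
    struct ggml_tensor * view = ggml_view_1d(ctx, k_cache,
            ggml_nelements(k_cur),
            n_past*ggml_row_size(k_cache->type, k_cache->ne[0]));

    // The copy itself quantizes f32 -> q8_0 on whichever backend runs the graph.
    return ggml_cpy(ctx, k_cur, view);
}
```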

@jeffbolznv (Collaborator, Author) commented

We also have GET_ROWS supporting dequantization. Does context shifting use CPY or GET_ROWS?

@slaren (Collaborator) commented Jan 10, 2025

It uses CPY.

@jeffbolznv (Collaborator, Author) commented

I started implementing q -> f32 and noticed there weren't any backend tests; when I added them, they failed because ggml_compute_forward_dup doesn't support q -> f32. Am I missing something?

  CPY(type_src=q4_0,type_dst=f32,ne=[256,4,4,4],permute=[0,0,0,0]): C:\github\jeffbolznv\llama.cpp\ggml\src\ggml-cpu\ggml-cpu.c:3996: fatal error
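The kind of case being added looks roughly like this (a sketch based on the failing case name above; the exact test_cpy constructor signature in tests/test-backend-ops.cpp may differ):

```cpp
// Hypothetical sketch of quant -> f32 CPY cases in tests/test-backend-ops.cpp;
// test_cases and test_cpy are the existing structures in that file.
for (ggml_type type : {GGML_TYPE_Q4_0, GGML_TYPE_Q4_1, GGML_TYPE_Q5_0,
                       GGML_TYPE_Q5_1, GGML_TYPE_Q8_0, GGML_TYPE_IQ4_NL}) {
    test_cases.emplace_back(new test_cpy(type, GGML_TYPE_F32, {256, 4, 4, 4}));
}
```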

@github-actions bot added the testing label Jan 10, 2025
@jeffbolznv (Collaborator, Author) commented

I've gone ahead and implemented the missing ggml-cpu function.
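At a high level, the CPU side only needs to dequantize each source row with the type's to_float trait; a simplified sketch of the idea (not the committed code; threading is ignored and contiguous src/dst are assumed):

```cpp
// Simplified sketch of a q -> f32 path for ggml_compute_forward_dup:
// dequantize one quantized source row at a time into the f32 destination.
#include "ggml.h"

static void dup_q_to_f32_rows(const struct ggml_tensor * src, struct ggml_tensor * dst) {
    const ggml_to_float_t to_float = ggml_get_type_traits(src->type)->to_float;
    const int64_t nr  = ggml_nrows(src);   // total number of rows
    const int64_t ne0 = src->ne[0];        // elements per row

    for (int64_t ir = 0; ir < nr; ir++) {
        const char * src_row = (const char *) src->data + ir*src->nb[1];
        float      * dst_row = (float *)      dst->data + ir*ne0;
        to_float(src_row, dst_row, ne0);   // expand QK-sized blocks to floats
    }
}
```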
