xe: sdpa: Improve performance of quantization with better alignment and prefetching #2322
base: main
Conversation
Commit `make test`: force-pushed from 9af6de9 to 4746108
src/gpu/intel/ocl/micro_sdpa.cl
Outdated
/* n_sg */ sg_per_wg,
/* sg_size */ SUBGROUP_SIZE,
/* cache */ LSC_LDCC_L1C_L3C);
//return;
Does it improve performance to have the first K tile prefetch here (before loading Q)? IIRC in my earlier testing it was better to delay the first K tile prefetch until after issuing the Q load.
I saw a slight improvement by moving that forward, but I want to test with a larger set of examples. I am compiling those results now and will post them.
/* sg_id */ sg_ij,
/* n_sg */ sg_per_wg,
/* sg_size */ SUBGROUP_SIZE,
/* cache */ LSC_LDCC_L1C_L3C);
This tile is so small that it doesn't need cooperative prefetching (hence the earlier simpler logic). Does this change improve performance?
It looked like all subgroups would prefetch the same memory region with the simpler call, because every subgroup is assigned the same sg_id (0). I didn't think the compiler would be able to gate the other subgroups from executing the prefetch operation. Am I interpreting that incorrectly?
I think I saw a ~5% gain for certain sizes, but I can test it again and post my results with the rest of the changes.
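To make the point concrete, here is a minimal standalone C sketch of the concern (the values of N_SG and TILE_BYTES are hypothetical, not taken from the kernel): when every subgroup is handed sg_id 0, all subgroups compute the same prefetch range, whereas offsetting by the real sg_id splits the tile cooperatively.

```c
#include <stdio.h>

/* Hypothetical parameters for illustration only. */
#define N_SG 8          /* subgroups per work-group */
#define TILE_BYTES 4096 /* total bytes to prefetch for the tile */

int main(void) {
    int per_sg = TILE_BYTES / N_SG;
    for (int sg = 0; sg < N_SG; ++sg) {
        /* sg_id forced to 0: every subgroup computes the same range. */
        int dup_start = 0 * per_sg;
        /* Cooperative split: each subgroup offsets its range by its own sg_id. */
        int coop_start = sg * per_sg;
        printf("sg %d: same-sg_id range [%d, %d)  cooperative range [%d, %d)\n",
               sg, dup_start, dup_start + per_sg, coop_start, coop_start + per_sg);
    }
    return 0;
}
```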
Right, the original code has all subgroups doing the prefetch. It's a bit of a tradeoff: with all subgroups doing the prefetch, there's likely some additional overhead in LSC keeping track of the outstanding prefetches. On the other hand, with cooperative prefetch, we're just relying on the timing being right, since there's no barrier between this prefetch and the mask load.
But all that said, if cooperative prefetch shows better performance, let's use it.
Here is the difference between the previous version and the current version of the mask prefetch for broadcasted masks.
| shape | k | ks | kzp | q | msk | v | vs | vzp | file | baseline | new time | speedup |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| --in-shapes | 1x1x128x384*abcd | 1x1x2x384 | 1x1x2x384 | 1x1x384x128 | 1x1x1x384 | 1x1x384x128 | 1x1x384x2 | 1x1x384x2 | --case=complexfusion/mha/sdpa-0ks8f16s8-3qf16-wscale-wmask-6vs8f16s8.json | 0.03088 | 0.02864 | 1.0782123 |
| --in-shapes | 1x1x128x384*abdc | 1x1x2x384 | 1x1x2x384 | 1x1x384x128 | 1x1x1x384 | 1x1x384x128 | 1x1x384x2 | 1x1x384x2 | --case=complexfusion/mha/sdpa-0ks8f16s8-3qf16-wscale-wmask-6vs8f16s8.json | 0.02752 | 0.02752 | 1.0000000 |
| --in-shapes | 1x1x128x512*abcd | 1x1x2x512 | 1x1x2x512 | 1x1x512x128 | 1x1x1x512 | 1x1x512x128 | 1x1x512x2 | 1x1x512x2 | --case=complexfusion/mha/sdpa-0ks8f16s8-3qf16-wscale-wmask-6vs8f16s8.json | 0.03184 | 0.03056 | 1.0418848 |
| --in-shapes | 1x1x128x512*abdc | 1x1x2x512 | 1x1x2x512 | 1x1x512x128 | 1x1x1x512 | 1x1x512x128 | 1x1x512x2 | 1x1x512x2 | --case=complexfusion/mha/sdpa-0ks8f16s8-3qf16-wscale-wmask-6vs8f16s8.json | 0.0312 | 0.02976 | 1.0483871 |
| --in-shapes | 1x1x128x1024*abcd | 1x1x2x1024 | 1x1x2x1024 | 1x1x1024x128 | 1x1x1x1024 | 1x1x1024x128 | 1x1x1024x2 | 1x1x1024x2 | --case=complexfusion/mha/sdpa-0ks8f16s8-3qf16-wscale-wmask-6vs8f16s8.json | 0.0584 | 0.05696 | 1.0252809 |
| --in-shapes | 1x1x128x1024*abdc | 1x1x2x1024 | 1x1x2x1024 | 1x1x1024x128 | 1x1x1x1024 | 1x1x1024x128 | 1x1x1024x2 | 1x1x1024x2 | --case=complexfusion/mha/sdpa-0ks8f16s8-3qf16-wscale-wmask-6vs8f16s8.json | 0.05632 | 0.05504 | 1.0232558 |
| --in-shapes | 1x1x128x2048*abcd | 1x1x2x2048 | 1x1x2x2048 | 1x1x2048x128 | 1x1x1x2048 | 1x1x2048x128 | 1x1x2048x2 | 1x1x2048x2 | --case=complexfusion/mha/sdpa-0ks8f16s8-3qf16-wscale-wmask-6vs8f16s8.json | 0.11568 | 0.1192 | 0.97046980 |
| --in-shapes | 1x1x128x2048*abdc | 1x1x2x2048 | 1x1x2x2048 | 1x1x2048x128 | 1x1x1x2048 | 1x1x2048x128 | 1x1x2048x2 | 1x1x2048x2 | --case=complexfusion/mha/sdpa-0ks8f16s8-3qf16-wscale-wmask-6vs8f16s8.json | 0.10832 | 0.10608 | 1.0211161 |
Force-pushed from 4746108 to 1d2375a
src/gpu/intel/ocl/tile_ops.h
Outdated
-const uint cl_per_sg = (cl + n_sg - 1) / n_sg;
-const uint cl_iters = (cl_per_sg + sg_size - 1) / sg_size;
+const uint cl_per_sg = (cl + sg_size - 1) / sg_size;
+const uint cl_iters = (cl_per_sg + n_sg - 1) / n_sg;
Can you explain what's going on in this patch? This doesn't look right.
cl_per_sg was using the number of subgroups instead of the subgroup size to calculate the cache lines per subgroup.
The main difference in this commit is that multiple subgroups were prefetching the same memory region because the i_cl indexing was not offsetting across subgroups. This increased the number of iterations, and I believe some cache lines were skipped as a result.
> cl_per_sg was using the number of subgroups instead of the subgroup size to calculate the cache lines per subgroup.
The existing code has the expected behavior. We're gathering up all the cache lines to prefetch (`cl`), then splitting them among the `n_sg` available subgroups, leaving `cl_per_sg` cache lines per subgroup. Then, we're splitting up `cl_per_sg` between the work-items in the subgroup.
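As a concrete illustration of that split, here is a minimal standalone C sketch with example values (cl, n_sg, and sg_size are hypothetical, not taken from the kernel) showing the two ceiling divisions in the original code:

```c
#include <stdio.h>

int main(void) {
    /* Example values for illustration only. */
    unsigned cl = 100;     /* total cache lines to prefetch */
    unsigned n_sg = 8;     /* subgroups in the work-group */
    unsigned sg_size = 16; /* work-items per subgroup */

    /* Split the cache lines among subgroups (ceiling division)... */
    unsigned cl_per_sg = (cl + n_sg - 1) / n_sg;             /* 13 */
    /* ...then split each subgroup's share across its work-items. */
    unsigned cl_iters = (cl_per_sg + sg_size - 1) / sg_size; /* 1 */

    printf("cl_per_sg = %u, cl_iters = %u\n", cl_per_sg, cl_iters);
    return 0;
}
```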
> The main difference in this commit is that multiple subgroups were prefetching the same memory region because the i_cl indexing was not offsetting across subgroups
Ah yes, it wasn't offsetting properly, thanks for catching that. I think you can quickly fix it by reverting the patch to these lines (617-618) and applying the patch I suggested below.
src/gpu/intel/ocl/tile_ops.h
Outdated
uint i_cl = ii_cl * cl_per_sg * sg_size + (sg_id * sg_size)
        + get_sub_group_local_id();
-uint i_cl = ii_cl * cl_per_sg * sg_size + (sg_id * sg_size)
-        + get_sub_group_local_id();
+uint i_cl = (ii_cl + (sg_id * cl_per_sg)) * sg_size + get_sub_group_local_id();
Shouldn't it be:
-uint i_cl = ii_cl * cl_per_sg * sg_size + (sg_id * sg_size)
-        + get_sub_group_local_id();
+uint i_cl = (ii_cl * cl_per_sg + sg_id) * sg_size + get_sub_group_local_id();
Otherwise the second iteration will only offset by sg_size, which will overlap with iteration zero of subgroup one.
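To compare the two suggestions side by side, a standalone C sketch along these lines (the values for n_sg, sg_size, cl_per_sg, and cl_iters are placeholders, not taken from the kernel) enumerates the cache-line indices each formula generates, so overlaps and gaps between subgroups can be inspected directly; it does not by itself decide which variant is correct.

```c
#include <stdio.h>

int main(void) {
    /* Placeholder values for illustration only. */
    unsigned n_sg = 2, sg_size = 4;
    unsigned cl_per_sg = 8; /* cache lines assigned to each subgroup */
    unsigned cl_iters = 2;  /* iterations each subgroup performs */

    for (unsigned sg_id = 0; sg_id < n_sg; ++sg_id) {
        for (unsigned ii_cl = 0; ii_cl < cl_iters; ++ii_cl) {
            for (unsigned lid = 0; lid < sg_size; ++lid) {
                /* First suggestion: subgroup block offset of sg_id * cl_per_sg
                   (in units of sg_size), iterations advance by sg_size. */
                unsigned a = (ii_cl + (sg_id * cl_per_sg)) * sg_size + lid;
                /* Second suggestion: iterations advance by cl_per_sg * sg_size,
                   subgroups are offset by sg_size. */
                unsigned b = (ii_cl * cl_per_sg + sg_id) * sg_size + lid;
                printf("sg %u iter %u lid %u: formula A -> %2u, formula B -> %2u\n",
                       sg_id, ii_cl, lid, a, b);
            }
        }
    }
    return 0;
}
```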
Force-pushed from 1d2375a to 675e3c8
Description
This PR improves the performance of the micro SDPA kernel by using prefetching and by setting better alignment when generating the microkernels. The change has a significant impact on certain sizes, with relative performance ranging from 0.89x to 1.26x compared to the original version.