Matrix Multiplication (Matmul) Improvements:
Refactored the matmul configuration for better kernel selection and performance tuning. Added support for batch operations, double buffering, and pipelined processing to improve throughput and efficiency (a sketch of the double-buffering idea follows the list below). Implemented customizable cube dispatch for non-square matrices and introduced heuristics for kernel selection.
- Matmul components by @louisfd in #220
- more precision required by @louisfd in #224
- Matmul/transpose loader by @louisfd in #230
- Matmul: refactor configs by @louisfd in #233
- Matmul/tensor reference by @louisfd in #238
- Matmul: CPU reference with same precisions by @louisfd in #237
- Matmul: batch one_to_many + refactor configuration by @louisfd in #242
- Matmul: slice level by @louisfd in #246
- Matmul: reuse accumulator by @louisfd in #257
- Different seeds for lhs/rhs by @louisfd in #263
- Matmul: customizable cube dispatch by @louisfd in #273
- Matmul: double buffering by @louisfd in #271
- Matmul: minor refactor by @louisfd in #275
- Matmul: small refactoring to allow easier kernel selection by @louisfd in #276
- Matmul: Some refactor by @louisfd in #285
- Matmul: check input bounds by @louisfd in #288
- Matmul: More comptime in accelerated tile by @louisfd in #290
- Matmul: batch broadcast by @louisfd in #306
- Matmul: fix transposed (swizzle) dispatch for non square matrix by @louisfd in #307
- Refactor + Profile Matmul by @nathanielsimard in #292
- Matmul: tilewise loader by @louisfd in #310
- [Feat] Ground work to make GEMM components usable for convolution by @wingertge in #309
- Matmul: kernel select heuristic by @louisfd in #312
- Cast fragment by @nathanielsimard in #311
- Matmul: better error message (minor PR) by @louisfd in #313
- Matmul/auto by @nathanielsimard in #316
- Matmul: pipelined double buffering by @louisfd in #323
- Add Support for cast instruction in hip wmma intrinsic compiler by @syl20bnr in #317
- Return result from matmul launch function by @nathanielsimard in #340
- Some fixes in matmul by @louisfd in #350
- Remove unneeded includes function in WmmaCompiler trait by @syl20bnr in #351
- Improve + refactor matmul by @nathanielsimard in #365
- Fix/matmul plane size 64 by @nathanielsimard in #378
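The double buffering (#271) and pipelined double buffering (#323) changes overlap tile loads with computation. As referenced above, here is a minimal sketch of the ping-pong buffer idea in plain Rust; it is not CubeCL's kernel code, and matmul_double_buffered, tile_size, and compute_tile are illustrative names only (on the GPU the staging copy is asynchronous rather than sequential).

```rust
// Illustrative ping-pong double buffering: while the tile loaded on the
// previous iteration is consumed, the next tile is staged into the other
// buffer. In the real GPU kernels the staging copy is an asynchronous
// shared-memory load; plain slices are used here only to show the pattern.
fn matmul_double_buffered(
    lhs: &[f32],
    tile_size: usize,
    num_tiles: usize,
    mut compute_tile: impl FnMut(&[f32]),
) {
    if num_tiles == 0 {
        return;
    }
    // Two staging buffers that swap roles every iteration.
    let mut buffers = [vec![0.0f32; tile_size], vec![0.0f32; tile_size]];
    // Preload the first tile before entering the main loop.
    buffers[0].copy_from_slice(&lhs[0..tile_size]);

    for i in 0..num_tiles {
        let (current, next) = (i % 2, (i + 1) % 2);
        // Stage the next tile while the current one is being consumed.
        if i + 1 < num_tiles {
            let start = (i + 1) * tile_size;
            buffers[next].copy_from_slice(&lhs[start..start + tile_size]);
        }
        // Accumulate using the tile loaded on the previous iteration.
        compute_tile(&buffers[current][..]);
    }
}

fn main() {
    let lhs: Vec<f32> = (0..8).map(|x| x as f32).collect();
    let mut acc = 0.0f32;
    matmul_double_buffered(&lhs, 4, 2, |tile| acc += tile.iter().sum::<f32>());
    println!("accumulated: {acc}");
}
```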
New Crate for Reduce Kernels
This release introduces a new crate (cubecl-reduce) containing optimized reduce kernels that work across all supported platforms; a minimal sketch of the reduction pattern follows the list below.
- Implement sum as a reduce for vector by @maxtremblay in #264
- Reduce on cuda by @maxtremblay in #274
- Implement a reduction across lines by @maxtremblay in #280
- Query num planes by @maxtremblay in #294
- Improve tests for cubecl-reduce by @maxtremblay in #299
- Import reduce naive from burn by @maxtremblay in #314
- Import reduce shared by @maxtremblay in #329
- Implement a plane reduction by @maxtremblay in #336
- major refactor of reduce by @maxtremblay in #349
- Reduce plane by @maxtremblay in #359
- Reduce shared unit by @maxtremblay in #363
- Reduce shared plane by @maxtremblay in #369
- Merge reduce by @maxtremblay in #402
- Reduce stride by @maxtremblay in #408
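As a reference point for the new crate, the sketch below shows the general shape of a line-wise sum reduction in plain Rust. It only illustrates the pattern the kernels parallelize; reduce_sum_lines and line_size are hypothetical names, not the cubecl-reduce API.

```rust
// Illustrative line-wise sum: each line of `line_size` contiguous elements is
// reduced independently. The cubecl-reduce kernels parallelize this across
// units and planes; this sequential version only shows the access pattern.
fn reduce_sum_lines(input: &[f32], line_size: usize) -> Vec<f32> {
    input
        .chunks(line_size)
        .map(|line| line.iter().sum())
        .collect()
}

fn main() {
    let data: [f32; 6] = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    // Two lines of three elements each -> [6.0, 15.0].
    println!("{:?}", reduce_sum_lines(&data, 3));
}
```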
Compiler and Runtime Optimizations:
Refactored the SPIR-V and HIP compilers with support for new features such as WMMA intrinsics and improved debug information. Enhanced WebGPU support with better sync mechanisms and hardware property queries. Added support for compile-time (comptime) constants and improved code generation for various architectures; a rough analogy of how comptime constants enable specialization follows the list below.
- Remove leftover dead phi nodes from branch elimination by @wingertge in #225
- Ensure LoopBreak is updated when merging blocks by @wingertge in #228
- Perf/cuda fence by @nathanielsimard in #232
- Refactor wgpu with stream by @nathanielsimard in #245
- Add HIP wmma intrinsic compiler by @syl20bnr in #279
- Read many buffers at once by @nathanielsimard in #277
- Replace pointer magic with reinterpret_cast by @maxtremblay in #281
- Use CudaArchitecture struct to mirror HIP implementation by @syl20bnr in #287
- Add support for 64 wavefront size in HIP compiler by @syl20bnr in #282
- Feat: comptime fields by @nathanielsimard in #338
- More comptime support by @nathanielsimard in #344
- Refactor for wgpu v23 compatibility with an example of wgpu device sharing by @AsherJingkongChen in #211
- Port fence and read many buffers at once to HIP runtime by @syl20bnr in #348
- ROCm 6.3.0 HIP bindings update by @syl20bnr in #362
- [Feat] SPIR-V debug info by @wingertge in #356
- Make rocwmma compiler the default as it covers more AMD architectures by @syl20bnr in #366
- Disable WMMA compiler on CDNA GPUs and update naming accordingly by @syl20bnr in #367
- Feat/ virtual tensor by @nathanielsimard in #380
- [Feat] Expanded SPIR-V debug info by @wingertge in #368
- Remove HIP context from runtime by @syl20bnr in #375
- Add two of the missing bit operations that have hardware acceleration by @wingertge in #391
- [Feat] Rework allocator by @wingertge in #401
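The comptime additions (#338, #344) let values be fixed at kernel compilation time so the generated code can be specialized. The snippet below is only a rough analogy using Rust const generics, not the CubeCL comptime syntax; scale_block and BLOCK are illustrative names.

```rust
// Analogy only: the const generic plays the role of a comptime value, so the
// compiler can specialize the function (unroll the loop, fold the bound)
// instead of reading the block size from a runtime argument.
fn scale_block<const BLOCK: usize>(data: &mut [f32; BLOCK], factor: f32) {
    for x in data.iter_mut() {
        *x *= factor;
    }
}

fn main() {
    let mut block = [1.0f32; 8];
    // BLOCK = 8 is known at compile time; each block size used gets its own
    // monomorphized copy of the function.
    scale_block(&mut block, 2.0);
    println!("{:?}", block);
}
```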
New Functionalities:
Added support for more instructions, including bitwise and atomic operations, along with broader type support; an element-wise sketch follows the list below.
- Allow for overriding device, return device setup by @ArthurBrussee in #210
- Feat/more types support by @wingertge in #207
- Add subcube and mma support for HIP compiler by @syl20bnr in #219
- Allow creating kernels with no runtime by @wingertge in #229
- Add topology properties to client by @wingertge in #244
- [Fix] Merge vectorized tf32 with f32 by @wingertge in #253
- Subcube elect by @maxtremblay in #259
- feat: bitwise ops implementation for line by @quinton11 in #284
- [Feat] Make atomics generic and add comprehensive support for different types by @wingertge in #406
- patch: extending the int trait with not ops by @quinton11 in #411
- feat: not-trait-impl-for-int-types by @quinton11 in #412
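For the element-wise integer work (bitwise ops on lines in #284 and the not ops in #411/#412), the sketch below illustrates lane-wise bitwise not over a small fixed-width vector in plain Rust; Line4 is a hypothetical stand-in, not the CubeCL Line type.

```rust
use std::ops::Not;

// Hypothetical 4-wide integer line. The CubeCL Line type is generic over the
// element type and is lowered to vectorized GPU instructions; a fixed array
// is used here only to show the lane-wise semantics.
#[derive(Debug, Clone, Copy)]
struct Line4([u32; 4]);

impl Not for Line4 {
    type Output = Line4;

    // Bitwise not applied independently to every lane.
    fn not(self) -> Line4 {
        Line4(self.0.map(|x| !x))
    }
}

fn main() {
    let line = Line4([0, 1, u32::MAX, 0xF0F0_F0F0]);
    println!("{:?}", !line);
}
```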
Bug Fixes
Fixed various issues with autotuning, particularly for WASM and CUDA environments.
Resolved visibility issues with implementation functions in macros. Addressed multiple synchronization and compilation bugs across different runtime environments. Corrected handling of specific data types and operations in SPIR-V, WGSL, and CUDA.
- Force the staging buffer in read to be aligned by @wingertge in #208
- Fix visibility of impl fns getting dropped by the cube macro by @jbelanich in #212
- Fix asynchronous autotuning for wasm by @ArthurBrussee in #213
- Fix setup for SPIR-V by @wingertge in #215
- Fix/wgsl extension by @nathanielsimard in #221
- Fix device creation for wasm by @ArthurBrussee in #218
- Fix sign conversion in SPIR-V by @wingertge in #223
- Fix: Autotune cache miss by @nathanielsimard in #235
- Refactor metadata to make it more simple and flexible by @wingertge in #236
- Fix exclusive memory only and custom wgsl kernels by @wingertge in #240
- Fix sync with CUDA by @nathanielsimard in #247
- Write enable subgroups on the web by @ArthurBrussee in #251
- Fix autotune compilation on WASM by @ArthurBrussee in #252
- Add max_bindings hardware limit by @ArthurBrussee in #256
- Fix/wgpu queue by @nathanielsimard in #262
- Fix optimizer bug with index merge by @wingertge in #265
- Add missing mma.h include in wmma CUDA compiler by @syl20bnr in #289
- Propagate kernel name via KernelSettings by @torsteingrindvik in #278
- Fix hardware properties for WebGPU by @ArthurBrussee in #300
- Consistent OOB behaviour for wgpu by @ArthurBrussee in #296
- Fix/bool cast by @wingertge in #303
- Add lineinfo flag to CUDA source compilation options by @torsteingrindvik in #305
- Fix CMMA stride on SPIR-V by @wingertge in #319
- Fix reserved kernel variable name. by @nathanielsimard in #318
- Fix Clippy + CI by @nathanielsimard in #321
- Fix cuda fragment cast bf16 by @nathanielsimard in #320
- fix infinite loop by @nathanielsimard in #325
- Fix matmul simple by @nathanielsimard in #326
- Fix xtask command with last version by @syl20bnr in #327
- Fix/ci by @nathanielsimard in #331
- Fix errors on system with multiple AMD CDNA GPUs by @syl20bnr in #337
- Fix test_buffer_len_offset test on MacOS by @syl20bnr in #342
- Fix find_vectorization broadcasting for line size 1 by @laggui in #347
- Fix constant arrays in SPIR-V by @wingertge in #346
- Fix include of rocwmma headers when using HIP wmma intrinsic compiler by @syl20bnr in #361
- Make spirv feature compile on macOS by @syl20bnr in #301
- Don't panic when readback channel is closed by @ArthurBrussee in #352
- Fix HIP runtime sync and alignment to match with CUDA runtime by @syl20bnr in #371
- Add more primitives to list in when parsing Expression. by @torsteingrindvik in #384
- Fix for optimizer panic in some matmul kernels by @wingertge in #392
- Fix wgsl integer format + cpp checked index by @nathanielsimard in #396
- Check for CUDA distribution installation. by @vaijira in #394
- Fix magnitude cuda by @maxtremblay in #398
- Fix Neg operation in CUDA. by @vaijira in #409
- Flush when too many handles are locked by @nathanielsimard in #405
- Fix reduce cube count too big by @nathanielsimard in #423
Refactoring
Significant refactoring of the IR (Intermediate Representation) for cleaner, more maintainable code.
Streamlined autotune processes and simplified the optimizer for better extensibility.
Updated and cleaned up the codebase to align with newer versions of Rust and its ecosystem.
- IR Refactor by @wingertge in #199
- Simplify autotune, skip if only 1 option by @ArthurBrussee in #196
- Refactor/slice lifetime by @nathanielsimard in #255
- Rename subcube to plane by @louisfd in #258
- add missing argument to cuda compiler in test common by @maxtremblay in #298
- Feat/typemap by @nathanielsimard in #389
- Clean up IR slightly by merging comments into NonSemantic IR type by @wingertge in #390
- Refactor optimizer for maintainability/extensibility by @wingertge in #393
Documentation & User Experience
Enhanced error messages, particularly for matrix operations, providing clearer feedback on issues. Added documentation to support users in understanding new features and configurations. Implemented user hints for deriving traits and using extensions in kernel functions.
- Adjust naming and comments by @AsherJingkongChen in #214
- Dot product of numeric lines + minor stuff by @louisfd in #227
- Add Runtime trait method to get default kernel/shader extension by @torsteingrindvik in #267
- Inform users that launch kernels cannot have return types by @torsteingrindvik in #268
- Hint users towards the possibility of deriving CubeType and CubeLaunch… by @torsteingrindvik in #272
- Kernel names with float suffix by @torsteingrindvik in #302
- Update some CUDA documentation and extract CUDA_MAX_BINDINGS by @Pencilcaseman in #308
- Implement Tensor::coordinate method by @maxtremblay in #315
- Implement all line element-wise comparisons by @maxtremblay in #322
- Added panic for memory partitioning error by @ImTheSquid in #324
- Add a macro to insert comments into kernels by @maxtremblay in #370
- Add Neg trait to Float trait. by @vaijira in #395
- Simplify allocator by @maxtremblay in #388
Full Changelog: v0.3.0...v0.4.0