Matrix Multiplication (Matmul) Improvements:
Refactored the matmul configuration for better kernel selection and performance tuning. Added support for batch operations, double buffering, and pipelined processing to improve throughput and efficiency (a sketch of the double-buffering idea follows the list below). Implemented customizable cube dispatch for non-square matrices and introduced heuristics for kernel selection.
- Matmul components by @louisfd in #220
- more precision required by @louisfd in #224
- Matmul/transpose loader by @louisfd in #230
- Matmul: refactor configs by @louisfd in #233
- Matmul/tensor reference by @louisfd in #238
- Matmul: CPU reference with same precisions by @louisfd in #237
- Matmul: batch one_to_many + refactor configuration by @louisfd in #242
- Matmul: slice level by @louisfd in #246
- Matmul: reuse accumulator by @louisfd in #257
- Different seeds for lhs/rhs by @louisfd in #263
- Matmul: customizable cube dispatch by @louisfd in #273
- Matmul: double buffering by @louisfd in #271
- Matmul: minor refactor by @louisfd in #275
- Matmul: small refactoring to allow easier kernel selection by @louisfd in #276
- Matmul: Some refactor by @louisfd in #285
- Matmul: check input bounds by @louisfd in #288
- Matmul: More comptime in accelerated tile by @louisfd in #290
- Matmul: batch broadcast by @louisfd in #306
- Matmul: fix transposed (swizzle) dispatch for non square matrix by @louisfd in #307
- Refactor + Profile Matmul by @nathanielsimard in #292
- Matmul: tilewise loader by @louisfd in #310
- [Feat] Ground work to make GEMM components usable for convolution by @wingertge in #309
- Matmul: kernel select heuristic by @louisfd in #312
- Cast fragment by @nathanielsimard in #311
- Matmul: better error message (minor PR) by @louisfd in #313
- Matmul/auto by @nathanielsimard in #316
- Matmul: pipelined double buffering by @louisfd in #323
- Add Support for cast instruction in hip wmma intrinsic compiler by @syl20bnr in #317
- Return result from matmul launch function by @nathanielsimard in #340
- Some fixes in matmul by @louisfd in #350
- Remove unneeded includes function in WmmaCompiler trait by @syl20bnr in #351
- Improve + refactor matmul by @nathanielsimard in #365
- Fix/matmul plane size 64 by @nathanielsimard in #378
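The double buffering (#271) and pipelined double buffering (#323) changes overlap tile loads with computation. As referenced above, here is a minimal sketch of the ping-pong buffer idea in plain Rust; it is not CubeCL's kernel code, and matmul_double_buffered, tile_size, and compute_tile are illustrative names only (on the GPU the staging copy is asynchronous rather than sequential).

```rust
// Illustrative ping-pong double buffering: while the tile loaded on the
// previous iteration is consumed, the next tile is staged into the other
// buffer. In the real GPU kernels the staging copy is an asynchronous
// shared-memory load; plain slices are used here only to show the pattern.
fn matmul_double_buffered(
    lhs: &[f32],
    tile_size: usize,
    num_tiles: usize,
    mut compute_tile: impl FnMut(&[f32]),
) {
    if num_tiles == 0 {
        return;
    }
    // Two staging buffers that swap roles every iteration.
    let mut buffers = [vec![0.0f32; tile_size], vec![0.0f32; tile_size]];
    // Preload the first tile before entering the main loop.
    buffers[0].copy_from_slice(&lhs[0..tile_size]);

    for i in 0..num_tiles {
        let (current, next) = (i % 2, (i + 1) % 2);
        // Stage the next tile while the current one is being consumed.
        if i + 1 < num_tiles {
            let start = (i + 1) * tile_size;
            buffers[next].copy_from_slice(&lhs[start..start + tile_size]);
        }
        // Accumulate using the tile loaded on the previous iteration.
        compute_tile(&buffers[current][..]);
    }
}

fn main() {
    let lhs: Vec<f32> = (0..8).map(|x| x as f32).collect();
    let mut acc = 0.0f32;
    matmul_double_buffered(&lhs, 4, 2, |tile| acc += tile.iter().sum::<f32>());
    println!("accumulated: {acc}");
}
```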
New Crate for Reduce Kernels
This release introduces a new crate (cubecl-reduce) containing optimized reduce kernels that work across all supported platforms; a minimal sketch of the reduction pattern follows the list below.
- Implement sum as a reduce for vector by @maxtremblay in #264
- Reduce on cuda by @maxtremblay in #274
- Implement a reduction across lines by @maxtremblay in #280
- Query num planes by @maxtremblay in #294
- Improve tests for cubecl-reduce by @maxtremblay in #299
- Import reduce naive from burn by @maxtremblay in #314
- Import reduce shared by @maxtremblay in #329
- Implement a plane reduction by @maxtremblay in #336
- major refactor of reduce by @maxtremblay in #349
- Reduce plane by @maxtremblay in #359
- Reduce shared unit by @maxtremblay in #363
- Reduce shared plane by @maxtremblay in #369
- Merge reduce by @maxtremblay in #402
- Reduce stride by @maxtremblay in #408
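As a reference point for the new crate, the sketch below shows the general shape of a line-wise sum reduction in plain Rust. It only illustrates the pattern the kernels parallelize; reduce_sum_lines and line_size are hypothetical names, not the cubecl-reduce API.

```rust
// Illustrative line-wise sum: each line of `line_size` contiguous elements is
// reduced independently. The cubecl-reduce kernels parallelize this across
// units and planes; this sequential version only shows the access pattern.
fn reduce_sum_lines(input: &[f32], line_size: usize) -> Vec<f32> {
    input
        .chunks(line_size)
        .map(|line| line.iter().sum())
        .collect()
}

fn main() {
    let data: [f32; 6] = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    // Two lines of three elements each -> [6.0, 15.0].
    println!("{:?}", reduce_sum_lines(&data, 3));
}
```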
Compiler and Runtime Optimizations:
Refactored the SPIR-V and HIP compilers with support for new features such as WMMA intrinsics and improved debug information. Enhanced WebGPU support with better sync mechanisms and hardware property queries. Added support for compile-time (comptime) constants and improved code generation for various architectures; a rough analogy of how comptime constants enable specialization follows the list below.
- Remove leftover dead phi nodes from branch elimination by @wingertge in #225
- Ensure LoopBreak is updated when merging blocks by @wingertge in #228
- Perf/cuda fence by @nathanielsimard in #232
- Refactor wgpu with stream by @nathanielsimard in #245
- Add HIP wmma intrinsic compiler by @syl20bnr in #279
- Read many buffers at once by @nathanielsimard in #277
- Replace pointer magic with reinterpret_cast by @maxtremblay in #281
- Use CudaArchitecture struct to mirror HIP implementation by @syl20bnr in #287
- Add support for 64 wavefront size in HIP compiler by @syl20bnr in #282
- Feat: comptime fields by @nathanielsimard in #338
- More comptime support by @nathanielsimard in #344
- Refactor for wgpu v23 compatibility with an example of wgpu device sharing by @AsherJingkongChen in #211
- Port fence and read many buffers at once to HIP runtime by @syl20bnr in #348
- ROCm 6.3.0 HIP bindings update by @syl20bnr in #362
- [Feat] SPIR-V debug info by @wingertge in #356
- Make rocwmma compiler the default as it covers more AMD architectures by @syl20bnr in #366
- Disable WMMA compiler on CDNA GPUs and update naming accordingly by @syl20bnr in #367
- Feat/ virtual tensor by @nathanielsimard in #380
- [Feat] Expanded SPIR-V debug info by @wingertge in #368
- Remove HIP context from runtime by @syl20bnr in #375
- Add two of the missing bit operations that have hardware acceleration by @wingertge in #391
- [Feat] Rework allocator by @wingertge in #401
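The comptime additions (#338, #344) let values be fixed at kernel compilation time so the generated code can be specialized. The snippet below is only a rough analogy using Rust const generics, not the CubeCL comptime syntax; scale_block and BLOCK are illustrative names.

```rust
// Analogy only: the const generic plays the role of a comptime value, so the
// compiler can specialize the function (unroll the loop, fold the bound)
// instead of reading the block size from a runtime argument.
fn scale_block<const BLOCK: usize>(data: &mut [f32; BLOCK], factor: f32) {
    for x in data.iter_mut() {
        *x *= factor;
    }
}

fn main() {
    let mut block = [1.0f32; 8];
    // BLOCK = 8 is known at compile time; each block size used gets its own
    // monomorphized copy of the function.
    scale_block(&mut block, 2.0);
    println!("{:?}", block);
}
```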
New Functionalities:
Added support for more instructions, including bitwise and atomic operations, along with broader type support; an element-wise sketch follows the list below.
- Allow for overriding device, return device setup by @ArthurBrussee in #210
- Feat/more types support by @wingertge in #207
- Add subcube and mma support for HIP compiler by @syl20bnr in #219
- Allow creating kernels with no runtime by @wingertge in #229
- Add topology properties to client by @wingertge in #244
- [Fix] Merge vectorized tf32 with f32 by @wingertge in #253
- Subcube elect by @maxtremblay in #259
- feat: bitwise ops implementation for line by @quinton11 in #284
- [Feat] Make atomics generic and add comprehensive support for different types by @wingertge in #406
- patch: extending the int trait with not ops by @quinton11 in #411
- feat: not-trait-impl-for-int-types by @quinton11 in #412
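For the element-wise integer work (bitwise ops on lines in #284 and the not ops in #411/#412), the sketch below illustrates lane-wise bitwise not over a small fixed-width vector in plain Rust; Line4 is a hypothetical stand-in, not the CubeCL Line type.

```rust
use std::ops::Not;

// Hypothetical 4-wide integer line. The CubeCL Line type is generic over the
// element type and is lowered to vectorized GPU instructions; a fixed array
// is used here only to show the lane-wise semantics.
#[derive(Debug, Clone, Copy)]
struct Line4([u32; 4]);

impl Not for Line4 {
    type Output = Line4;

    // Bitwise not applied independently to every lane.
    fn not(self) -> Line4 {
        Line4(self.0.map(|x| !x))
    }
}

fn main() {
    let line = Line4([0, 1, u32::MAX, 0xF0F0_F0F0]);
    println!("{:?}", !line);
}
```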
Bug Fixes
Fixed various issues with autotuning, particularly for WASM and CUDA environments.
Resolved visibility issues with implementation functions in macros. Addressed multiple synchronization and compilation bugs across different runtime environments. Corrected handling of specific data types and operations in SPIR-V, WGSL, and CUDA.
- Force the staging buffer in read to be aligned by @wingertge in #208
- Fix visibility of impl fns getting dropped by the cube macro by @jbelanich in #212
- Fix asynchronous autotuning for wasm by @ArthurBrussee in #213
- Fix setup for SPIR-V by @wingertge in #215
- Fix/wgsl extension by @nathanielsimard in #221
- Fix device creation for wasm by @ArthurBrussee in #218
- Fix sign conversion in SPIR-V by @wingertge in #223
- Fix: Autotune cache miss by @nathanielsimard in #235
- Refactor metadata to make it more simple and flexible by @wingertge in #236
- Fix exclusive memory only and custom wgsl kernels by @wingertge in #240
- Fix sync with CUDA by @nathanielsimard in #247
- Write enable subgroups on the web by @ArthurBrussee in #251
- Fix autotune compilation on WASM by @ArthurBrussee in #252
- Add max_bindings hardware limit by @ArthurBrussee in #256
- Fix/wgpu queue by @nathanielsimard in #262
- Fix optimizer bug with index merge by @wingertge in #265
- Add missing mma.h include in wmma CUDA compiler by @syl20bnr in #289
- Propagate kernel name via KernelSettings by @torsteingrindvik in #278
- Fix hardware properties for WebGPU by @ArthurBrussee in #300
- Consistent OOB behaviour for wgpu by @ArthurBrussee in #296
- Fix/bool cast by @wingertge in #303
- Add lineinfo flag to CUDA source compilation options by @torsteingrindvik in #305
- Fix CMMA stride on SPIR-V by @wingertge in #319
- Fix reserved kernel variable name. by @nathanielsimard in #318
- Fix Clippy + CI by @nathanielsimard in #321
- Fix cuda fragment cast bf16 by @nathanielsimard in #320
- fix infinite loop by @nathanielsimard in #325
- Fix matmul simple by @nathanielsimard in #326
- Fix xtask command with last version by @syl20bnr in #327
- Fix/ci by @nathanielsimard in #331
- Fix errors on system with multiple AMD CDNA GPUs by @syl20bnr in #337
- Fix test_buffer_len_offset test on MacOS by @syl20bnr in #342
- Fix find_vectorization broadcasting for line size 1 by @laggui in #347
- Fix constant arrays in SPIR-V by @wingertge in #346
- Fix include of rocwmma headers when using HIP wmma intrinsic compiler by @syl20bnr in #361
- Make spirv feature compile on macOS by @syl20bnr in #301
- Don't panic when readback channel is closed by @ArthurBrussee in #352
- Fix HIP runtime sync and alignment to match with CUDA runtime by @syl20bnr in #371
- Add more primitives to list in when parsing Expression. by @torsteingrindvik in #384
- Fix for optimizer panic in some matmul kernels by @wingertge in #392
- Fix wgsl integer format + cpp checked index by @nathanielsimard in #396
- Check for CUDA distribution installation. by @vaijira in #394
- Fix magnitude cuda by @maxtremblay in #398
- Fix Neg operation in CUDA. by @vaijira in #409
- Flush when too many handles are locked by @nathanielsimard in #405
- Fix reduce cube count too big by @nathanielsimard in #423
Refactoring
Significant refactoring of the IR (Intermediate Representation) for cleaner, more maintainable code.
Streamlined autotune processes and simplified the optimizer for better extensibility.
Updated and cleaned up the codebase to align with newer versions of Rust and its ecosystem.
- IR Refactor by @wingertge in #199
- Simplify autotune, skip if only 1 option by @ArthurBrussee in #196
- Refactor/slice lifetime by @nathanielsimard in #255
- Rename subcube to plane by @louisfd in #258
- add missing argument to cuda compiler in test common by @maxtremblay in #298
- Feat/typemap by @nathanielsimard in #389
- Clean up IR slightly by merging comments into NonSemantic IR type by @wingertge in #390
- Refactor optimizer for maintainability/extensibility by @wingertge in #393
Documentation & User Experience
Enhanced error messages, particularly for matrix operations, providing clearer feedback on issues. Added documentation to support users in understanding new features and configurations. Implemented user hints for deriving traits and using extensions in kernel functions.
- Adjust naming and comments by @AsherJingkongChen in #214
- Dot product of numeric lines + minor stuff by @louisfd in #227
- Add Runtime trait method to get default kernel/shader extension by @torsteingrindvik in #267
- Inform users that launch kernels cannot have return types by @torsteingrindvik in #268
- Hint users towards the possibility of deriving CubeType and CubeLaunch… by @torsteingrindvik in #272
- Kernel names with float suffix by @torsteingrindvik in #302
- Update some CUDA documentation and extract CUDA_MAX_BINDINGS by @Pencilcaseman in #308
- Implement Tensor::coordinate method by @maxtremblay in #315
- Implement all line element-wise comparisons by @maxtremblay in #322
- Added panic for memory partitioning error by @ImTheSquid in #324
- Add a macro to insert comments into kernels by @maxtremblay in #370
- Add Neg trait to Float trait. by @vaijira in #395
- Simplify allocator by @maxtremblay in #388
Full Changelog: v0.3.0...v0.4.0