Releases: tracel-ai/cubecl
v0.4.0
Matrix Multiplication (Matmul) Improvements:
Refactored the configuration system for better kernel selection and performance tuning. Added support for batched matmul, double buffering, and pipelined processing to improve throughput and efficiency. Implemented customizable cube dispatch for non-square matrices and introduced heuristics for kernel selection (see the sketch after this list).
- Matmul components by @louisfd in #220
- more precision required by @louisfd in #224
- Matmul/transpose loader by @louisfd in #230
- Matmul: refactor configs by @louisfd in #233
- Matmul/tensor reference by @louisfd in #238
- Matmul: CPU reference with same precisions by @louisfd in #237
- Matmul: batch one_to_many + refactor configuration by @louisfd in #242
- Matmul: slice level by @louisfd in #246
- Matmul: reuse accumulator by @louisfd in #257
- Different seeds for lhs/rhs by @louisfd in #263
- Matmul: customizable cube dispatch by @louisfd in #273
- Matmul: double buffering by @louisfd in #271
- Matmul: minor refactor by @louisfd in #275
- Matmul: small refactoring to allow easier kernel selection by @louisfd in #276
- Matmul: Some refactor by @louisfd in #285
- Matmul: check input bounds by @louisfd in #288
- Matmul: More comptime in accelerated tile by @louisfd in #290
- Matmul: batch broadcast by @louisfd in #306
- Matmul: fix transposed (swizzle) dispatch for non square matrix by @louisfd in #307
- Refactor + Profile Matmul by @nathanielsimard in #292
- Matmul: tilewise loader by @louisfd in #310
- [Feat] Ground work to make GEMM components usable for convolution by @wingertge in #309
- Matmul: kernel select heuristic by @louisfd in #312
- Cast fragment by @nathanielsimard in #311
- Matmul: better error message (minor PR) by @louisfd in #313
- Matmul/auto by @nathanielsimard in #316
- Matmul: pipelined double buffering by @louisfd in #323
- Add Support for cast instruction in hip wmma intrinsic compiler by @syl20bnr in #317
- Return result from matmul launch function by @nathanielsimard in #340
- Some fixes in matmul by @louisfd in #350
- Remove unneeded includes function in WmmaCompiler trait by @syl20bnr in #351
- Improve + refactor matmul by @nathanielsimard in #365
- Fix/matmul plane size 64 by @nathanielsimard in #378
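A minimal sketch of how a caller might drive the new launch entry point, which returns a `Result` since #340. The `cubecl_linalg::matmul::launch_ref` path, the `Strategy` enum, and its `Auto` / `Simple` variants are assumptions inferred from the PRs above, not confirmed API.

```rust
// Hedged sketch only: `cubecl_linalg::matmul::launch_ref`, `Strategy::Auto`,
// and `Strategy::Simple` are assumed names, not verified against the crate.
use cubecl::prelude::*;
use cubecl_linalg::matmul::{self, Strategy};

fn matmul_f32<R: Runtime>(
    client: &ComputeClient<R::Server, R::Channel>,
    lhs: TensorHandleRef<'_, R>,
    rhs: TensorHandleRef<'_, R>,
    out: TensorHandleRef<'_, R>,
) {
    // Because the launch function returns a `Result`, a caller can try the
    // heuristic-selected kernel first and fall back when the current
    // hardware does not support it.
    if matmul::launch_ref::<R, f32>(&Strategy::Auto, client, &lhs, &rhs, &out).is_err() {
        matmul::launch_ref::<R, f32>(&Strategy::Simple, client, &lhs, &rhs, &out)
            .expect("fallback matmul strategy failed");
    }
}
```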
New Crate for Reduce Kernels:
This release introduces a new crate, cubecl-reduce, containing optimized reduce kernels that work on all supported platforms (see the sketch after this list).
- Implement sum as a reduce for vector by @maxtremblay in #264
- Reduce on cuda by @maxtremblay in #274
- Implement a reduction across lines by @maxtremblay in #280
- Query num planes by @maxtremblay in #294
- Improve tests for cubecl-reduce by @maxtremblay in #299
- Import reduce naive from burn by @maxtremblay in #314
- Import reduce shared by @maxtremblay in #329
- Implement a plane reduction by @maxtremblay in #336
- major refactor of reduce by @maxtremblay in #349
- Reduce plane by @maxtremblay in #359
- Reduce shared unit by @maxtremblay in #363
- Reduce shared plane by @maxtremblay in #369
- Merge reduce by @maxtremblay in #402
- Reduce stride by @maxtremblay in #408
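To illustrate the plane-level approach several of the PRs above refer to, here is a hedged sketch of a sum reduction written as a plain CubeCL kernel. It assumes `plane_sum` is the plane-wide sum intrinsic and that the kernel is launched with one plane per cube; it is illustrative, not the cubecl-reduce implementation itself.

```rust
// Illustrative only: assumes `plane_sum` is the plane-wide sum intrinsic and
// that CUBE_DIM equals the plane size (one plane per cube).
use cubecl::prelude::*;

#[cube(launch_unchecked)]
fn plane_sum_kernel<F: Float>(input: &Array<F>, output: &mut Array<F>) {
    let mut partial = F::new(0.0);

    // Each unit accumulates a strided chunk of the input in registers.
    let mut i = UNIT_POS;
    while i < input.len() {
        partial += input[i];
        i += CUBE_DIM;
    }

    // Combine the per-unit partial sums across the plane.
    let total = plane_sum(partial);
    if UNIT_POS == 0 {
        output[CUBE_POS] = total;
    }
}
```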
Compiler and Runtime Optimizations:
Refactored the SPIR-V and HIP compilers, adding features such as WMMA intrinsics and richer debug information. Enhanced WebGPU support with better synchronization mechanisms and hardware property queries. Added support for compile-time constants (see the sketch after this list) and improved code generation across architectures.
- Remove leftover dead phi nodes from branch elimination by @wingertge in #225
- Ensure `LoopBreak` is updated when merging blocks by @wingertge in #228
- Perf/cuda fence by @nathanielsimard in #232
- Refactor wgpu with stream by @nathanielsimard in #245
- Add HIP wmma intrinsic compiler by @syl20bnr in #279
- Read many buffers at once by @nathanielsimard in #277
- Replace pointer magic with reinterpret_cast by @maxtremblay in #281
- Use CudaArchitecture struct to mirror HIP implementation by @syl20bnr in #287
- Add support for 64 wavefront size in HIP compiler by @syl20bnr in #282
- Feat: comptime fields by @nathanielsimard in #338
- More comptime support by @nathanielsimard in #344
- Refactor for wgpu v23 compatibility with an example of wgpu device sharing by @AsherJingkongChen in #211
- Port fence and read many buffers at once to HIP runtime by @syl20bnr in #348
- ROCm 6.3.0 HIP bindings update by @syl20bnr in #362
- [Feat] SPIR-V debug info by @wingertge in #356
- Make rocwmma compiler the default as it covers more AMD architectures by @syl20bnr in #366
- Disable WMMA compiler on CDNA GPUs and update naming accordingly by @syl20bnr in #367
- Feat/ virtual tensor by @nathanielsimard in #380
- [Feat] Expanded SPIR-V debug info by @wingertge in #368
- Remove HIP context from runtime by @syl20bnr in #375
- Add two of the missing bit operations that have hardware acceleration by @wingertge in #391
- [Feat] Rework allocator by @wingertge in #401
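To illustrate the comptime additions (#338, #344), a hedged sketch of a kernel specialized by a compile-time flag. The exact expansion behaviour is an assumption, and the kernel is illustrative rather than taken from the repository.

```rust
// Hedged sketch: the `#[comptime]` flag is resolved when the kernel is
// compiled, so the branch on it is expected to be folded away rather than
// evaluated on the GPU.
use cubecl::prelude::*;

#[cube(launch)]
fn scale<F: Float>(values: &mut Array<F>, factor: F, #[comptime] skip_zeros: bool) {
    if ABSOLUTE_POS < values.len() {
        if skip_zeros {
            // Only emitted when the kernel is compiled with `skip_zeros = true`.
            if values[ABSOLUTE_POS] != F::new(0.0) {
                values[ABSOLUTE_POS] = values[ABSOLUTE_POS] * factor;
            }
        } else {
            values[ABSOLUTE_POS] = values[ABSOLUTE_POS] * factor;
        }
    }
}
```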
New Functionalities:
Added more instructions and broader type support (see the sketch after this list).
- Allow for overriding device, return device setup by @ArthurBrussee in #210
- Feat/more types support by @wingertge in #207
- Add subcube and mma support for HIP compiler by @syl20bnr in #219
- Allow creating kernels with no runtime by @wingertge in #229
- Add topology properties to client by @wingertge in #244
- [Fix] Merge vectorized tf32 with f32 by @wingertge in #253
- Subcube elect by @maxtremblay in #259
- feat: bitwise ops implementation for line by @quinton11 in #284
- [Feat] Make atomics generic and add comprehensive support for different types by @wingertge in #406
- patch: extending the int trait with not ops by @quinton11 in #411
- feat: not-trait-impl-for-int-types by @quinton11 in #412
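For the line-wise bitwise operations (#284), a minimal hedged sketch; it assumes the standard Rust operators apply lane-wise to `Line<u32>` values.

```rust
// Hedged sketch: assumes `&` applies lane-wise to `Line<u32>`.
use cubecl::prelude::*;

#[cube(launch)]
fn apply_mask(input: &Array<Line<u32>>, masks: &Array<Line<u32>>, output: &mut Array<Line<u32>>) {
    if ABSOLUTE_POS < input.len() {
        // Lane-wise AND across every element of the line.
        output[ABSOLUTE_POS] = input[ABSOLUTE_POS] & masks[ABSOLUTE_POS];
    }
}
```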
Bug Fixes:
Fixed various issues with autotuning, particularly for WASM and CUDA environments.
Resolved visibility issues with implementation functions in macros. Addressed multiple synchronization and compilation bugs across different runtime environments. Corrected handling of specific data types and operations in SPIR-V, WGSL, and CUDA.
- Force the staging buffer in `read` to be aligned by @wingertge in #208
- Fix visibility of impl fns getting dropped by the cube macro by @jbelanich in #212
- Fix asynchronous autotuning for wasm by @ArthurBrussee in #213
- Fix setup for SPIR-V by @wingertge in #215
- Fix/wgsl extension by @nathanielsimard in #221
- Fix device creation for wasm by @ArthurBrussee in #218
- Fix sign conversion in SPIR-V by @wingertge in https://githu...
v0.3.0
CubeCL v0.3.0 Release Notes
This release introduces major advancements across platform compatibility, language capabilities, and performance. Key improvements include expanded runtime support, now featuring AMD GPUs via ROCm/HIP and a SPIR-V compiler to boost wgpu performance on Vulkan. The CubeCL language also sees substantial updates, adopting more Rust syntax, compile-time constants, improved generics, enums, and a refined macro system.
Language Features
- Added support for numeric constants by @booti386 in #112
- Added `for in` syntax for immutable arrays, tensors and slices by @wingertge in #119
- Added support for ROCm HIP by @syl20bnr in #183
- Added if as a value expression by @wingertge in #120
- Added select (ternary) operations by @wingertge in #152
- Implemented support for func generics for impl block by @nathanielsimard in #189
- Added support for Enum + Const Match by @nathanielsimard in #145
- Added support for numeric match at runtime by @wingertge in #143
- Added support for comptime arrays available as runtime constants by @wingertge in #147
- Added features for each supported datatype by @wingertge in #193
- Reimplemented macro to make writing kernels more ergonomic by @wingertge in #80
- Clean up macro and optimize branch operations by @wingertge in #118
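A hedged sketch combining a few of the features above, namely `for in` iteration over an immutable input (#119) and `if` as a value expression (#120). The iteration item type and exact syntax are assumptions, so treat this as illustrative rather than verified.

```rust
// Illustrative only: assumes an immutable `Array` can be iterated directly
// with `for in` and that `if` can be used as a value expression.
use cubecl::prelude::*;

#[cube(launch)]
fn clamped_sum<F: Float>(input: &Array<F>, output: &mut Array<F>, limit: F) {
    let mut total = F::new(0.0);

    // `for in` over an immutable array (#119).
    for value in input {
        total += value;
    }

    // `if` as a value expression (#120).
    let result = if total > limit { limit } else { total };

    if ABSOLUTE_POS == 0 {
        output[0] = result;
    }
}
```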
Runtime Improvements
CUDA
- Improved CUDA compiler by @nathanielsimard in #88
- Fixed CUDA architecture version by @nathanielsimard in #89
- Fixed native vector types by @nathanielsimard in #92
- Fixed CUDA support for different ranks by @nathanielsimard in #124
- Better CMMA configuration by @nathanielsimard in #146
- Support SSA bindings for CUDA by @wingertge in #153
- Fixed various CUDA bugs by @nathanielsimard in #168
WGPU
- Fixed WGPU memory corruption for CubeCount::Dynamic by @ArthurBrussee in #156
- Added support for autotuning on WebGPU, more precise timings by @ArthurBrussee in #167
- Fixed overflow when max page == 4GB on WASM by @ArthurBrussee in #194
- Merged `cubecl-wgpu` and `cubecl-wgpu-spirv` by @wingertge in #184
HIP/ROCm
- Added support for ROCm HIP by @syl20bnr in #183
- Added half precision support to HIP by @syl20bnr in #201
- Limited cubecl-hip for Linux targets only by @syl20bnr in #205
SPIR-V
- Added SPIR-V compiler by @wingertge in #155
- Fixed casting, powf and alignment for SPIR-V by @wingertge in #188
Optimization & Performance
- Added value-based partial redundancy elimination by @wingertge in #169
- Added prefetching to into_contiguous by @wingertge in #181
- Added block merging by @wingertge in #163
- Added round and bitwise or operations by @laggui in #99
- Skipped zero initialization of workgroup memory by @ArthurBrussee in #125
- CMMA Optimizations:
  - CMMA: cube dispatch strategy by @louisfd in #126
  - Reuse lhs frag strategy by @louisfd in #132
  - Invert k n loops by @louisfd in #131
  - Continuous warp loading by @louisfd in #138
  - Relative warp IDs by @louisfd in #144
  - Relaxed b_m = b_n by @louisfd in #148
  - New strategy for num compute planes + many refactors by @louisfd in #150
Infrastructure
- Added profiling support by @nathanielsimard in #137
- Improved compilation arguments by @nathanielsimard in #141
- Added simple benchmarking capabilities by @jbelanich in #190
- Added periodic memory cleanup by @ArthurBrussee in #178
- Reworked & added ExclusivePages as memory management option by @ArthurBrussee in #158
- Fixed concurrency problems with autotune by @nathanielsimard in #200
- Improved timing methods for benchmarking by @jbelanich in #190
- Fixed CI for Rust 1.82 by @nathanielsimard in #182
- Migrated xtask to tracel-xtask by @syl20bnr in #93
- Updated CI workflow and badges by @syl20bnr in #96
Math & Operations
- Implemented dot product by @RianGoossens in #140
- Implemented magnitude by @RianGoossens in #105
- Added Round, Floor, Ceil for Line by @med1844 in #179
- Implemented Vector Normalization by @RianGoossens in #100
- Added round and bitwise operations by @laggui in #99
Documentation & Examples
- Added simple fusion example by @nathanielsimard in #142
- Updated README by @nathanielsimard in #192
- Added book by @nathanielsimard in #133
- Format floating point values with maximum precision by @ArthurBrussee in #130
Bug Fixes & Maintenance
- Handle empty tensors by @laggui in #86
- Fixed flaky tests in topology by @nathanielsimard in #109
- Fixed no-std support by @nathanielsimard in #175
- Fixed WASM infinite loop by @nathanielsimard in #176
- Fixed deadlock by @ArthurBrussee in #177
- Fixed legacy kernels by auto-casting unary ops by @wingertge in #187
- Fixed pico support by @BjornTheProgrammer in #198
- Fixed check on macOS and minor refactor by @AsherJingkongChen in #204
- Fixed validate checksum by @nathanielsimard in #202
- Fixed for backends with higher alignments by @ArthurBrussee in #191