Releases: tracel-ai/cubecl
v0.4.0
Matrix Multiplication (Matmul) Improvements:
Refactored the configuration system for better kernel selection and performance tuning. Added support for batched matmul, double buffering, and pipelined processing to improve throughput and efficiency. Implemented customizable cube dispatch for non-square matrices and introduced heuristics for kernel selection (see the sketch after this list).
- Matmul components by @louisfd in #220
- more precision required by @louisfd in #224
- Matmul/transpose loader by @louisfd in #230
- Matmul: refactor configs by @louisfd in #233
- Matmul/tensor reference by @louisfd in #238
- Matmul: CPU reference with same precisions by @louisfd in #237
- Matmul: batch one_to_many + refactor configuration by @louisfd in #242
- Matmul: slice level by @louisfd in #246
- Matmul: reuse accumulator by @louisfd in #257
- Different seeds for lhs/rhs by @louisfd in #263
- Matmul: customizable cube dispatch by @louisfd in #273
- Matmul: double buffering by @louisfd in #271
- Matmul: minor refactor by @louisfd in #275
- Matmul: small refactoring to allow easier kernel selection by @louisfd in #276
- Matmul: Some refactor by @louisfd in #285
- Matmul: check input bounds by @louisfd in #288
- Matmul: More comptime in accelerated tile by @louisfd in #290
- Matmul: batch broadcast by @louisfd in #306
- Matmul: fix transposed (swizzle) dispatch for non square matrix by @louisfd in #307
- Refactor + Profile Matmul by @nathanielsimard in #292
- Matmul: tilewise loader by @louisfd in #310
- [Feat] Ground work to make GEMM components usable for convolution by @wingertge in #309
- Matmul: kernel select heuristic by @louisfd in #312
- Cast fragment by @nathanielsimard in #311
- Matmul: better error message (minor PR) by @louisfd in #313
- Matmul/auto by @nathanielsimard in #316
- Matmul: pipelined double buffering by @louisfd in #323
- Add Support for cast instruction in hip wmma intrinsic compiler by @syl20bnr in #317
- Return result from matmul launch function by @nathanielsimard in #340
- Some fixes in matmul by @louisfd in #350
- Remove unneeded includes function in WmmaCompiler trait by @syl20bnr in #351
- Improve + refactor matmul by @nathanielsimard in #365
- Fix/matmul plane size 64 by @nathanielsimard in #378
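A minimal sketch of how a caller might drive the new launch entry point, which returns a `Result` since #340. The `cubecl_linalg::matmul::launch_ref` path, the `Strategy` enum, and its `Auto` / `Simple` variants are assumptions inferred from the PRs above, not confirmed API.

```rust
// Hedged sketch only: `cubecl_linalg::matmul::launch_ref`, `Strategy::Auto`,
// and `Strategy::Simple` are assumed names, not verified against the crate.
use cubecl::prelude::*;
use cubecl_linalg::matmul::{self, Strategy};

fn matmul_f32<R: Runtime>(
    client: &ComputeClient<R::Server, R::Channel>,
    lhs: TensorHandleRef<'_, R>,
    rhs: TensorHandleRef<'_, R>,
    out: TensorHandleRef<'_, R>,
) {
    // Because the launch function returns a `Result`, a caller can try the
    // heuristic-selected kernel first and fall back when the current
    // hardware does not support it.
    if matmul::launch_ref::<R, f32>(&Strategy::Auto, client, &lhs, &rhs, &out).is_err() {
        matmul::launch_ref::<R, f32>(&Strategy::Simple, client, &lhs, &rhs, &out)
            .expect("fallback matmul strategy failed");
    }
}
```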
New Crate for Reduce Kernels:
This release introduces a new crate, cubecl-reduce, containing optimized reduce kernels that work on all supported platforms (see the sketch after this list).
- Implement sum as a reduce for vector by @maxtremblay in #264
- Reduce on cuda by @maxtremblay in #274
- Implement a reduction across lines by @maxtremblay in #280
- Query num planes by @maxtremblay in #294
- Improve tests for cubecl-reduce by @maxtremblay in #299
- Import reduce naive from burn by @maxtremblay in #314
- Import reduce shared by @maxtremblay in #329
- Implement a plane reduction by @maxtremblay in #336
- major refactor of reduce by @maxtremblay in #349
- Reduce plane by @maxtremblay in #359
- Reduce shared unit by @maxtremblay in #363
- Reduce shared plane by @maxtremblay in #369
- Merge reduce by @maxtremblay in #402
- Reduce stride by @maxtremblay in #408
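To illustrate the plane-level approach several of the PRs above refer to, here is a hedged sketch of a sum reduction written as a plain CubeCL kernel. It assumes `plane_sum` is the plane-wide sum intrinsic and that the kernel is launched with one plane per cube; it is illustrative, not the cubecl-reduce implementation itself.

```rust
// Illustrative only: assumes `plane_sum` is the plane-wide sum intrinsic and
// that CUBE_DIM equals the plane size (one plane per cube).
use cubecl::prelude::*;

#[cube(launch_unchecked)]
fn plane_sum_kernel<F: Float>(input: &Array<F>, output: &mut Array<F>) {
    let mut partial = F::new(0.0);

    // Each unit accumulates a strided chunk of the input in registers.
    let mut i = UNIT_POS;
    while i < input.len() {
        partial += input[i];
        i += CUBE_DIM;
    }

    // Combine the per-unit partial sums across the plane.
    let total = plane_sum(partial);
    if UNIT_POS == 0 {
        output[CUBE_POS] = total;
    }
}
```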
Compiler and Runtime Optimizations:
Refactored the SPIR-V and HIP compilers, adding features such as WMMA intrinsics and richer debug information. Enhanced WebGPU support with better synchronization mechanisms and hardware property queries. Added support for compile-time constants (see the sketch after this list) and improved code generation across architectures.
- Remove leftover dead phi nodes from branch elimination by @wingertge in #225
- Ensure `LoopBreak` is updated when merging blocks by @wingertge in #228
- Perf/cuda fence by @nathanielsimard in #232
- Refactor wgpu with stream by @nathanielsimard in #245
- Add HIP wmma intrinsic compiler by @syl20bnr in #279
- Read many buffers at once by @nathanielsimard in #277
- Replace pointer magic with reinterpret_cast by @maxtremblay in #281
- Use CudaArchitecture struct to mirror HIP implementation by @syl20bnr in #287
- Add support for 64 wavefront size in HIP compiler by @syl20bnr in #282
- Feat: comptime fields by @nathanielsimard in #338
- More comptime support by @nathanielsimard in #344
- Refactor for wgpu v23 compatibility with an example of wgpu device sharing by @AsherJingkongChen in #211
- Port fence and read many buffers at once to HIP runtime by @syl20bnr in #348
- ROCm 6.3.0 HIP bindings update by @syl20bnr in #362
- [Feat] SPIR-V debug info by @wingertge in #356
- Make rocwmma compiler the default as it covers more AMD architectures by @syl20bnr in #366
- Disable WMMA compiler on CDNA GPUs and update naming accordingly by @syl20bnr in #367
- Feat/ virtual tensor by @nathanielsimard in #380
- [Feat] Expanded SPIR-V debug info by @wingertge in #368
- Remove HIP context from runtime by @syl20bnr in #375
- Add two of the missing bit operations that have hardware acceleration by @wingertge in #391
- [Feat] Rework allocator by @wingertge in #401
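To illustrate the comptime additions (#338, #344), a hedged sketch of a kernel specialized by a compile-time flag. The exact expansion behaviour is an assumption, and the kernel is illustrative rather than taken from the repository.

```rust
// Hedged sketch: the `#[comptime]` flag is resolved when the kernel is
// compiled, so the branch on it is expected to be folded away rather than
// evaluated on the GPU.
use cubecl::prelude::*;

#[cube(launch)]
fn scale<F: Float>(values: &mut Array<F>, factor: F, #[comptime] skip_zeros: bool) {
    if ABSOLUTE_POS < values.len() {
        if skip_zeros {
            // Only emitted when the kernel is compiled with `skip_zeros = true`.
            if values[ABSOLUTE_POS] != F::new(0.0) {
                values[ABSOLUTE_POS] = values[ABSOLUTE_POS] * factor;
            }
        } else {
            values[ABSOLUTE_POS] = values[ABSOLUTE_POS] * factor;
        }
    }
}
```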
New Functionalities:
Added more instructions and broader type support (see the sketch after this list).
- Allow for overriding device, return device setup by @ArthurBrussee in #210
- Feat/more types support by @wingertge in #207
- Add subcube and mma support for HIP compiler by @syl20bnr in #219
- Allow creating kernels with no runtime by @wingertge in #229
- Add topology properties to client by @wingertge in #244
- [Fix] Merge vectorized tf32 with f32 by @wingertge in #253
- Subcube elect by @maxtremblay in #259
- feat: bitwise ops implementation for line by @quinton11 in #284
- [Feat] Make atomics generic and add comprehensive support for different types by @wingertge in #406
- patch: extending the int trait with not ops by @quinton11 in #411
- feat: not-trait-impl-for-int-types by @quinton11 in #412
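For the line-wise bitwise operations (#284), a minimal hedged sketch; it assumes the standard Rust operators apply lane-wise to `Line<u32>` values.

```rust
// Hedged sketch: assumes `&` applies lane-wise to `Line<u32>`.
use cubecl::prelude::*;

#[cube(launch)]
fn apply_mask(input: &Array<Line<u32>>, masks: &Array<Line<u32>>, output: &mut Array<Line<u32>>) {
    if ABSOLUTE_POS < input.len() {
        // Lane-wise AND across every element of the line.
        output[ABSOLUTE_POS] = input[ABSOLUTE_POS] & masks[ABSOLUTE_POS];
    }
}
```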
Bug Fixes:
Fixed various issues with autotuning, particularly for WASM and CUDA environments.
Resolved visibility issues with implementation functions in macros. Addressed multiple synchronization and compilation bugs across different runtime environments. Corrected handling of specific data types and operations in SPIR-V, WGSL, and CUDA.
- Force the staging buffer in `read` to be aligned by @wingertge in #208
- Fix visibility of impl fns getting dropped by the cube macro by @jbelanich in #212
- Fix asynchronous autotuning for wasm by @ArthurBrussee in #213
- Fix setup for SPIR-V by @wingertge in #215
- Fix/wgsl extension by @nathanielsimard in #221
- Fix device creation for wasm by @ArthurBrussee in #218
- Fix sign conversion in SPIR-V by @wingertge in https://githu...
v0.3.0
CubeCL v0.3.0 Release Notes
This release introduces major advancements across platform compatibility, language capabilities, and performance. Key improvements include expanded runtime support, now featuring AMD GPUs via ROCm/HIP and a SPIR-V compiler to boost wgpu performance on Vulkan. The CubeCL language also sees substantial updates, adopting more Rust syntax, compile-time constants, improved generics, enums, and a refined macro system.
Language Features
- Added support for numeric constants by @booti386 in #112
- Added `for in` syntax for immutable arrays, tensors and slices by @wingertge in #119
- Added support for ROCm HIP by @syl20bnr in #183
- Added if as a value expression by @wingertge in #120
- Added select (ternary) operations by @wingertge in #152
- Implemented support for func generics for impl block by @nathanielsimard in #189
- Added support for Enum + Const Match by @nathanielsimard in #145
- Added support for numeric match at runtime by @wingertge in #143
- Added support for comptime arrays available as runtime constants by @wingertge in #147
- Added features for each supported datatype by @wingertge in #193
- Reimplemented macro to make writing kernels more ergonomic by @wingertge in #80
- Clean up macro and optimize branch operations by @wingertge in #118
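A hedged sketch combining a few of the features above, namely `for in` iteration over an immutable input (#119) and `if` as a value expression (#120). The iteration item type and exact syntax are assumptions, so treat this as illustrative rather than verified.

```rust
// Illustrative only: assumes an immutable `Array` can be iterated directly
// with `for in` and that `if` can be used as a value expression.
use cubecl::prelude::*;

#[cube(launch)]
fn clamped_sum<F: Float>(input: &Array<F>, output: &mut Array<F>, limit: F) {
    let mut total = F::new(0.0);

    // `for in` over an immutable array (#119).
    for value in input {
        total += value;
    }

    // `if` as a value expression (#120).
    let result = if total > limit { limit } else { total };

    if ABSOLUTE_POS == 0 {
        output[0] = result;
    }
}
```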
Runtime Improvements
CUDA
- Improved CUDA compiler by @nathanielsimard in #88
- Fixed CUDA architecture version by @nathanielsimard in #89
- Fixed native vector types by @nathanielsimard in #92
- Fixed CUDA support for different ranks by @nathanielsimard in #124
- Better CMMA configuration by @nathanielsimard in #146
- Support SSA bindings for CUDA by @wingertge in #153
- Fixed various CUDA bugs by @nathanielsimard in #168
WGPU
- Fixed WGPU memory corruption for CubeCount::Dynamic by @ArthurBrussee in #156
- Added support for autotuning on WebGPU, more precise timings by @ArthurBrussee in #167
- Fixed overflow when max page == 4GB on WASM by @ArthurBrussee in #194
- Merged `cubecl-wgpu` and `cubecl-wgpu-spirv` by @wingertge in #184
HIP/ROCm
- Added support for ROCm HIP by @syl20bnr in #183
- Added half precision support to HIP by @syl20bnr in #201
- Limited cubecl-hip for Linux targets only by @syl20bnr in #205
SPIR-V
- Added SPIR-V compiler by @wingertge in #155
- Fixed casting, powf and alignment for SPIR-V by @wingertge in #188
Optimization & Performance
- Added value-based partial redundancy elimination by @wingertge in #169
- Added prefetching to into_contiguous by @wingertge in #181
- Added block merging by @wingertge in #163
- Added round and bitwise or operations by @laggui in #99
- Skipped zero initialization of workgroup memory by @ArthurBrussee in #125
- CMMA Optimizations:
  - CMMA: cube dispatch strategy by @louisfd in #126
  - Reuse lhs frag strategy by @louisfd in #132
  - Invert k n loops by @louisfd in #131
  - Continuous warp loading by @louisfd in #138
  - Relative warp IDs by @louisfd in #144
  - Relaxed b_m = b_n by @louisfd in #148
  - New strategy for num compute planes + many refactors by @louisfd in #150
Infrastructure
- Added profiling support by @nathanielsimard in #137
- Improved compilation arguments by @nathanielsimard in #141
- Added simple benchmarking capabilities by @jbelanich in #190
- Added periodic memory cleanup by @ArthurBrussee in #178
- Reworked & added ExclusivePages as memory management option by @ArthurBrussee in #158
- Fixed concurrency problems with autotune by @nathanielsimard in #200
- Improved timing methods for benchmarking by @jbelanich in #190
- Fixed CI for Rust 1.82 by @nathanielsimard in #182
- Migrated xtask to tracel-xtask by @syl20bnr in #93
- Updated CI workflow and badges by @syl20bnr in #96
Math & Operations
- Implemented dot product by @RianGoossens in #140
- Implemented magnitude by @RianGoossens in #105
- Added Round, Floor, Ceil for Line by @med1844 in #179
- Implemented Vector Normalization by @RianGoossens in #100
- Added round and bitwise operations by @laggui in #99
Documentation & Examples
- Added simple fusion example by @nathanielsimard in #142
- Updated README by @nathanielsimard in #192
- Added book by @nathanielsimard in #133
- Format floating point values with maximum precision by @ArthurBrussee in #130
Bug Fixes & Maintenance
- Handle empty tensors by @laggui in #86
- Fixed flaky tests in topology by @nathanielsimard in #109
- Fixed no-std support by @nathanielsimard in #175
- Fixed WASM infinite loop by @nathanielsimard in #176
- Fixed deadlock by @ArthurBrussee in #177
- Fixed legacy kernels by auto-casting unary ops by @wingertge in #187
- Fixed pico support by @BjornTheProgrammer in #198
- Fixed check on macOS and minor refactor by @AsherJingkongChen in #204
- Fixed validate checksum by @nathanielsimard in #202
- Fixed for backends with higher alignments by @ArthurBrussee in #191