v0.54.0
Pre-release
Note: If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with that release rather than the documentation on the main branch. There may be differences between the latest main and the previous release.
The changelog follows, showing the changes since the last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12768962484
📦 Uncategorized
- Isolate tracy
- PR: #16161
- #0: Enforce tile layout when using bf4/bf8 data types
- PR: #16199
- MeshDevice: Support Quanta Galaxy system file
- PR: #16239
- Move Device members from public to private
- PR: #16256
- Add unary sharded sweeps
- PR: #15300
- #0: Added core_grid offset for sharded layernorm
- PR: #16207
- Fix absolute path bug in sweeps test code
- PR: #16285
- #0: Publish TT-Distributed doc under tech_reports
- PR: #16261
- #15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
- PR: #16105
- #16265: Remove creation op
- PR: #16269
- Fix unsigned arithmetic bugs in reshape ops
- PR: #16253
- Fix compile issue for earlier c++ versions
- PR: #16291
- #0: Typo fix in TT distributed tech report
- PR: #16308
- [Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
- PR: #16292
- LLM tech report sections 3.1, 3.4, 3.5
- PR: #15110
- LLM Tech report section 4.4
- PR: #15166
- Move some Device methods to private section
- PR: #16259
- #0: [skip_ci] Update Distributed Tech Report with Discord Server link
- PR: #16314
- #15857: Binary Forge Sweep Tests Set1
- PR: #16042
- #0: Fix get_dispatch_core_config in conftest.py so it does not modify device_params and affect subsequent tests
- PR: #16290
- #0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small
- PR: #16315
- #16066: Add seed param to uniform and bernoulli ops
- PR: #16179
- #0: Add StrongType to help creating non-clashing alias types
- PR: #16309
- #0: Fix ccl workers not starting
- PR: #16333
- #15642: Replace shapes in eltwise
- PR: #15646
- Remove old fd init code path
- PR: #16321
- Remove more namespace pollution caused by `using namespace tt::tt_metal` in a header file
- PR: #16342
- #0: make dependent configs dependent
- PR: #16324
- #13643: Extend binary-ng math support to match all primitive binary ops
- PR: #16276
- Fix wrong output tensor shape for prod
- PR: #16334
- Update CODEOWNERS
- PR: #16358
- Add subdevice support to multicore untilize
- PR: #16193
- add multi-iteration support to reduce scatter async
- PR: #16294
- #16356: Program Dispatch Modifications for MeshWorkload
- PR: #16361
- Refactor conv files using clang-format
- PR: #16340
- #15338: Fix watcher using the wrong cmd bufs for addr sanitization when using dynamic noc
- PR: #16363
- Add cluster-axis API support to reduce scatter
- PR: #16293
- split ttnn unit tests 8 ways
- PR: #16382
- split ttnn tests into 10 groups
- PR: #16383
- #0: Fixes for remote circular buffer synchronization
- PR: #16378
- #0: Initial tech report for Sub-Device feature
- PR: #16387
- Adapt to tt-system-tools hugepages configuration
- PR: #14396
- Further removal of Shape/LegacyShape in order to allow 0D/1D tensors
- PR: #16337
- #16134: add test cases for pre-allocated CreateBuffer / ttnn::event_query
- PR: #16135
- setting multi-core for tilize with padding
- PR: #16252
- reshape assert fix
- PR: #16300
- #16165: Add binary SFPU divide init function
- PR: #16250
- #15879: supported subcoregrid for createqkv heads
- PR: #15972
- Reimplemented dropout as a separate op
- PR: #16328
- #16356: Reland Program Dispatch Modifications for MeshWorkload
- PR: #16385
- Support all dim lengths for reduction
- PR: #16247
- Check that writes don't go below the ring buffer
- PR: #16399
- #16390: Move reduce_scatter_async into experimental namespace and enable cluster api tests
- PR: #16407
- Typecast in ng
- PR: #16317
- Speed up linking for incremental builds.
- PR: #15994
- #0: Don't return shared ptrs of global sems/cbs, and directly return the object instead
- PR: #16354
- Add support for act_block_h_override to Width Sharded Conv2d
- PR: #16374
- #0: Fix CMakeLists
- PR: #16417
- Update install_dependencies.sh to install hugepages using tt-system-tools hugepages service
- PR: #15953
- delete stale/(now) invalid assert after recent update to use virtual …
- PR: #16313
- Fix CB Overflow issue on certain transposes and permutes
- PR: #16155
- Removing LegacyShape from Tensor::pad
- PR: #16424
- Add experimental APIs to access Hal
- PR: #16426
- Remove documentation references to "setup_hugepages.py"
- PR: #16428
- #16175: Add DPRINT TileSlice support for int types
- PR: #16413
- Fix remaining minor input/output issues with TG-Llama3 vLLM integration
- PR: #16437
- #0: Reshuffle some logic in resize_remote_sender/receiver_cb_interface to fix perf degradation in some models
- PR: #16436
- Move conv specific ops from tensor_utils to conv2d.
- PR: #16373
- Support all ND shapes for tilize/untilize
- PR: #16299
- Remove unused ARCH_NAME specific includes "eth_l1_address_map.h"
- PR: #16445
- #0: Fix failing test case for width sharded non-32 multiple output width
- PR: #16224
- #15605: Only force-stall ethernet programs on earlier ethernet programs
- PR: #16401
- #16339: parameterize dispatch_constants
- PR: #16355
- Ucheema/tt fabric arch md
- PR: #16456
- Add `ttnn.conv2d` unit tests for UNet Shallow at groups=4,6,8
- PR: #16452
- Pad greater than 4D
- PR: #16453
- [tt-train] Memory efficient option to run GPT2
- PR: #16205
- #15732: add matmul block h/w parameter processing
- PR: #15938
- #0: Enable unity for sublibraries
- PR: #16450
- Remove redundant function determine_parallel_config_non_tile_mul_width.
- PR: #15955
- Add support for tiled indices via padding/alignment aware embedding kernel (tiled indices only)
- PR: #16296
- Bw sharded sweeps: neg_bw, log_bw, relu_bw, relu6_bw, leaky_relu_bw, rsqrt_bw
- PR: #16344
- Conv2dConfig: default reallocate_halo_output to true
- PR: #16185
- [Llama3] Change prefill padding in LlamaGenerator to nearest 2048 and optimize chunked prefill readback
- PR: #16472
- Added check for global non-constexpr uint64_t value in kernel
- PR: #16476
- Update CONTRIBUTING.md
- PR: #16475
- Dedicated target for HostDevCommon
- PR: #16493
- Fix bug when calling CreateDevice in a loop on TG
- PR: #16260
- Fix cb allocation errors for halo and conv2d
- PR: #16190
- The library is the authority on include dir locations, not the consumers
- PR: #16164
- #0: fix corerange handling in ROPE
- PR: #16444
- undo revert of #16247
- PR: #16430
- #16495: update test pccs after matmul changes and skip test with ND PCC failure
- PR: #16498
- Reserve vector in cluster function
- PR: #16507
- Xuncai/flash decode bugfix
- PR: #16362