Releases: tenstorrent/tt-metal
v0.55.0-rc1
Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, not on the main branch. There may be differences between the latest main and the previous release.
The changelog will now follow, showing the changes from last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12779718295
📦 Uncategorized
- Add noc read/write burst command support to CCL command kernel. Also add automated command lowering to these noc commands
- PR: #16461
- MeshWorkload: Initial Implementation
- PR: #16405
- [CCL] Fix padding issues
- PR: #16347
- #15868: use a buffer's size when creating its CB in groupnorm
- PR: #16093
- Fix trace region size
- PR: #16519
- #0: Bump E2E perf threshold for host bound WH Resnet variants
- PR: #16522
- Extract Device interface
- PR: #16482
- Extend graph capture to include device information
- PR: #16408
- Quick fix replacing Device* with IDevice in graph tracker
- PR: #16532
- #0: Add unit_tests_ttnn_tensor to post-commit
- PR: #16211
- Xuncai/ccl global sem
- PR: #16455
- #16153: Add fused activations to input tensors
- PR: #16283
- Remove ARCH_NAME specific includes from erisc_datamover_builder
- PR: #16505
- remove unused function
- PR: #16537
- [TT-Train] Updates related to the fixed matmul
- PR: #16540
- [Llama3] Add max prefill chunk sizes for different model/device combinations
- PR: #16508
- Add sharded sweeps identity, neg, selu, abs
- PR: #15999
- Handle padded shards in ttnn.convert_to_chw
- PR: #15915
- #16492: Add new APIs for setting which sub_device_ids to stall on
- PR: #16473
- #0: Track local_cb_size to ensure that remote cb config is correctly sent by FD
- PR: #16542
- support keepdim for prod
- PR: #16370
- #16225: Int32 support for abs
- PR: #16226
- Sharded sweeps: prelu, softmax, sinh, softplus, relu_max and relu_min
- PR: #16050
- Changing output channel size in the readme example
- PR: #16303
- Fix double move in TTNN invoke_composite launch_op
- PR: #16551
- Quick fix how to storage/access for devices in the DevicePool
- PR: #16550
- Add native N-dimensional tiled-interleaved permute support when the tiles are not broken.
- PR: #16468
- fix multi-iter in reduce scatter and adopt runtime arg overrider infra
- PR: #16531
- [tt-train] Add linear regression ddp example
- PR: #16245
- Remove eth_l1_address_params.h from device.cpp
- PR: #16538
- Sharded sweeps: exp, exp2, expm1, erfc, erfinv, round, log
- PR: #16323
- Fix ttnn.concat golden function when groups > 1
- PR: #16556
- #16171: Assert that NCRISC NOC is idle at kernel end.
- PR: #16471
- Remove eth_l1_address_params.h from tt_cluster.cpp and watcher
- PR: #16568
- Remove dev_mem_map.h usage from watcher_device_reader.cpp
- PR: #16572
- #14616: Remove ARCH_* ifdefs from tt_cluster.cpp
- PR: #13354
- Add support for DRAM Prefetcher op
- PR: #16244
- Resolve reduce-scatter-async sharded tensor correctness bug & hang
- PR: #16548
- disable flaky t3k test
- PR: #16583
- Remove "noc_parameters.h" from device.cpp
- PR: #16582
- Remove restriction of input_nsticks_per_core % w == 0
- PR: #15205
- Add tt-forge sweep for conv2d.
- PR: #16178
- Remove noc header file inclusion from watcher_device_reader.cpp
- PR: #16589
- Fix ttnn.from_torch for 0D/1D tensors with tile layout
- PR: #16484
- Short list failing conv2d for forge sweeps
- PR: #16597
- Remove halo from shard spec
- PR: #15900
- Address issues of var & std
- PR: #16545
- #16492: Remove sub_device_ids apis from various read/write functions throughout the stack
- PR: #16565
- #6344: Update RoBERTa QA demo
- PR: #8896
- Remove noc_parameters.h inclusion from ttnn
- PR: #16593
- Resubmit #16339: parameterize dispatch_constants
- PR: #16478
- #11512: Refactor bitwise sweeps, add bitwise sharded sweeps, modify t…
- PR: #15704
- Update CODEOWNERS
- PR: #16604
- Enable multi-core and fixing bfloat8 for untilize with unpadding
- PR: #16555
- Set up targeting idle eth cores on BH - won't enable because of hang debug
- PR: #14817
- Reorganize Print Pages Infrastructure
- PR: #16463
- lower fabric erisc datamover eth context switching frequency when workload is running
- PR: #16610
- Composite binary sweeps: gcd and lcm
- PR: #16423
- Remove ARCH_NAME from host library code
- PR: #16616
- [tt-train] Add nanogpt ddp mode
- PR: #16614
- #16312: Fix full op to query physical shape for buffer volume
- PR: #16562
- #16366: Changed default kernal_config_val for 32bit matmul
- PR: #16567
- #16621: Add barriers at end of cq_dispatch_slave.cpp
- PR: #16624
- Build wheels in models unit tests workflow
- PR: #16615
- Mo/10234 eth dispatch profiling
- PR: #15609
- Support subcoregrids in concat_heads
- PR: #16223
- Build wheels in ttnn unit tests workflow because the tests need it and we forgot to put it in
- PR: #16605
- #16590: profiler trace detection fix
- PR: #16591
- #16503: Optimize CoreRangeSets for CBs and semaphores
- PR: #16549
- Revert "#16621: Add barriers at end of cq_dispatch_slave.cpp"
- PR: #16645
- Fix nightly stable diffusion tests
- PR: #16629
- #0: Used github team for conv files
- PR: #16563
- Sweeps: fixed abs, added acos and acosh sharded and non sharded
- PR: #16381
- fix reduce scatter multi-link support bug
- PR: #16636
- support i/p tensors of all dimensions/rank for prod operation
- PR: #16301
- Create Infrastructure to exactly calculate L1 Memory Usage for Conv2D #15088
- PR: #15455
- #12253: Implement Batch norm operation for inference mode
- PR: #16432
- Port all experimental ops to compute_output_specs
- PR: #16595
- #16443: Add a programming example of vecadd_multi_core and gtest
- PR: #16446
- Enable to/from torch tests for 0D/1D tensors
- PR: #16653
- Port all data movements ops to compute_output_specs
- PR: #16652
- #15246: Add sweep tests for addcdiv, addcmul, rdiv, rsub, ceil
- PR: #15998
- Fix build break
- PR: #16656
- Logical sharding for input tensor and halo output
- PR: #16517
- #16495: reduce grid for falcon7b mlp matmul
- PR: #16569
- Stress NOC mcast test
- PR: #16639
- [skip ci] Update subdevice doc
- PR: #16669
- Read from and write to partial buffer regions for interleaved buffers where offset and size of specified buffer region are divisible by buffer page size
- PR: #16102
- Fix resnet large on GS
- PR: #16665
- Fix Pre-allgather Layernorm bad PCC when use 1D reduction
- PR: #16622
- #16353: skip no volume tensors
- PR: #16619
- Create README.md
- PR: #16675
- Update README.md
- PR: #16676
- #16367: Added support to enable dram and l1 memory collection without saving to disk
- PR: #16368
- Update .clang-format-ignore
- PR: #16681
- Tweak BH csrrs init code
- PR: #16682
- #0: Clean up confusing refs to Greyskull from ttnn.copy error messages.
- PR: #16647
- Update perf and latest features for llm models (Jan 13)
- PR: #16677
- Update README.md
- PR: #16702
- #16657: Fix to_layout conversion into row major for 1D tensors
- PR: #16684
- Tilize with val padding results in L1 cache OOM
- PR: #16633
- #0: Fixes from commit ae61802
- PR: #16686
- #0: Skip build-docker-image during post-commit code-analysis since the docker image is already built in a previous job
- PR: #16703
- Generate test executables per architecture
- PR: #16594
- #16587: Update UMD submodule commit for P150 compatibility
- PR: #16709
- Replace some instances of Tensor::get_shape with get_logical_shape
- PR: #16655
- Update METALIUM_GUIDE.md
- PR: #16602
- #16621: Add barriers at end of cq_dispatch_slave.cpp on IERISC
- PR: #16666
- Finish porting OPs to compute_output_specs
- PR: #16695
- ScopedGraphCapture
- PR: #15774
- #15756 Pull in BH LLK fix for maxpool hang
- PR: #16663
- #15246: Add sweep tests for logical_and, logical_or, logical_xor
- PR: #16132
- #0: (MINOR) Bump to v0.55.0
- PR: #16714
- #11512: Add sweeps for eltwise sharded ops 3
- PR: #16307
- Add sweeps for unary, unary_sharded and binary_sharded versions of ops: fmod, remainder, maximum, minimum.
- PR: #15911
- Don't leak tt_cluster.hpp through kernel_types.hpp
- PR: #16691
- #6983: Re-enable skipped TT-NN unit test
- PR: #16642
- #15450: Remove default values from circular buffer parameters in LLK compute APIs
- PR: #16389
- update build flag on programming examples docs
- PR: #16635
- Fix for P100 board type
- PR: #16718
- Sever TT-Train's dependency on TT-Metalium's tests
- PR: #16685
- [TT-Train] Update generate of LLM
- PR: #16723
- [TT-Train] Add bias=false in LinearLayer
- PR: #16707
- TT-Fabric Bringup Initial Check-in
- PR: #16343
- #0: Sanitize writes to mailbox on ethernet cores.
- PR: #16574
- Add Llama11B-N300 and Llama70B-TG (TP=32) to LLM table in README.md
- PR: #16724
- [skip ci] Update llms.md
- PR: #16737
- Update test_slice.py
- PR: #16734
- #16625: Refactor tracking of sub-device managers from Device to a new class
- PR: #16683
- Update code-analysis.yaml
- PR: #16738
- [skip ci] Update llms.md
- PR: #16745
- remove references to LFS
- PR: #16722
v0.54.0-rc23
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12759327887
📦 Uncategorized
- Isolate tracy
- PR: #16161
- #5605: Only force-stall ethernet programs on earlier ethernet programs
- PR: #16202
- #14976/#15039: Add Support For ceil_mode=True
- PR: #16124
- Add missing cache invalidates + loads before stores noc optimization for BH
- PR: #16037
- Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
- PR: #16026
- New FD Init Flow
- PR: #15406
- Add support for output sharded embeddings
- PR: #16237
- Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
- PR: #16257
- #0: Enforce tile layout when using bf4/bf8 data types
- PR: #16199
- MeshDevice: Support Quanta Galaxy system file
- PR: #16239
- Move Device members from public to private
- PR: #16256
- Add unary sharded sweeps
- PR: #15300
- #0: Added core_grid offset for sharded layernorm
- PR: #16207
- fix abs path bug for sweeps tests code
- PR: #16285
- #0: Publish TT-Distributed doc under tech_reports
- PR: #16261
- #15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
- PR: #16105
- #16265: Remove creation op
- PR: #16269
- Fix unsigned arithmetic bugs in reshape ops
- PR: #16253
- Fix compile issue for earlier c++ versions
- PR: #16291
- #0: Typo fix in TT distributed tech report
- PR: #16308
- [Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
- PR: #16292
- LLM tech report sections 3.1, 3.4, 3.5
- PR: #15110
- LLM Tech report section 4.4
- PR: #15166
- Move some Device methods to private section
- PR: #16259
- #0: [skip_ci] Update Distributed Tech Report with Discord Server link
- PR: #16314
- #15857: Binary Forge Sweep Tests Set1
- PR: #16042
- #0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
- PR: #16290
- #0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small
- PR: #16315
v0.54.0
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12768962484
📦 Uncategorized
- Isolate tracy
- PR: #16161
- #0: Enforce tile layout when using bf4/bf8 data types
- PR: #16199
- MeshDevice: Support Quanta Galaxy system file
- PR: #16239
- Move Device members from public to private
- PR: #16256
- Add unary sharded sweeps
- PR: #15300
- #0: Added core_grid offset for sharded layernorm
- PR: #16207
- fix abs path bug for sweeps tests code
- PR: #16285
- #0: Publish TT-Distributed doc under tech_reports
- PR: #16261
- #15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
- PR: #16105
- #16265: Remove creation op
- PR: #16269
- Fix unsigned arithmetic bugs in reshape ops
- PR: #16253
- Fix compile issue for earlier c++ versions
- PR: #16291
- #0: Typo fix in TT distributed tech report
- PR: #16308
- [Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
- PR: #16292
- LLM tech report sections 3.1, 3.4, 3.5
- PR: #15110
- LLM Tech report section 4.4
- PR: #15166
- Move some Device methods to private section
- PR: #16259
- #0: [skip_ci] Update Distributed Tech Report with Discord Server link
- PR: #16314
- #15857: Binary Forge Sweep Tests Set1
- PR: #16042
- #0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
- PR: #16290
- #0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small
- PR: #16315
- #16066: Add seed param to uniform and bernoulli ops
- PR: #16179
- #0: Add StrongType to help creating non-clashing alias types
- PR: #16309
- #0: Fix ccl workers not starting
- PR: #16333
- #15642: Replace shapes in eltwise
- PR: #15646
- Remove old fd init code path
- PR: #16321
- Remove more namespace pollution caused by using namespace tt::tt_metal in header file
- PR: #16342
- #0: make dependent configs dependent
- PR: #16324
- #13643: Extend binary-ng math support to match all primitive binary ops
- PR: #16276
- Fix wrong output tensor shape for prod
- PR: #16334
- Update CODEOWNERS
- PR: #16358
- Add subdevice support to multicore untilize
- PR: #16193
- add multi-iteration support to reduce scatter async
- PR: #16294
- #16356: Program Dispatch Modifications for MeshWorkload
- PR: #16361
- Refactor conv files using clang-format
- PR: #16340
- #15338: Fix watcher using the wrong cmd bufs for addr sanitization when using dynamic noc
- PR: #16363
- Add cluster-axis API support to reduce scatter
- PR: #16293
- split ttnn unit tests 8 ways
- PR: #16382
- split ttnn tests into 10 groups
- PR: #16383
- #0: Fixes for remote circular buffer synchronization
- PR: #16378
- #0: Initial tech report for Sub-Device feature
- PR: #16387
- Adapt to tt-system-tools hugepages configuration
- PR: #14396
- Further removal of Shape/LegacyShape in order to allow 0D/1D tensors
- PR: #16337
- #16134: add test cases for pre-allocated CreateBuffer / ttnn::event_query
- PR: #16135
- setting multi-core for tilize with padding
- PR: #16252
- reshape assert fix
- PR: #16300
- #16165: Add binary SFPU divide init function
- PR: #16250
- #15879: supported subcoregrid for createqkv heads
- PR: #15972
- Reimplemented dropout as separate op.
- PR: #16328
- #16356: Reland Program Dispatch Modifications for MeshWorkload
- PR: #16385
- support all dim lengths for reduction
- PR: #16247
- Check that writes don't go to below the ringbuffer
- PR: #16399
- #16390: Move reduce_scatter_async into experimental namespace and enable cluster api tests
- PR: #16407
- Typecast in ng
- PR: #16317
- Speed up linking for incremental builds.
- PR: #15994
- #0: Don't return shared ptrs of global sems/cbs, and directly return the object instead
- PR: #16354
- Add support for act_block_h_override to Width Sharded Conv2d
- PR: #16374
- #0: Fix CMakeLists
- PR: #16417
- Update install_dependencies.sh to install hugepages using tt-system-tools hugepages service
- PR: #15953
- delete stale/(now) invalid assert after recent update to use virtual …
- PR: #16313
- Fix CB Overflow issue on certain transposes and permutes
- PR: #16155
- Removing LegacyShape from Tensor::pad
- PR: #16424
- Add experimental APIs to access Hal
- PR: #16426
- Remove documentation references to "setup_hugepages.py"
- PR: #16428
- #16175: Add DPRINT TileSlice support for int types
- PR: #16413
- Fix remaining minor input/output issues with TG-Llama3 vLLM integration
- PR: #16437
- #0: Reshuffle some logic in resize_remote_sender/receiver_cb_interface to fix perf degradation in some models
- PR: #16436
- Move conv specific ops from tensor_utils to conv2d.
- PR: #16373
- Support all ND shapes for tilize/untilize
- PR: #16299
- Remove unused ARCH_NAME specific includes "eth_l1_address_map.h"
- PR: #16445
- #0: Fix failing test case for width sharded non-32 multiple output width
- PR: #16224
- #15605: Only force-stall ethernet programs on earlier ethernet programs
- PR: #16401
- #16339: parameterize dispatch_constants
- PR: #16355
- Ucheema/tt fabric arch md
- PR: #16456
- Add ttnn.conv2d unit tests for UNet Shallow at groups=4,6,8
- PR: #16452
- Pad greater than 4D
- PR: #16453
- [tt-train] Memory efficient option to run GPT2
- PR: #16205
- #15732: add matmul block h/w parameter processing
- PR: #15938
- #0: Enable unity for sublibraries
- PR: #16450
- Remove redundant function determine_parallel_config_non_tile_mul_width.
- PR: #15955
- Add support for tiled indices via padding/alignment aware embedding kernel (tiled indices only)
- PR: #16296
- Bw sharded sweeps: neg_bw, log_bw, relu_bw, relu6_bw, leaky_relu_bw, rsqrt_bw
- PR: #16344
- Conv2dConfig reallocate_halo_output default to true
- PR: #16185
- [Llama3] Change prefill padding in LlamaGenerator to nearest 2048 and optimize chunked prefill readback
- PR: #16472
- Added check for global non-constexpr uint64_t value in kernel
- PR: #16476
- Update CONTRIBUTING.md
- PR: #16475
- Dedicated target for HostDevCommon
- PR: #16493
- Fix bug when calling CreateDevice in a loop on TG
- PR: #16260
- Fix cb allocation errors for halo and conv2d
- PR: #16190
- The library is the authority on include dir locations, not the consumers
- PR: #16164
- #0: fix corerange handling in ROPE
- PR: #16444
- undo revert of #16247
- PR: #16430
- #16495: update test pccs after matmul changes and skip test with ND PCC failure
- PR: #16498
- Reserve vector in cluster function
- PR: #16507
- Xuncai/flash decode bugfix
- PR: #16362
v0.54.0-rc22
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12739066348
📦 Uncategorized
- Add buffering to DPRINT
- PR: #15677
- Isolate tracy
- PR: #16161
- #16184: Try using ecr to avoid rate limits of docker.io
- PR: #16201
- #15221: Post completion messages to dispatch_s
- PR: #16187
- [TT-Train] Added softmax backward
- PR: #16168
- Optimized FreeList allocator
- PR: #15536
- Set the test data to be relative to the test binary
- PR: #16150
- #0: Fix matmul doc string
- PR: #16208
- #0: remove spammy warning from conftest
- PR: #16198
- Update generating unicast go signal commands to ensure dispatch write linear respects alignment
- PR: #16117
- LLM tech report sections 2.2, 2.5
- PR: #15121
- [TT-Train] Fix tracy deps in the tt-train cmake
- PR: #16209
- Updating Allocator docs to explain first fit usage
- PR: #16214
- Adding asserts for hanging cases in ND tilize/untilize support
- PR: #16170
- Fix ttnn.reallocate when unaligned RM tensors are used
- PR: #16192
- #15891: improve full accuracy and fix full bugs
- PR: #16182
- Revert "Fix ttnn.from_torch for 0D/1D tensors with tile layout (#15882)"
- PR: #16222
- #15857: Skip abs forge for GS
- PR: #16221
- #16213: Use our own forked Docker Run Action that points to ECR
- PR: #16219
- Add max kernel size for each risc type in an op
- PR: #16203
- Infer Conv2dTranspose parameters during model preprocessing
- PR: #16028
- #12662: add keepdim fixes to reduce
- PR: #16163
- Add chunked prefill to Llama family
- PR: #16111
- #15342: Add mirror_kernels option to conv_transpose2d
- PR: #15995
- Update CODEOWNERS
- PR: #16196
- support reduction for 3d & 4d dims
- PR: #16236
- #5605: Only force-stall ethernet programs on earlier ethernet programs
- PR: #16202
- Add full support for creating tensors with logical sharding from python
- PR: #16072
- update llama 3.1 70b v0 tt-metal and vllm commit refs in docs
- PR: #16246
- #15857: Binary Forge Sweep Tests Set2
- PR: #16087
- #14976/#15039: Add Support For ceil_mode=True
- PR: #16124
- Add missing cache invalidates + loads before stores noc optimization for BH
- PR: #16037
- Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
- PR: #16026
- New FD Init Flow
- PR: #15406
- Add support for output sharded embeddings
- PR: #16237
- Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
- PR: #16257
- #0: Enforce tile layout when using bf4/bf8 data types
- PR: #16199
- MeshDevice: Support Quanta Galaxy system file
- PR: #16239
- Move Device members from public to private
- PR: #16256
- Add unary sharded sweeps
- PR: #15300
- #0: Added core_grid offset for sharded layernorm
- PR: #16207
- fix abs path bug for sweeps tests code
- PR: #16285
- #0: Publish TT-Distributed doc under tech_reports
- PR: #16261
- #15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
- PR: #16105
- #16265: Remove creation op
- PR: #16269
- Fix unsigned arithmetic bugs in reshape ops
- PR: #16253
- Fix compile issue for earlier c++ versions
- PR: #16291
- #0: Typo fix in TT distributed tech report
- PR: #16308
- [Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
- PR: #16292
- LLM tech report sections 3.1, 3.4, 3.5
- PR: #15110
- LLM Tech report section 4.4
- PR: #15166
- Move some Device methods to private section
- PR: #16259
- #0: [skip_ci] Update Distributed Tech Report with Discord Server link
- PR: #16314
- #15857: Binary Forge Sweep Tests Set1
- PR: #16042
- #0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
- PR: #16290
- #0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small
- PR: #16315
v0.54.0-rc21
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12719934369
📦 Uncategorized
- Add buffering to DPRINT
- PR: #15677
- Clean-up the usage of deallocate_activation
- PR: #16099
- llm tech report multi device section
- PR: #16180
- Add prefill v decode section to LLM tech report [section 3.2]
- PR: #15096
- #0: Update eltwise binary to support sharding on arbitrary cores on an arbitrary sub-device grid
- PR: #16024
- [LLM tech report] Add accuracy evaluation and debugging sections
- PR: #15190
- #16165: Disabling test that depends on some machine state to pass
- PR: #16166
- enable dps ops for matmul
- PR: #15285
- Isolate tracy
- PR: #16161
- [TT-Train] added tests for sum and mean
- PR: #16152
- #16184: Try using ecr to avoid rate limits of docker.io
- PR: #16201
- #15221: Post completion messages to dispatch_s
- PR: #16187
- [TT-Train] Added softmax backward
- PR: #16168
- Optimized FreeList allocator
- PR: #15536
- Set the test data to be relative to the test binary
- PR: #16150
- #0: Fix matmul doc string
- PR: #16208
- #0: remove spammy warning from conftest
- PR: #16198
- Update generating unicast go signal commands to ensure dispatch write linear respects alignment
- PR: #16117
- LLM tech report sections 2.2, 2.5
- PR: #15121
- [TT-Train] Fix tracy deps in the tt-train cmake
- PR: #16209
- Updating Allocator docs to explain first fit usage
- PR: #16214
- Adding asserts for hanging cases in ND tilize/untilize support
- PR: #16170
- Fix ttnn.reallocate when unaligned RM tensors are used
- PR: #16192
- #15891: improve full accuracy and fix full bugs
- PR: #16182
- Revert "Fix ttnn.from_torch for 0D/1D tensors with tile layout (#15882)"
- PR: #16222
- #15857: Skip abs forge for GS
- PR: #16221
- #16213: Use our own forked Docker Run Action that points to ECR
- PR: #16219
- Add max kernel size for each risc type in an op
- PR: #16203
- Infer Conv2dTranspose parameters during model preprocessing
- PR: #16028
- #12662: add keepdim fixes to reduce
- PR: #16163
- Add chunked prefill to Llama family
- PR: #16111
- #15342: Add mirror_kernels option to conv_transpose2d
- PR: #15995
- Update CODEOWNERS
- PR: #16196
- support reduction for 3d & 4d dims
- PR: #16236
- #5605: Only force-stall ethernet programs on earlier ethernet programs
- PR: #16202
- Add full support for creating tensors with logical sharding from python
- PR: #16072
- update llama 3.1 70b v0 tt-metal and vllm commit refs in docs
- PR: #16246
- #15857: Binary Forge Sweep Tests Set2
- PR: #16087
- #14976/#15039: Add Support For ceil_mode=True
- PR: #16124
- Add missing cache invalidates + loads before stores noc optimization for BH
- PR: #16037
- Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
- PR: #16026
- New FD Init Flow
- PR: #15406
- Add support for output sharded embeddings
- PR: #16237
- Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
- PR: #16257
- #0: Enforce tile layout when using bf4/bf8 data types
- PR: #16199
- MeshDevice: Support Quanta Galaxy system file
- PR: #16239
- Move Device members from public to private
- PR: #16256
- Add unary sharded sweeps
- PR: #15300
- #0: Added core_grid offset for sharded layernorm
- PR: #16207
- fix abs path bug for sweeps tests code
- PR: #16285
- #0: Publish TT-Distributed doc under tech_reports
- PR: #16261
- #15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
- PR: #16105
- #16265: Remove creation op
- PR: #16269
- Fix unsigned arithmetic bugs in reshape ops
- PR: #16253
- Fix compile issue for earlier c++ versions
- PR: #16291
- #0: Typo fix in TT distributed tech report
- PR: #16308
- [Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
- PR: #16292
- LLM tech report sections 3.1, 3.4, 3.5
- PR: #15110
- LLM Tech report section 4.4
- PR: #15166
- Move some Device methods to private section
- PR: #16259
- #0: [skip_ci] Update Distributed Tech Report with Discord Server link
- PR: #16314
- #15857: Binary Forge Sweep Tests Set1
- PR: #16042
- #0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
- PR: #16290
- #0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small
- PR: #16315
v0.54.0-rc20
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12701599118
📦 Uncategorized
- Add buffering to DPRINT
- PR: #15677
- Python -> Python3
- PR: #16063
- #15713 Bad Eltwise Binary ZEROACC
- PR: #16094
- #15565 Fix unit test to show sharding ttnn.from_torch problems
- PR: #16088
- Fix paged SDPA decode CB sizing issue
- PR: #16059
- Reland async dispatch with workaround for hang.
- PR: #16121
- #16119: Add forge traces to matmul and reduce sweeps
- PR: #16139
- #10034: Binary shift operators
- PR: #16055
- #0: Remove incorrect memory span assert
- PR: #16136
- Add forge sweeps for slice and transpose
- PR: #16112
- #0: Move memory config serialization in the corresponding header away from types.hpp
- PR: #16151
- #16114: Allow Binarized Programs to be Reused across WH Devices
- PR: #16120
- #0: aligning conv2d transpose as conv
- PR: #16128
- support missing cases for sweep tests
- PR: #15804
- #0: added normalization details in the tech report
- PR: #15124
- Fix ttnn.from_torch for 0D/1D tensors with tile layout
- PR: #15882
- Port all Moreh OPs to compute_output_specs
- PR: #16160
- Bump umd to fix grayskull cluster bug
- PR: #16126
- Clean-up the usage of deallocate_activation
- PR: #16099
- llm tech report multi device section
- PR: #16180
- Add prefill v decode section to LLM tech report [section 3.2]
- PR: #15096
- #0: Update eltwise binary to support sharding on arbitrary cores on an arbitrary sub-device grid
- PR: #16024
- [LLM tech report] Add accuracy evaluation and debugging sections
- PR: #15190
- #16165: Disabling test that depends on some machine state to pass
- PR: #16166
- enable dps ops for matmul
- PR: #15285
- Isolate tracy
- PR: #16161
- [TT-Train] added tests for sum and mean
- PR: #16152
- #16184: Try using ecr to avoid rate limits of docker.io
- PR: #16201
- #15221: Post completion messages to dispatch_s
- PR: #16187
- [TT-Train] Added softmax backward
- PR: #16168
- Optimized FreeList allocator
- PR: #15536
- Set the test data to be relative to the test binary
- PR: #16150
- #0: Fix matmul doc string
- PR: #16208
- #0: remove spammy warning from conftest
- PR: #16198
- Update generating unicast go signal commands to ensure dispatch write linear respects alignment
- PR: #16117
- LLM tech report sections 2.2, 2.5
- PR: #15121
- [TT-Train] Fix tracy deps in the tt-train cmake
- PR: #16209
- Updating Allocator docs to explain first fit usage
- PR: #16214
- Adding asserts for hanging cases in ND tilize/untilize support
- PR: #16170
- Fix ttnn.reallocate when unaligned RM tensors are used
- PR: #16192
- #15891: improve full accuracy and fix full bugs
- PR: #16182
- Revert "Fix ttnn.from_torch for 0D/1D tensors with tile layout (#15882)"
- PR: #16222
- #15857: Skip abs forge for GS
- PR: #16221
- #16213: Use our own forked Docker Run Action that points to ECR
- PR: #16219
- Add max kernel size for each risc type in an op
- PR: #16203
- Infer Conv2dTranspose parameters during model preprocessing
- PR: #16028
- #12662: add keepdim fixes to reduce
- PR: #16163
- Add chunked prefill to Llama family
- PR: #16111
- #15342: Add mirror_kernels option to conv_transpose2d
- PR: #15995
- Update CODEOWNERS
- PR: #16196
- support reduction for 3d & 4d dims
- PR: #16236
- #5605: Only force-stall ethernet programs on earlier ethernet programs
- PR: #16202
- Add full support for creating tensors with logical sharding from python
- PR: #16072
- update llama 3.1 70b v0 tt-metal and vllm commit refs in docs
- PR: #16246
- #15857: Binary Forge Sweep Tests Set2
- PR: #16087
- #14976/#15039: Add Support For ceil_mode=True
- PR: #16124
- Add missing cache invalidates + loads before stores noc optimization for BH
- PR: #16037
- Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
- PR: #16026
- New FD Init Flow
- PR: #15406
- Add support for output sharded embeddings
- PR: #16237
- Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
- PR: #16257
- #0: Enforce tile layout when using bf4/bf8 data types
- PR: #16199
- MeshDevice: Support Quanta Galaxy system file
- PR: #16239
- Move Device members from public to private
- PR: #16256
- Add unary sharded sweeps
- PR: #15300
- #0: Added core_grid offset for sharded layernorm
- PR: #16207
- fix abs path bug for sweeps tests code
- PR: #16285
- #0: Publish TT-Distributed doc under tech_reports
- PR: #16261
- #15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
- PR: #16105
- #16265: Remove creation op
- PR: #16269
- Fix unsigned arithmetic bugs in reshape ops
- PR: #16253
- Fix compile issue for earlier c++ versions
- PR: #16291
- #0: Typo fix in TT distributed tech report
- PR: #16308
- [Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
- PR: #16292
- LLM tech report sections 3.1, 3.4, 3.5
- PR: #15110
- LLM Tech report section 4.4
- PR: #15166
- Move some Device methods to private section
- PR: #16259
- #0: [skip_ci] Update Distributed Tech Report with Discord Server link
- PR: #16314
- #15857: Binary Forge Sweep Tests Set1
- PR: #16042
- #0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
- PR: #16290
- #0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small
- PR: #16315
v0.54.0-rc19
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12662398466
📦 Uncategorized
- Add buffering to DPRINT
- PR: #15677
- Python -> Python3
- PR: #16063
- #0: separate validation of conv weight and bias.
- PR: #15990
- #0: Minor refactor of pytensor and tensor implementation files
- PR: #16108
- C++ files should not be part of the API of a library
- PR: #16123
- #15857: Forge sweep test
- PR: #15858
- #15857: Unary forge sweep tests
- PR: #15901
- Fix some more namespace pollution caused by using namespace tt::tt_metal
- PR: #16090
- #15713 Bad Eltwise Binary ZEROACC
- PR: #16094
- #15565 Fix unit test to show sharding ttnn.from_torch problems
- PR: #16088
- Fix paged SDPA decode CB sizing issue
- PR: #16059
- Reland async dispatch with workaround for hang.
- PR: #16121
- #16119: Add forge traces to matmul and reduce sweeps
- PR: #16139
- #10034: Binary shift operators
- PR: #16055
- #0: Remove incorrect memory span assert
- PR: #16136
- Add forge sweeps for slice and transpose
- PR: #16112
- #0: Move memory config serialization in the corresponding header away from types.hpp
- PR: #16151
- #16114: Allow Binarized Programs to be Reused across WH Devices
- PR: #16120
- #0: aligning conv2d transpose as conv
- PR: #16128
- support missing cases for sweep tests
- PR: #15804
- #0: added normalization details in the tech report
- PR: #15124
- Fix ttnn.from_torch for 0D/1D tensors with tile layout
- PR: #15882
- Port all Moreh OPs to compute_output_specs
- PR: #16160
- Bump umd to fix grayskull cluster bug
- PR: #16126
- Clean-up the usage of deallocate_activation
- PR: #16099
- llm tech report multi device section
- PR: #16180
- Add prefill v decode section to LLM tech report [section 3.2]
- PR: #15096
- #0: Update eltwise binary to support sharding on arbitrary cores on an arbitrary sub-device grid
- PR: #16024
- [LLM tech report] Add accuracy evaluation and debugging sections
- PR: #15190
- #16165: Disabling test that depends on some machine state to pass
- PR: #16166
- enable dps ops for matmul
- PR: #15285
- Isolate tracy
- PR: #16161
- [TT-Train] Added tests for sum and mean
- PR: #16152
- #16184: Try using ecr to avoid rate limits of docker.io
- PR: #16201
- #15221: Post completion messages to dispatch_s
- PR: #16187
- [TT-Train] Added softmax backward
- PR: #16168
- Optimized FreeList allocator
- PR: #15536
- Set the test data to be relative to the test binary
- PR: #16150
- #0: Fix matmul doc string
- PR: #16208
- #0: remove spammy warning from conftest
- PR: #16198
- Update generating unicast go signal commands to ensure dispatch write linear respects alignment
- PR: #16117
- LLM tech report sections 2.2, 2.5
- PR: #15121
- [TT-Train] Fix tracy deps in the tt-train cmake
- PR: #16209
- Updating Allocator docs to explain first fit usage
- PR: #16214
- Adding asserts for hanging cases in ND tilize/untilize support
- PR: #16170
- Fix ttnn.reallocate when unaligned RM tensors are used
- PR: #16192
- #15891: improve full accuracy and fix full bugs
- PR: #16182
- Revert "Fix ttnn.from_torch for 0D/1D tensors with tile layout (#15882)"
- PR: #16222
- #15857: Skip abs forge for GS
- PR: #16221
- #16213: Use our own forked Docker Run Action that points to ECR
- PR: #16219
- Add max kernel size for each risc type in an op
- PR: #16203
- Infer Conv2dTranspose parameters during model preprocessing
- PR: #16028
- #12662: add keepdim fixes to reduce
- PR: #16163
- Add chunked prefill to Llama family
- PR: #16111
- #15342: Add mirror_kernels option to conv_transpose2d
- PR: #15995
- Update CODEOWNERS
- PR: #16196
- support reduction for 3d & 4d dims
- PR: #16236
- #5605: Only force-stall ethernet programs on earlier ethernet programs
- PR: #16202
- Add full support for creating tensors with logical sharding from python
- PR: #16072
- update llama 3.1 70b v0 tt-metal and vllm commit refs in docs
- PR: #16246
- #15857: Binary Forge Sweep Tests Set2
- PR: #16087
- #14976/#15039: Add Support For ceil_mode=True
- PR: #16124
- Add missing cache invalidates + loads before stores noc optimization for BH
- PR: #16037
- Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
- PR: #16026
- New FD Init Flow
- PR: #15406
- Add support for output sharded embeddings
- PR: #16237
- Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
- PR: #16257
- #0: Enforce tile layout when using bf4/bf8 data types
- PR: #16199
- MeshDevice: Support Quanta Galaxy system file
- PR: #16239
- Move Device members from public to private
- PR: #16256
- Add unary sharded sweeps
- PR: #15300
- #0: Added core_grid offset for sharded layernorm
- PR: #16207
- fix abs path bug for sweeps tests code
- PR: #16285
- #0: Publish TT-Distributed doc under tech_reports
- PR: #16261
- #15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
- PR: #16105
- #16265: Remove creation op
- PR: #16269
- Fix unsigned arithmetic bugs in reshape ops
- PR: #16253
- Fix compile issue for earlier c++ versions
- PR: #16291
- #0: Typo fix in TT distributed tech report
- PR: #16308
- [Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
- PR: #16292
- LLM tech report sections 3.1, 3.4, 3.5
- PR: #15110
- LLM Tech report section 4.4
- PR: #15166
- Move some Device methods to private section
- PR: #16259
- #0: [skip_ci] Update Distributed Tech Report with Discord Server link
- PR: #16314
- #15857: Binary Forge Sweep Tests Set1
- PR: #16042
- #0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
- PR: #16290
- #0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small
- PR: #16315
v0.54.0-rc18
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12643496109
📦 Uncategorized
- Add buffering to DPRINT
- PR: #15677
- #0: Remove some dead code
- PR: #16084
- Updated installation script
- PR: #16101
- Python -> Python3
- PR: #16063
- Add transpose WH sharded, generalize row major permute when N > 4, and do a minor refactor of ttnn::permute
- PR: #15881
- Adding ND support for tilize/untilize with padding
- PR: #15933
- [Llama3.2-11b vLLM Integration] Add support for paged cross attention, fixes for continuous batching, simplified decode forward call
- PR: #16076
- #0: Enable Local Sweeps and Use a Faster Interprocess Queue
- PR: #16098
- #15601: Implement support for MeshDevice::reshape(..)
- PR: #16029
- Remove setup_core_to_tlb_map
- PR: #16048
- #0: Let sharded_to_interleaved handle interleaved input
- PR: #16116
- #0: separate validation of conv weight and bias.
- PR: #15990
- #0: Minor refactor of pytensor and tensor implementation files
- PR: #16108
- C++ files should not be part of the API of a library
- PR: #16123
- #15857: Forge sweep test
- PR: #15858
- #15857: Unary forge sweep tests
- PR: #15901
- Fix some more namespace pollution caused by using namespace tt::tt_metal
- PR: #16090
- #15713 Bad Eltwise Binary ZEROACC
- PR: #16094
- #15565 Fix unit test to show sharding ttnn.from_torch problems
- PR: #16088
- Fix paged SDPA decode CB sizing issue
- PR: #16059
- Reland async dispatch with workaround for hang.
- PR: #16121
- #16119: Add forge traces to matmul and reduce sweeps
- PR: #16139
- #10034: Binary shift operators
- PR: #16055
- #0: Remove incorrect memory span assert
- PR: #16136
- Add forge sweeps for slice and transpose
- PR: #16112
- #0: Move memory config serialization in the corresponding header away from types.hpp
- PR: #16151
- #16114: Allow Binarized Programs to be Reused across WH Devices
- PR: #16120
- #0: aligning conv2d transpose as conv
- PR: #16128
- support missing cases for sweep tests
- PR: #15804
- #0: added normalization details in the tech report
- PR: #15124
- Fix ttnn.from_torch for 0D/1D tensors with tile layout
- PR: #15882
- Port all Moreh OPs to compute_output_specs
- PR: #16160
- Bump umd to fix grayskull cluster bug
- PR: #16126
- Clean-up the usage of deallocate_activation
- PR: #16099
- llm tech report multi device section
- PR: #16180
- Add prefill v decode section to LLM tech report [section 3.2]
- PR: #15096
- #0: Update eltwise binary to support sharding on arbitrary cores on an arbitrary sub-device grid
- PR: #16024
- [LLM tech report] Add accuracy evaluation and debugging sections
- PR: #15190
- #16165: Disabling test that depends on some machine state to pass
- PR: #16166
- enable dps ops for matmul
- PR: #15285
- Isolate tracy
- PR: #16161
- [TT-Train] Added tests for sum and mean
- PR: #16152
- #16184: Try using ecr to avoid rate limits of docker.io
- PR: #16201
- #15221: Post completion messages to dispatch_s
- PR: #16187
- [TT-Train] Added softmax backward
- PR: #16168
- Optimized FreeList allocator
- PR: #15536
- Set the test data to be relative to the test binary
- PR: #16150
- #0: Fix matmul doc string
- PR: #16208
- #0: remove spammy warning from conftest
- PR: #16198
- Update generating unicast go signal commands to ensure dispatch write linear respects alignment
- PR: #16117
- LLM tech report sections 2.2, 2.5
- PR: #15121
- [TT-Train] Fix tracy deps in the tt-train cmake
- PR: #16209
- Updating Allocator docs to explain first fit usage
- PR: #16214
- Adding asserts for hanging cases in ND tilize/untilize support
- PR: #16170
- Fix ttnn.reallocate when unaligned RM tensors are used
- PR: #16192
- #15891: improve full accuracy and fix full bugs
- PR: #16182
- Revert "Fix ttnn.from_torch for 0D/1D tensors with tile layout (#15882)"
- PR: #16222
- #15857: Skip abs forge for GS
- PR: #16221
- #16213: Use our own forked Docker Run Action that points to ECR
- PR: #16219
- Add max kernel size for each risc type in an op
- PR: #16203
- Infer Conv2dTranspose parameters during model preprocessing
- PR: #16028
- #12662: add keepdim fixes to reduce
- PR: #16163
- Add chunked prefill to Llama family
- PR: #16111
- #15342: Add mirror_kernels option to conv_transpose2d
- PR: #15995
- Update CODEOWNERS
- PR: #16196
- support reduction for 3d & 4d dims
- PR: #16236
- #5605: Only force-stall ethernet programs on earlier ethernet programs
- PR: #16202
- Add full support for creating tensors with logical sharding from python
- PR: #16072
- update llama 3.1 70b v0 tt-metal and vllm commit refs in docs
- PR: #16246
- #15857: Binary Forge Sweep Tests Set2
- PR: #16087
- #14976/#15039: Add Support For ceil_mode=True
- PR: #16124
- Add missing cache invalidates + loads before stores noc optimization for BH
- PR: #16037
- Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
- PR: #16026
- New FD Init Flow
- PR: #15406
- Add support for output sharded embeddings
- PR: #16237
- Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
- PR: #16257
- #0: Enforce tile layout when using bf4/bf8 data types
- PR: #16199
- MeshDevice: Support Quanta Galaxy system file
- PR: #16239
- Move Device members from public to private
- PR: #16256
- Add unary sharded sweeps
- PR: #15300
- #0: Added core_grid offset for sharded layernorm
- PR: #16207
- fix abs path bug for sweeps tests code
- PR: #16285
- #0: Publish TT-Distributed doc under tech_reports
- PR: #16261
- #15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
- PR: #16105
- #16265: Remove creation op
- PR: #16269
- Fix unsigned arithmetic bugs in reshape ops
- PR: #16253
- Fix compile issue for earlier c++ versions
- PR: #16291
- #0: Typo fix in TT distributed tech report
- PR: #16308
- [Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
- PR: #16292
- LLM tech report sections 3.1, 3.4, 3.5
- PR: #15110
- LLM Tech report section 4.4
- PR: #15166
- Move some Device methods to private section
- PR: #16259
- #0: [skip_ci] Update Distributed Tech Report with Discord Server link
- PR: #16314
- #15857: Binary Forge Sweep Tests Set1
- PR: #16042
- #0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
- PR: #16290
- #0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small
- PR: #16315
v0.54.0-rc17
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12624900279
📦 Uncategorized
- Add buffering to DPRINT
- PR: #15677
- #0: Remove some dead code
- PR: #16084
- Updated installation script
- PR: #16101
- Python -> Python3
- PR: #16063
- Add transpose WH sharded, generalize row major permute when N > 4, and do a minor refactor of ttnn::permute
- PR: #15881
- Adding ND support for tilize/untilize with padding
- PR: #15933
- [Llama3.2-11b vLLM Integration] Add support for paged cross attention, fixes for continuous batching, simplified decode forward call
- PR: #16076
- #0: Enable Local Sweeps and Use a Faster Interprocess Queue
- PR: #16098
- #15601: Implement support for MeshDevice::reshape(..)
- PR: #16029
- Remove setup_core_to_tlb_map
- PR: #16048
- #0: Let sharded_to_interleaved handle interleaved input
- PR: #16116
- #0: separate validation of conv weight and bias.
- PR: #15990
- #0: Minor refactor of pytensor and tensor implementation files
- PR: #16108
- C++ files should not be part of the API of a library
- PR: #16123
- #15857: Forge sweep test
- PR: #15858
- #15857: Unary forge sweep tests
- PR: #15901
- Fix some more namespace pollution caused by using namespace tt::tt_metal
- PR: #16090
- #15713 Bad Eltwise Binary ZEROACC
- PR: #16094
- #15565 Fix unit test to show sharding ttnn.from_torch problems
- PR: #16088
- Fix paged SDPA decode CB sizing issue
- PR: #16059
- Reland async dispatch with workaround for hang.
- PR: #16121
- #16119: Add forge traces to matmul and reduce sweeps
- PR: #16139
- #10034: Binary shift operators
- PR: #16055
- #0: Remove incorrect memory span assert
- PR: #16136
- Add forge sweeps for slice and transpose
- PR: #16112
- #0: Move memory config serialization in the corresponding header away from types.hpp
- PR: #16151
- #16114: Allow Binarized Programs to be Reused across WH Devices
- PR: #16120
- #0: aligning conv2d transpose as conv
- PR: #16128
- support missing cases for sweep tests
- PR: #15804
- #0: added normalization details in the tech report
- PR: #15124
- Fix ttnn.from_torch for 0D/1D tensors with tile layout
- PR: #15882
- Port all Moreh OPs to compute_output_specs
- PR: #16160
- Bump umd to fix grayskull cluster bug
- PR: #16126
- Clean-up the usage of deallocate_activation
- PR: #16099
- llm tech report multi device section
- PR: #16180
- Add prefill v decode section to LLM tech report [section 3.2]
- PR: #15096
- #0: Update eltwise binary to support sharding on arbitrary cores on an arbitrary sub-device grid
- PR: #16024
- [LLM tech report] Add accuracy evaluation and debugging sections
- PR: #15190
- #16165: Disabling test that depends on some machine state to pass
- PR: #16166
- enable dps ops for matmul
- PR: #15285
- Isolate tracy
- PR: #16161
- [TT-Train] Added tests for sum and mean
- PR: #16152
- #16184: Try using ecr to avoid rate limits of docker.io
- PR: #16201
- #15221: Post completion messages to dispatch_s
- PR: #16187
- [TT-Train] Added softmax backward
- PR: #16168
- Optimized FreeList allocator
- PR: #15536
- Set the test data to be relative to the test binary
- PR: #16150
- #0: Fix matmul doc string
- PR: #16208
- #0: remove spammy warning from conftest
- PR: #16198
- Update generating unicast go signal commands to ensure dispatch write linear respects alignment
- PR: #16117
- LLM tech report sections 2.2, 2.5
- PR: #15121
- [TT-Train] Fix tracy deps in the tt-train cmake
- PR: #16209
- Updating Allocator docs to explain first fit usage
- PR: #16214
- Adding asserts for hanging cases in ND tilize/untilize support
- PR: #16170
- Fix ttnn.reallocate when unaligned RM tensors are used
- PR: #16192
- #15891: improve full accuracy and fix full bugs
- PR: #16182
- Revert "Fix ttnn.from_torch for 0D/1D tensors with tile layout (#15882)"
- PR: #16222
- #15857: Skip abs forge for GS
- PR: #16221
- #16213: Use our own forked Docker Run Action that points to ECR
- PR: #16219
- Add max kernel size for each risc type in an op
- PR: #16203
- Infer Conv2dTranspose parameters during model preprocessing
- PR: #16028
- #12662: add keepdim fixes to reduce
- PR: #16163
- Add chunked prefill to Llama family
- PR: #16111
- #15342: Add mirror_kernels option to conv_transpose2d
- PR: #15995
- Update CODEOWNERS
- PR: #16196
- support reduction for 3d & 4d dims
- PR: #16236
- #5605: Only force-stall ethernet programs on earlier ethernet programs
- PR: #16202
- Add full support for creating tensors with logical sharding from python
- PR: #16072
- update llama 3.1 70b v0 tt-metal and vllm commit refs in docs
- PR: #16246
- #15857: Binary Forge Sweep Tests Set2
- PR: #16087
- #14976/#15039: Add Support For ceil_mode=True
- PR: #16124
- Add missing cache invalidates + loads before stores noc optimization for BH
- PR: #16037
- Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
- PR: #16026
- New FD Init Flow
- PR: #15406
- Add support for output sharded embeddings
- PR: #16237
- Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
- PR: #16257
- #0: Enforce tile layout when using bf4/bf8 data types
- PR: #16199
- MeshDevice: Support Quanta Galaxy system file
- PR: #16239
- Move Device members from public to private
- PR: #16256
- Add unary sharded sweeps
- PR: #15300
- #0: Added core_grid offset for sharded layernorm
- PR: #16207
- fix abs path bug for sweeps tests code
- PR: #16285
- #0: Publish TT-Distributed doc under tech_reports
- PR: #16261
- #15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
- PR: #16105
- #16265: Remove creation op
- PR: #16269
- Fix unsigned arithmetic bugs in reshape ops
- PR: #16253
- Fix compile issue for earlier c++ versions
- PR: #16291
- #0: Typo fix in TT distributed tech report
- PR: #16308
- [Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
- PR: #16292
- LLM tech report sections 3.1, 3.4, 3.5
- PR: #15110
- LLM Tech report section 4.4
- PR: #15166
- Move some Device methods to private section
- PR: #16259
- #0: [skip_ci] Update Distributed Tech Report with Discord Server link
- PR: #16314
- #15857: Binary Forge Sweep Tests Set1
- PR: #16042
- #0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
- PR: #16290
- #0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small
- PR: #16315
v0.54.0-rc16
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12606309953
📦 Uncategorized
- Add buffering to DPRINT
- PR: #15677
- Revert "#15565 Add unit test to show sharding ttnn.from_torch problems"
- PR: #16086
- [UMD] Removed set_*_params calls and constants
- PR: #15908
- #0: Remove some dead code
- PR: #16084
- Updated installation script
- PR: #16101
- Python -> Python3
- PR: #16063
- Add transpose WH sharded, generalize row major permute when N > 4, and do a minor refactor of ttnn::permute
- PR: #15881
- Adding ND support for tilize/untilize with padding
- PR: #15933
- [Llama3.2-11b vLLM Integration] Add support for paged cross attention, fixes for continuous batching, simplified decode forward call
- PR: #16076
- #0: Enable Local Sweeps and Use a Faster Interprocess Queue
- PR: #16098
- #15601: Implement support for MeshDevice::reshape(..)
- PR: #16029
- Remove setup_core_to_tlb_map
- PR: #16048
- #0: Let sharded_to_interleaved handle interleaved input
- PR: #16116
- #0: separate validation of conv weight and bias.
- PR: #15990
- #0: Minor refactor of pytensor and tensor implementation files
- PR: #16108
- C++ files should not be part of the API of a library
- PR: #16123
- #15857: Forge sweep test
- PR: #15858
- #15857: Unary forge sweep tests
- PR: #15901
- Fix some more namespace pollution caused by using namespace tt::tt_metal
- PR: #16090
- #15713 Bad Eltwise Binary ZEROACC
- PR: #16094
- #15565 Fix unit test to show sharding ttnn.from_torch problems
- PR: #16088
- Fix paged SDPA decode CB sizing issue
- PR: #16059
- Reland async dispatch with workaround for hang.
- PR: #16121
- #16119: Add forge traces to matmul and reduce sweeps
- PR: #16139
- #10034: Binary shift operators
- PR: #16055
- #0: Remove incorrect memory span assert
- PR: #16136
- Add forge sweeps for slice and transpose
- PR: #16112
- #0: Move memory config serialization in the corresponding header away from types.hpp
- PR: #16151
- #16114: Allow Binarized Programs to be Reused across WH Devices
- PR: #16120
- #0: aligning conv2d transpose as conv
- PR: #16128
- support missing cases for sweep tests
- PR: #15804
- #0: added normalization details in the tech report
- PR: #15124
- Fix ttnn.from_torch for 0D/1D tensors with tile layout
- PR: #15882
- Port all Moreh OPs to compute_output_specs
- PR: #16160
- Bump umd to fix grayskull cluster bug
- PR: #16126
- Clean-up the usage of deallocate_activation
- PR: #16099
- llm tech report multi device section
- PR: #16180
- Add prefill v decode section to LLM tech report [section 3.2]
- PR: #15096
- #0: Update eltwise binary to support sharding on arbitrary cores on an arbitrary sub-device grid
- PR: #16024
- [LLM tech report] Add accuracy evaluation and debugging sections
- PR: #15190
- #16165: Disabling test that depends on some machine state to pass
- PR: #16166
- enable dps ops for matmul
- PR: #15285
- Isolate tracy
- PR: #16161
- [TT-Train] Added tests for sum and mean
- PR: #16152
- #16184: Try using ecr to avoid rate limits of docker.io
- PR: #16201
- #15221: Post completion messages to dispatch_s
- PR: #16187
- [TT-Train] Added softmax backward
- PR: #16168
- Optimized FreeList allocator
- PR: #15536
- Set the test data to be relative to the test binary
- PR: #16150
- #0: Fix matmul doc string
- PR: #16208
- #0: remove spammy warning from conftest
- PR: #16198
- Update generating unicast go signal commands to ensure dispatch write linear respects alignment
- PR: #16117
- LLM tech report sections 2.2, 2.5
- PR: #15121
- [TT-Train] Fix tracy deps in the tt-train cmake
- PR: #16209
- Updating Allocator docs to explain first fit usage
- PR: #16214
- Adding asserts for hanging cases in ND tilize/untilize support
- PR: #16170
- Fix ttnn.reallocate when unaligned RM tensors are used
- PR: #16192
- #15891: improve full accuracy and fix full bugs
- PR: #16182
- Revert "Fix ttnn.from_torch for 0D/1D tensors with tile layout (#15882)"
- PR: #16222
- #15857: Skip abs forge for GS
- PR: #16221
- #16213: Use our own forked Docker Run Action that points to ECR
- PR: #16219
- Add max kernel size for each risc type in an op
- PR: #16203
- Infer Conv2dTranspose parameters during model preprocessing
- PR: #16028
- #12662: add keepdim fixes to reduce
- PR: #16163
- Add chunked prefill to Llama family
- PR: #16111
- #15342: Add mirror_kernels option to conv_transpose2d
- PR: #15995
- Update CODEOWNERS
- PR: #16196
- support reduction for 3d & 4d dims
- PR: #16236
- #5605: Only force-stall ethernet programs on earlier ethernet programs
- PR: #16202
- Add full support for creating tensors with logical sharding from python
- PR: #16072
- update llama 3.1 70b v0 tt-metal and vllm commit refs in docs
- PR: #16246
- #15857: Binary Forge Sweep Tests Set2
- PR: #16087
- #14976/#15039: Add Support For ceil_mode=True
- PR: #16124
- Add missing cache invalidates + loads before stores noc optimization for BH
- PR: #16037
- Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
- PR: #16026
- New FD Init Flow
- PR: #15406
- Add support for output sharded embeddings
- PR: #16237
- Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
- PR: #16257
- #0: Enforce tile layout when using bf4/bf8 data types
- PR: #16199
- MeshDevice: Support Quanta Galaxy system file
- PR: #16239
- Move Device members from public to private
- PR: #16256
- Add unary sharded sweeps
- PR: #15300
- #0: Added core_grid offset for sharded layernorm
- PR: #16207
- fix abs path bug for sweeps tests code
- PR: #16285
- #0: Publish TT-Distributed doc under tech_reports
- PR: #16261
- #15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
- PR: #16105
- #16265: Remove creation op
- PR: #16269
- Fix unsigned arithmetic bugs in reshape ops
- PR: #16253
- Fix compile issue for earlier c++ versions
- PR: #16291
- #0: Typo fix in TT distributed tech report
- PR: #16308
- [Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
- PR: #16292
- LLM tech report sections 3.1, 3.4, 3.5
- PR: #15110
- LLM Tech report section 4.4
- PR: #15166
- Move some Device methods to private section
- PR: #16259
- #0: [skip_ci] Update Distributed Tech Report with Discord Server link
- PR: #16314
- #15857: Binary Forge Sweep Tests Set1
- PR: #16042
- #0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
- PR: #16290
- #0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small
- PR: #16315