Releases: tenstorrent/tt-metal
v0.44.0
📦 Uncategorized
- Update CreateBuffer to return shared_ptr, and Enqueue R/W buffer to accept std::shared_ptr
- PR: #5154
- #4794: Implement DownBlock2D using ttnn for stable_diffusion model
- PR: #5091
- #4797: Implement BasicTransformerBlock sub-module using ttnn for stab…
- PR: #5081
- #0: write cluster config for FD mode, non tunneling cores as well
- PR: #5161
- Update bw test, change mulsi calls to use *
- PR: #5149
- #3003: updated tt-lib documentation
- PR: #5179
- #0: Update to v0.44.0
- PR: #5188
- #4003: added ability to trace ttnn operations using torchtrail library
- PR: #5135
- Support moreh logsoftmax
- PR: #4961
- #4614: gitmodules: Use https URLs for submodules
- PR: #5183
- #0: add reviewers to frequently touched ops docs file
- PR: #5190
- backward ops - hypot and atan2
- PR: #5045
- #4885: Move program device map to program
- PR: #5193
- #4858: Add support for float to int typecast
- PR: #5058
- Matmul_block on a smaller grid size
- PR: #5170
- Revert "#0: Add support for typecast float to int"
- PR: #5199
- Add dst ethernet router support and remote command processor to accept FD packets on remote chip
- PR: #5102
- Falcon40B TT Implementation
- PR: #5046
- #5198: Fix moreh softmax related bug
- PR: #5200
- #0: skip MOREH Softmax tests from main
- PR: #5202
- #3122: Use device grid size in falcon_attention to be generic...
- PR: #5207
- #0: Add assertions for interleaved tensors for ops that don't support sharding
- PR: #5195
- #5169: Add activation ops to ttnn
- PR: #5217
- #3003: add duration to the ttnn operation nodes when TTNN_ENABLE_LOGGING=1 is used to compile the code
- PR: #5201
- #5027: Optimize group attn matmul for Falcon40B decode
- PR: #5127
- #0: add documentation about managing documentation
- PR: #5227
- Adding docs for maxpool, avg pool and upsample
- PR: #5223
- Revert "#0: skip MOREH Softmax tests from d5811b7…
- PR: #5228
- #5165: Add hyperbolic ops to ttnn
- PR: #5166
- #4866: Add grayskull open source llk-library
- PR: #5136
- #5002: simplified preprocessing of CNNs using preprocess_model
- PR: #5181
- Create GroupNorm sharded in TTNN
- PR: #5221
- #5097: Support for dedicated completion queue thread
- PR: #5098
- upsample test calculate grid
- PR: #5238
- fix for sharded allocator when num banks == num cores
- PR: #5229
- MHA tutorial interactive notebook with diagrams
- PR: #5239
- #4003: Adding a profile tutorial
- PR: #5242
- #0: Added non-blocking read stress test
- PR: #5243
- Revert "MHA tutorial interactive notebook with diagrams"
- PR: #5245
- #0: Update all_gather to work for multi_link. Update falcon-40b to use 2 links for all gathers
- PR: #5214
- #5142: Remove slow dispatch mode from working sweeps
- PR: #5146
- #3003: fixed the input tensor documentation
- PR: #5255
- #0: Temp slower resnet VM run
- PR: #5256
- throw on fast dispatch for to_host_sharded as it's not supported
- PR: #5264
- #5253: Fix kv_past_len being passed in to rotary embedding for falcon models
- PR: #5254
- #5233: started adding ttnn_functional_resnet
- PR: #5240
- #3003: updated ttnn documentation to explain what features it has over tt_lib. Added standalone examples of basic usage of ttnn
- PR: #5265
- #0: Speedup incremental builds
- PR: #5251
- #0: Change setup.py to be git worktree friendly
- PR: #5234
- MHA tutorial interactive notebook with diagrams
- PR: #5277
- #3003: disable tutorial 6 from running as the unit test
- PR: #5278
- Agrebenisan/non blocking tensor reads
- PR: #5244
- #5275: CODEOWNERS: update to include files relevant for ttnn team
- PR: #5276
- Fix an intermittent launch message transfer error
- PR: #5152
- Revert "MHA tutorial interactive notebook with diagrams"
- PR: #5282
- #0: add parens in LLK doc
- PR: #5283
- #3003: only unit test tutorials that work on pipelines
- PR: #5291
- #5246: Add unary math ops to ttnn
- PR: #5259
- Vignesh/stable diffusion ttnn basic transformer block fix
- PR: #5211
- #4854: Implement attention and rms_norm sub-module using ttnn for mis…
- PR: #5175
- #4795: Add upblock2d to functional stable diffusion model
- PR: #5085
- #4796: Implement Transformer2DModel using ttnn for stable_diffusion m…
- PR: #5092
- #0: Adding llk wormhole_b0 submodule
- PR: #5262
- #4003: Adding pybind11 to ttnn
- PR: #5236
- #5296: Fix broken link to host_api.hpp in README.md
- PR: #5297
- #0: Fix bug with the way we were measuring bert inference time
- PR: #5312
- #0: Change local tt_lib._C module install from symlink to copy
- PR: #5292
- #5233: added ability to fold batch_norm2d into conv2d
- PR: #5317
- #5222: replace hex8_to_hex32.py with cpp to shave off some compile time -temporary fix
- PR: #5220
- Enable tests for WHB0
- PR: #5307
- #5137: Cleanups for newer Linux distro / toolchains
- PR: #5162
- #5233: implemented support for converting all Resnet-18 modules using preprocess_model function
- PR: #5325
- #3003: fix model preprocessing bug
- PR: #5332
- #4799: Implement CrossAttnDownBlock2D sub-module using ttnn for stabl…
- PR: #5086
- #4800: Implement UNetMidBlock2DCrossAttn using ttnn for stable_diffus…
- PR: #5093
- #4798: Add ttnn cross attn upblock2d in functional stable diffusion m…
- PR: #5089
- #4801: Implement Unet 2D Condition model using ttnn for stable_diffus…
- PR: #5119
- #4965: Rename Conv2D to Conv2d and MaxPool2D to MaxPool2d to match torch
- PR: #5219
- #0: Remove departed team member from CODEOWNERS
- PR: #5340
- #0: add to codeowners
- PR: #5339
- #5314: Only stall on first scheduled read after commands with side effects
- PR: #5315
- #4965: fix bad rebase
- PR: #5342
- #0: Add more instructions for dispatching workflow actions and a note about skipping git hooks
- PR: #5345
- Update optimized Bert to support WH grid sizes, add sharding support for RMSNorm
- PR: #5308
- #4642: create gtest_smoke as a sanity test suite
- PR: #5112
- #5341: context switch if eth txq is full
- PR: #5347
- #5323: Convolutions of small size fail during parallelization calculations
- PR: #5324
- Npetrovic/transformer softmax
- PR: #5298
- Fix groupnorm for narrow channels
- PR: #5320
- #4862: added more tests for ttnn bloom. Update optimized ttnn bert to match the structure of non-optimized ttnn bert
- PR: #5336
- #0: Add an envvar parser with value detection and default value setti…
- PR: #5367
- #4732: Clean up compute kernel apis
- PR: #5316
- #5318: Modify Falcon7B to use attn_matmul for wormhole
- PR: #5322
- #0: make logLocationsRecord a static function
- PR: #5351
- #5233: run convs with auto-format
- PR: #5364
- #5377: Avoid segfault by checking buffer !null before getting device
- PR: #5381
- Alex/metal/pack untilize b0
- PR: #5378
- #4487: Support block sharding in upsample
- PR: #5361
- #5359: update python package transformers + dependencies to include Falcon
- PR: #5360
- #3708: Add support for LN having gamma/beta in bfp8
- PR: #5376
- #4003: Skip sweep tests if not available
- PR: #5392
- #4003: use faster TMs in optimized ttnn whisper
- PR: #5384
- #4732: Clean up compute_kernel_api
- PR: #5375
- More optimizations for group_attn_matmul
- PR: #5385
- #5233: updated resnet18 to run residual connections
- PR: #5390
- #3003: added more meaningful errors to ttnn. Updated getitem to run on device in the cases when it can
- PR: #5403
- #5233: simplified the logic in tracer
- PR: #5370
- #3003: include ttl operations and necessary types under ttnn.ttl
- PR: #5405
- #0: Add note about no merge commits in main
- PR: #5349
- #0: Add timeout in profiler regression workflow
- PR: #5355
- codeowners update
- PR: #5407
- #5365: Add device argument to determine grid size based on target
- PR: #5366
- disable whisper until further investigation, see issue #5430
- PR: #5431
- #3003: fixed ttnn convs
- PR: #5432
- #3886: Fix build error for C++ tests in debug mode
- PR: #5434
- #4954: Support depth 32 in maxpool writer
- PR: #4956
- #0: Pass output cb to pack init functions
- PR: #5418
- #0: skipping DeviceLoadBlankKernels on remote devices
- PR: #5437
- #5359: transformers: update version and relax pcc asserts
- PR: #5421
- #3003: guidelines for adding new op
- PR: #5440
- Don't assume user has one entry in their $PYTHONPATH
- PR: #5250
- FP32 tensor support for matmul
- PR: #5414
- #3003: updated tutorial 001 to describe the tensor more comprehensively before showing the add
- PR: #5441
- Onboard additional metal code owners
- PR: #5445
- #5402: Add redesigned host-side sw command queue, it can be configured i…
- PR: #5382
- #3003: fixed docs
- PR: #5455
- Alex/metal/enable conv tests on b0
- PR: #5425
- #5356: git bisect script to find broken commits
- PR: #5348
- #0: Update data_format.cpp file
- PR: #5399
- Add skip to full grid matmul whb0
- PR: #5461
- #3003: simplified the logic in ttnn/operations/matmul.py. Added dataclasses instead of tuples for CoreGrid and ShardShape
- PR: #5450
- #5204: adding moreh's test suite. removing an absolute assertion.
- PR: #5373
- Npetrovic/lt gt ne fix
- PR: #5304
- #0: Move device id attribute from tensor to DeviceStorage
- PR: #5467
- #3003: fixed scheduled pipeline
- PR: #5466
- Npetrovic/transformer concat sweeps ttnn
- PR: #5208
- #3003: added support for running ttnn.matmul using 1D_systolic_array. Also, added support for passing in the program config directly
- PR: #5468...
v0.43.0
📦 Uncategorized
- #4668: Yolov5 GS Demo Benchmarking
- PR: #4776
- #0: uplift umd; pick up fix for n150 cluster
- PR: #4881
- #3178: Fix for wormhole b0 reduce w
- PR: #4882
- #4489: fixed bugs in the program caching of eltwise unary and eltwise binary. Updated bloom to use L1 memory config
- PR: #4842
- #4821: Add cumsum op to tt_dnn
- PR: #4824
- Dispatch/Bandwidth tests
- PR: #4783
- #4003: fixed test_eltwise_unary_op
- PR: #4901
- Argmax and Argmin Support
- PR: #4779
- #3212: softmax works after reduce fix of max, sum, etc. for WHB0
- PR: #4907
- #0: (MINOR) Update version to v0.43.0
- PR: #4910
- #4761: Add call to ttl repeat_interleave and also provide script for …
- PR: #4891
- #4003: fixed the bug with printing the compile-time attributes
- PR: #4918
- Support moreh arange
- PR: #4921
- Remove skip_for_wormhole_b0 for test_moreh_softmax and test_moreh_softmin
- PR: #4924
- #4541: remove unpad start at 0 limitation
- PR: #4566
- Agrebenisan/restart cmd fix
- PR: #4922
- Support moreh SGD
- PR: #4929
- #0: Use fetch-depth: 0 instead of fetch-tags because otherwise git complains of commit SHA/tag conflict
- PR: #4934
- #0: Add code owners for primary operations api binding
- PR: #4936
- #4547: Add 2x2 window unit tests to ttnn maxpool
- PR: #4909
- #4003: restructure ttnn
- PR: #4902
- #4889: Change TileSlice printing to only print tile data
- PR: #4912
- #4836: Add support for blocking conv activation in 2d systolic conv v…
- PR: #4837
- #0: Update unicast cycles lower bound
- PR: #4937
- #4904: Add support for 1d width sharded LN
- PR: #4905
- #4941: Convert command header to struct for easier maintainability
- PR: #4942
- #4823: enable sum_0 operation fails with low PCC [Wormhole,Grayskull]
- PR: #4955
- Fix sharded buffers for one core in fast dispatch
- PR: #4944
- #4906: global reduce sum, mean, max, min operations added
- PR: #4908
- Revert "#4823: enable sum_0 operation fails with low PCC [Wormhole,GS]
- PR: #4963
- #0: Change codeowners from specific op binding files/dirs to all tt_lib bindings
- PR: #4938
- #4003: split unary sweep into per op sweeps
- PR: #4952
- #4232: added support for converting from numpy arrays to ttnn tensors. Borrow data whenever possible when converting from numpy/torch
- PR: #4893
- Uplift AttnMatmul to support GroupAttnMatmul
- PR: #4913
- Add watcher-specific CI tests
- PR: #4919
- #4916: Add avg pool to ttnn
- PR: #4917
- #0: Add a lock on DPRINT server raise/wait structures
- PR: #4920
- #4967: added validation for input tensors
- PR: #4977
- #4971: update documentation by a new doc hierarchy;
- PR: #4983
- #0: Leftover decorate_operation replacement for avg pool
- PR: #4987
- #4899: fix the permute to operate on the intended shape
- PR: #4951
- #4730: Add tt_lib.tensor.concat
- PR: #4990
- Aliu/enqueue eth
- PR: #4845
- #4003: Updating functional performance from changes in ttnn.permute w…
- PR: #4991
- #4984: Remove dead OP_INFO and graph interpreter
- PR: #4985
- #4878: initial commit to add Conv parameters to ttnn.preprocess_model_parameters
- PR: #4966
- Update Program Hashes for Ops using Mem config
- PR: #4953
- #4984: Remove unused dprint functionality
- PR: #5000
- Aliu/ci fix
- PR: #5001
- #4215: Add Argmax and Argmin Fallback
- PR: #4928
- #4999: added input tensor validation to add, sub and mul operations.
- PR: #5004
- Support for softmax rm major sharding and causal mask sharding
- PR: #5006
- #0: provide API for where() to support scalar True/False branches
- PR: #4988
- #5003: Update expected compile and runtimes for perf regression on VM
- PR: #5008
- Revert "Update Program Hashes for Ops using Mem config"
- PR: #5021
- #4931: add apis to get ethernet by socket ids
- PR: #4932
- #4786: Add upsample_nearest2d functional stable diffusion
- PR: #4870
- #4986: deploy docs only to main and enable devs to run docs build on different pages
- PR: #5020
- Deploy ttnn sweeps results to docs
- PR: #5019
- #4958: Move all python api unit tests to frequent in order to reduce SD pipeline length
- PR: #4981
- #4999: Added input validation for ttnn.matmul and ttnn.linear. Add unit test for linear operation. Update input tensor validation in binary.py. Fix compute_output_shapes in bmm_op.cpp
- PR: #5010
- #4620: Fix+improve bw test
- PR: #5029
- #4852: Add unit tests for functional bloom
- PR: #5013
- #5032: scalar argument versions for relops
- PR: #5018
- #0: Add some README recommendations from MCW to clarify issue about access to internal workflows VM installation page
- PR: #5034
- #4790: Implement GEGLU using ttnn for stable_diffusion model
- PR: #4869
- #4999: Adding validation checks
- PR: #5011
- #4791: Implement Feedforward sub-module using ttnn for stable_diffusi…
- PR: #4868
- Npetrovic/bw ops sweeps
- PR: #5009
- #4999: update documentation of ttnn operations to include the validation schema
- PR: #5031
- #0: Remove model run from frequent_api_pipeline per @tt-rkim
- PR: #5043
- Minor dprint/watcher cleanup
- PR: #5030
- #4858: Add support for typecast
- PR: #4840
- #0: Disable dprint tests because they're flaky at the moment
- PR: #5026
- #4946: Add trig ops to ttnn
- PR: #5041
- Nshanker/convs split by 2
- PR: #5042
- #4946: Add inv trig ops to ttnn
- PR: #5038
- #4003: fixed circular dependency in decorators
- PR: #5052
- #5054: Removed asserts from conv op host code that are not required. …
- PR: #5055
- #4003: fixed circular dependencies in ttnn
- PR: #5061
- #4852: Fix CI pipeline by re-enabling functional bloom for causal LM
- PR: #5060
- GroupNorm Sharded support
- PR: #4945
- #4972: is_sharded and memory_config is free from tensor
- PR: #4980
- #0: eltwise ops/activate operator tracking for GS, and WHB0
- PR: #5074
- Aliu/fd tunneling pr
- PR: #4725
- #4642: Converted 14 old cpp tests to use gtest, with capabilities to switch btwn FD/SD when possible
- PR: #5050
- #4852: Add tests for functional ttnn bloom implementation.
- PR: #5078
- #4003: correctly convert all parameters of torch module to ttnn parameters
- PR: #5100
- #5082: Pow gradient calculation method is different with pytorch
- PR: #5106
- Argmax/Argmin support for channel, batch and all dim
- PR: #5040
- #4420: switch to shared_ptr
- PR: #5123
- #4420: return shared_future from taskflow async wrapper
- PR: #5121
- Minor DPrint fixes
- PR: #5108
- #0: Enable/disable clearing L1 from env var
- PR: #5107
- #4003: started moving ttnn operation to C++
- PR: #5111
- #4003: Add script to help with finding issues that we need approval for
- PR: #5129
- #5044: Adding support for optional output tensors
- PR: #5104
- #4003: Adding the open flag to show only open PRs
- PR: #5134
- #5048: Add CreateDevices and CloseDevices api to detail
- PR: #5118
- decouple ClearProgramCache from CommandQueue
- PR: #5124
- Conv fixes for padding input channels. Shallow conv fixes. Conv input/output autoformatting. Cleanup
- PR: #5109
- Asarje/mp unpack tilize fused
- PR: #5033
- Update CreateBuffer to return shared_ptr, and Enqueue R/W buffer to accept std::shared_ptr
- PR: #5125
- #5137: Cleanups for newer Linux distro / toolchains
- PR: #5114
- Revert "#5137: Cleanups for newer Linux distro / toolchains"
- PR: #5139
- Revert "Update CreateBuffer to return shared_ptr, and Enqueue R/W buffer to accept std::shared_ptr"
- PR: #5138
- #4793: Implement ResnetBlock2D using ttnn for stable_diffusion model
- PR: #5084
- #4788: Implement Downsample2D using ttnn for stable_diffusion model
- PR: #5090
- #4792: Implement CrossAttention sub-module using ttnn for stable_diff…
- PR: #4927
- #4747: Reduce amount of samples in bert sweeps
- PR: #5140
- #4789: Add upsample2d to functional_stable_diffusion model
- PR: #5080
- #0: Add fix for lamb optimizer
- PR: #5144
- #5057: Add relational ops support to TTNN
- PR: #5120
- skip eth test suite on GS
- PR: #5155
- #4003: updated ttnn.Tensor to be derived from ttl.tensor.Tensor
- PR: #5130
- Asarje/shwetank upsample
- PR: #5105
- #5082: power gradient is erroneous when exponent is in range (0-1)
- PR: #5158
v0.42.0
📦 Uncategorized
- Syrmia/new sweeps
- PR: #4390
- Update test sweeps for the system memory input buffer
- PR: #4245
- #4181: Add bfloat8_b dtype fix for tests that should support bfloat8_b
- PR: #4207
- #4343: Add new op sweeps for GS and WH
- PR: #4408
- #0: (MINOR) Update to v0.42.0
- PR: #4714
- #4311: Automate determining and scheduling RC generation
- PR: #4713
- Jedi main
- PR: #4690
- #0: Remove path appends from test files
- PR: #4715
- #4003: Adding padding for whisper
- PR: #4578
- #4632: Add dprint server support for eth cores
- PR: #4709
- #4003: added ttnn.group_norm
- PR: #4727
- #4003: added ttnn.silu
- PR: #4731
- #3999: move fallback_ops.silu -> tt_lib.tensor.silu
- PR: #4728
- #4683: Support tracing
- PR: #4656
- #0: Patch for bad state reached when enqueuing trace
- PR: #4735
- Nshanker/remove pow of 2 req for channels size
- PR: #4693
- #4003: added ttnn.pad
- PR: #4733
- #4730: Adding ttnn.concat as fallback
- PR: #4738
- #4003: added ttnn.split
- PR: #4737
- Syrmia/ttnn sweeps
- PR: #4579
- #4347: Move VGG tensors to L1
- PR: #4498
- #4670: Add end to end demo for functional roberta model
- PR: #4718
- #4431: mnist gs_demo benchmark
- PR: #4502
- #4623: lenet gs demo benchmarking [Pending CI]
- PR: #4634
- #4720: Improve folder structure of broken sweep tests
- PR: #4721
- Adding interface to assign dispatch kernels to dispatch functionality and adding kernel to service remote command queue
- PR: #4615
- #4003: Fixing whisper pcc in last layer
- PR: #4753
- #4003: updated ttnn unit tests to assert using higher PCC thresholds
- PR: #4762
- #4761: Adding fallback for repeat_interleave
- PR: #4767
- #4003: simplified the logic in to_layout
- PR: #4766
- #4003: added ttnn.log
- PR: #4769
- #4003: updated ttnn.to_layout and ttnn.pad to do the right thing with padded shape
- PR: #4770
- #0: Fix reference to Python integration test in README
- PR: #4784
- #0: As a quick fix for now, source /etc/rc.local to re-insert number of hugepages back in after starting weka service in perf pipelines
- PR: #4807
- #4003: updated model names
- PR: #4771
- #4617: Matmul went to 0.9998887677925289 with float comparison to torch
- PR: #4812
- #0: Fix bad access to memconfig/device when input tensors are on host
- PR: #4716
- #4503: Demo for functional bloom
- PR: #4554
- #4611: Add end to end test for ViT model with ImageNet data
- PR: #4749
- #4506: SSD gs demo benchmarking
- PR: #4585
- #4504: Add end to end demo for functional t5 model
- PR: #4649
- #4557: Uplift swin model to resolve errors in tests & Add test_perf_accuracy...
- PR: #4774
- #4556: Roberta gs demo benchmarking
- PR: #4627
- #3974: nanogpt uplift and move weights to weka path
- PR: #4221
- #4610: EfficientNet gs demo benchmark
- PR: #4633
- #4003: added more sweeps
- PR: #4813
- #4231: Fine-tune the unary ops for add, sub, div, mul binops with one scalar constant arg
- PR: #4768
- #516: Sanity check tracy artifact generation
- PR: #4545
- #4003: fixed crashing sweep tests
- PR: #4829
- #0: Update get_semaphore to return 16B aligned semaphore addresses
- PR: #4820
- #0: Add tracy dependencies to github actions runner workflows
- PR: #4835
- #4730: Add sweep test for ttnn.concat
- PR: #4830
- Update ops for sharding used in falcon 40b
- PR: #4806
- #4833: Create initial ttnn sweeps with csv artifact upload
- PR: #4834
- #4003: debugging whisper
- PR: #4746
- #4003: Setting __all__ = [] to block wildcard imports
- PR: #4832
- TTNN Sharded tensor support
- PR: #4597
- #3662: Impl moreh_clip_grad_norm
- PR: #4743
- #4609: Deit gs demo benchmarking
- PR: #4628
- #4741: Add sum op to tt_dnn
- PR: #4744
- #4622: Yolov3 GS demo Benchmarking
- PR: #4719
- #0: Add weka mount + force hugepage mount with /etc/rc.local in frequent pipelines
- PR: #4827
- #0: Reduce timeout of multi queue single device FD post commit
- PR: #4850
- #4003: Make ttnn sweep tests available from pytest
- PR: #4819
- Add MaxPool2d to ttnn
- PR: #4831
- Ttnn 4761 add sweep for repeat interleave
- PR: #4841
- #0: Remove checkout secret
- PR: #4856
- #4847: Error out when there are insufficient num hugepages
- PR: #4860
- simpler hugepage check
- PR: #4839
- Revert "#4839: simpler hugepage check"
- PR: #4865
- #4862: Disable test_moreh_clip_grad_norm_with_error_if_nonfinite
- PR: #4867
- #4374: Benchmarking for bloom TT model
- PR: #4772
- #4505: Add end to end demo for functional bert model
- PR: #4582
- #4003: updated documentation
- PR: #4876
- #4003: updated concat operation to raise an exception if the dimension is out of range
- PR: #4853
- #0: Loosen models perf tolerance for GS
- PR: #4879
- #0: Add more instructions on syseng assets installation + direct users to additional hugepages setup if needed for cloud VMs
- PR: #4884
- #4815: New restart command which safely resets a command queue into a starting state
- PR: #4816
- Revert "#4815: New restart command which safely resets a command queue into a starting state"
- PR: #4887
v0.41.0
Metal
API Changes
- tt::tt_metal::detail::GLOBAL_CQ replaced with tt::tt_metal::detail::GetCommandQueue(Device *device)
- New num_hw_cqs parameter to specify underlying number of HW CQs for a given Device: Device *CreateDevice(chip_id_t device_id, const uint8_t num_hw_cqs = 1, const std::vector<uint32_t>& l1_bank_remap = {});
Tools
Profiler
- Integrated Tracy host-side CLI capture and csv report generation with metal’s profiler infrastructure
- Added support for device profiling on ethernet cores for Wormhole systems.
ttNN
Infrastructure
- Updated ttnn documentation with visualizations and examples
- Added padded shape to ttnn
- Renamed ttnn.nlp to ttnn.transformer
- Updated ttnn.transformer.split_query_key_value_and_split_heads to handle most shapes, multi head query and cases when key_value_states are used to compute key and value
- Added ttnn.rms_norm
- Added ttnn.Shape and exposed support for padded shape. Simplified broadcasting and reduction operations
- Moved ttnn.Tensor to C++
- Added debug decorator for ttnn operations
Operations
- Layer operators layernorm, conv, softmax were optimized for multi-core computation; model specific operators for Falcon7B were also added.
- The operator normalize_global was added to the tt_lib.tensor namespace; this transforms the tensor by normalizing elements to the mean and standard deviation of the entire tensor.
- The operator lamb_optimizer was added to the tt_lib.tensor namespace to help with computing the back-propagation algorithm and weight update for DNN in the training loop.
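As a rough illustration of what normalizing "to the mean and standard deviation of the entire tensor" means, the following plain-Python sketch computes a single mean and standard deviation over every element and rescales each element with them. It is a reference for the arithmetic only, not the tt_lib implementation: the name normalize_global_reference, the nested-list input, and the use of the population (not sample) standard deviation are all assumptions made for this example.

```python
import math


def flatten(nested):
    """Yield every scalar in an arbitrarily nested list."""
    for item in nested:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item


def map_nested(fn, nested):
    """Apply fn to every scalar while preserving the nesting structure."""
    return [map_nested(fn, item) if isinstance(item, list) else fn(item)
            for item in nested]


def normalize_global_reference(tensor):
    """Hypothetical reference: (x - mean) / std, with mean and std taken
    over all elements of the tensor rather than per row or per channel."""
    flat = list(flatten(tensor))
    mean = sum(flat) / len(flat)
    # Assumption: population variance (divide by N); the real op may differ.
    variance = sum((x - mean) ** 2 for x in flat) / len(flat)
    std = math.sqrt(variance)
    return map_nested(lambda x: (x - mean) / std, tensor)


tensor = [[1.0, 2.0], [3.0, 4.0]]
normalized = normalize_global_reference(tensor)
```

After this transformation the elements of normalized, taken together, have mean 0 and standard deviation 1. A constant tensor has zero standard deviation, which this sketch does not guard against.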
The following backward operators, for use with back-propagation training loop, have been added to tt_dnn library; they are accessible with suffix _bw in the tt_lib.tensor namespace.
1. abs
2. add
3. addalpha
4. addcdiv
5. addcmul
6. binary_assign
7. binary_le
8. clamp
9. clamp_max
10. clamp_min
11. div
12. exp
13. fill
14. fill_zero
15. gt
16. log
17. lt
18. max
19. min
20. mul
21. ne
22. neg
23. relu
24. rsqrt
25. rsub
26. sigmoid
27. sqrt
28. sub
29. tan
30. tanh
31. unary_add
32. unary_assign
33. unary_div
34. unary_mul
35. unary_pow
36. unary_sub
37. where
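To make the role of a backward operator concrete, here is a minimal plain-Python sketch of the math that two of the listed operators (mul and tanh) correspond to: each takes the upstream gradient plus the forward inputs and returns the gradient with respect to those inputs. The function names, argument order, and list-based inputs below are hypothetical and chosen for illustration; these notes do not describe the actual tt_lib.tensor signatures.

```python
import math


def mul_bw_reference(grad, input_a, input_b):
    """Gradients of elementwise a * b with respect to a and to b."""
    grad_a = [g * b for g, b in zip(grad, input_b)]
    grad_b = [g * a for g, a in zip(grad, input_a)]
    return grad_a, grad_b


def tanh_bw_reference(grad, input_x):
    """Gradient of elementwise tanh(x): grad * (1 - tanh(x)^2)."""
    return [g * (1.0 - math.tanh(x) ** 2) for g, x in zip(grad, input_x)]


grad = [1.0, 0.5]   # upstream gradient flowing back from the loss
a = [2.0, 3.0]
b = [4.0, 5.0]

grad_a, grad_b = mul_bw_reference(grad, a, b)   # ([4.0, 2.5], [2.0, 1.5])
tanh_grad = tanh_bw_reference(grad, a)
```

In a training loop, gradients like these are chained from the loss back through each forward operator before the optimizer applies the weight update.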
Models
- Added ttnn implementation for Roberta, Whisper, T5-small, and flan-T5-small
- Updated ttnn implementation of Bloom to work with L1 memory, and cleaned up ttnn implementation of BERT
- Updated Mistral implementation to use tilized tensors and operations
- Updated VGG model to load pre-tilized weight tensors and use tilized tensors
- Added benchmarking demo for DistilBert and T5 using SQuAD dataset for question answering
v0.40.0
📦 Uncategorized
- Opt LN_sharded and SMX_sharded
- PR: #4147
- #1919: Turn existing allocator tests into gtests
- PR: #4218
- Agrebenisan/fd perf opt
- PR: #4219
- #3932: Rename unary op args which were input_a -> input, binary ops from input, other -> input_a, input_b
- PR: #4194
- #3971: Fix TSLICE printing truncation when hitting MAX_COUNT
- PR: #4159
- #0: Fix undefined variable error when running with watcher
- PR: #4256
- #4141: Add GetPreferredNOCForDRAMRead, GetPreferredNOCForDRAMWrite and update all ops to use these apis
- PR: #4184
- #3420: fix eth core init L1 bug
- PR: #4262
- #0: Add ttnn founding engineers as CODEOWNERS of functional models
- PR: #4265
- #0: Commonize logic between E2E and device perf functions/scripts. Enable assertions for device perf scripts/ci
- PR: #4248
- Issue 4073: Fix for host-side hanging when an invalid DPRINT WAIT command is running on the device.
- PR: #4103
- #0: Add tt-rkim as CODEOWNERS for setup_hugepages.py
- PR: #4266
- #4003: implemented functional t5 model
- PR: #4241
- #3003: commonized variable names across ttnn tests. Removed ttnn.experimental. Added ttnn.unary and commonized the import of ttl unary ops
- PR: #4268
- #0: Delete extra text in first docs page about being added to repo
- PR: #4295
- write watcher log to built/ folder rather than kernel subfolder
- PR: #4291
- Add Batch>1 fix for matmul blocking API
- PR: #4296
- #4231: improve unary add, sub, mul and div implementation in SFPU. Add complex polar operator
- PR: #4257
- #3493: sharded tensor support
- PR: #3790
- REVERT #4231: Fine-tune the unary ops to improve performance
- PR: #4312
- #0: Move setup_hugepages.py to release assets
- PR: #4264
- #0: (MINOR) Update VERSION to 0.40.0
- PR: #4315
- #4301: Fix link to announcements in README
- PR: #4317
- #4301: Replace some more instances of Metal w/ Metalium in docs
- PR: #4320
- Llk refactor uplift
- PR: #3908
- #0: Fix TT-Metalium docs link in get_performance.rst
- PR: #4323
- #0: uplift in device code
- PR: #4299
- #4176: uplift umd plus tt_metal changes
- PR: #4333
- init fw once
- PR: #4335
- Merge v2 of untilize_with_halo, maxpool, and conv ops for Resnet-50
- PR: #4325
- Backward ops for Metalium - part-2
- PR: #4322
- #4211: Assert that hugepages number is greater than or equal to required, rather than equal to
- PR: #4381
- Update resnet readme
- PR: #4367
- Add Run Instructions for BERT_large sharded in readme
- PR: #4366
- Add batch 20 for resnet-50
- PR: #4371
- #4376: Support mixed precision for eltwise binary with prescaling
- PR: #4387
- Increase timeout of slow dispatch unit tests and switch to Y_M_D format for ops logs
- PR: #4397
- #0: point umd to main, cosmetic change
- PR: #4396
- New tilize and straightforward vec gen in matmul kernel examples
- PR: #4261
- #4216: Enable DPrint slow dispatch testing
- PR: #4326
- #4376: Call llk reconfig functions in compute kernel apis for WH
- PR: #4393
- #4336: #4386: Fix interleaved_to_sharded writer waiting on incorrect amount of data for uneven shards
- PR: #4402
- #1433: removed Device* and MemoryConfig from DeviceStorage
- PR: #4411
- #0: Increase fast dispatch post commit timeout and shorten full regressions because we no longer need that much time
- PR: #4412
- #4003: added ttnn.mean, ttnn.rsqrt and ttnn.pow and got rid of ttl use in ttnn_functional_t5. Updated ttnn.Tensor to store shape as ttnn.Shape
- PR: #4383
- Aliu/load base erisc
- PR: #4394
- #4399: add spell checker script for docs spellchecking
- PR: #4398
- #2134: Uplift UMD
- PR: #4400
- #0: fix memory leaks found in test_sfpu via valgrind
- PR: #4419
- Revert "#4399: add spell checker script spellcheck.sh should be read…
- PR: #4424
- #0: update llk.rst for minor ReST syntax
- PR: #4423
- #2934: Make one CommandQueue and one HW CommandQueue (SysmemWriter) per device
- PR: #4077
- #4003: convert ttl.tensor.Shape to tuple when using it in torch functions
- PR: #4426
- #4211: Fix HP targeting issues in main from cq-per-device changes
- PR: #4447
v0.39.0
📦 Uncategorized
- #0: Add extra sentence about use cases in somewhat vague terms
- PR: #3975
- #3824: cache weight tensors for mistral
- PR: #3973
- Npetrovic/power fp sweep
- PR: #3959
- #3918: Fix falcon7b perf profiling & add support to load weights from HF when weka is not mounted
- PR: #3863
- Rename KernelID -> KernelHandle and CircularBufferID -> CBHandle
- PR: #3939
- Aliu/erisc cleanup
- PR: #3989
- #3003: ttnn program logging
- PR: #3987
- Watcher output/doc tweaks
- PR: #3998
- #4014: added support for uint16 datatype
- PR: #4015
- #4000: Add links to demo folders in note in first 5 things
- PR: #4012
- #3751: Fix sfpu load/store of ints
- PR: #4016
- enable watcher for stress test actions
- PR: #4021
- #3058: Give first pass at flattening build by getting rid of tt-metal intermediate libs
- PR: #4011
- Revert "#3058: Give first pass at flattening build by getting rid of …
- PR: #4042
- #3219: Added host functions which tilize and untilize bfloat16 vectors
- PR: #4038
- stress test machine config update
- PR: #4025
- #0: update to use concat on device
- PR: #4010
- #3895: ttnn functional optimized Bert
- PR: #4020
- #4014: Fix bug with packing uint16 datatype
- PR: #4050
- #3824: move mistral embedding weights to weka
- PR: #4028
- #3978: Fix readme to instruct running pytest without warnings
- PR: #3984
- Dma/3467 dprint cleanup
- PR: #4018
- #0: identity operator for comparison of SFPU ops
- PR: #4019
- #3058: Add tracy back into build and test with ENABLE_TRACY=1
- PR: #4047
- #3979: Add support for ResNet for weka unmounted machines to download ImageNet
- PR: #4066
- #3990: Remove DPRINT SETW sticky bit
- PR: #4081
- #4041: Add moreh_layernorm op
- PR: #4045
- #4044: Add moreh_softmax, moreh_softmin ops
- PR: #4060
- #3103: profile the SFPU operators
- PR: #4075
- #0: function typo fix
- PR: #4100
- #3211: bug in WH B0 - sum along dim3
- PR: #4099
- Implementation for Bert Sharded Batch 12
- PR: #4093
- #4069: Avoid reading out of bounds in the hugepage
- PR: #4098
- #4014: Add testing for uint16 and uint32 on device
- PR: #4094
- #0: Disable TestPrintRaiseWait gtest until a fix for nondet issue is in
- PR: #4123
- Move hugepages section and refer to public syseng instructions for accelerator-level dependencies
- PR: #4124
- #4055: non-deterministic test_pow_fractional PCC error with watcher enabled
- PR: #4129
- #0: update test_sfpu and profiling conflict
- PR: #4128
- #4043: Add discord link to docs support page + README
- PR: #4134
- Noc on erisc
- PR: #4046
- #3894: backward ops for tt-metal
- PR: #4054
- #3972: Update tracy and device-side profiler docs
- PR: #4138
- #4085: update seed value and re-verify the reported bug
- PR: #4139
- #2860: Init one UMD per MMIO device ID and the remote devices it controls
- PR: #4080
- #4074: Add opened, reopened, synchronize pull_request triggers (default) for static checks pipeline
- PR: #4152
- #0: Ignore /device, not device/ in .gitignore
- PR: #4153
- #4074: Add wording to CONTRIBUTING.md to be open to future forks + to discourage clogging up pipelines with too many PRs
- PR: #4155
- #4053: Upgrade driver from 1.23 to 1.26 in release assets from syseng
- PR: #4133
- #4065: Update pinned python3.8-venv to 20.04.9 because 20.04.8 is gone
- PR: #4135
- #4096: Fix issue with DPRINT server closing too early for some WAITs
- PR: #4130
- #4053: Add chmod ugo+x step in ansible scripts for copying over script assets
- PR: #4167
- #4109: ttnn examples.rst needs update
- PR: #4149
- #4158: support full repeat interleave developed for Mistral
- PR: #4113
- #4076: Add instructions for execution for programming_examples and fix one typo
- PR: #4168
- #0: (MINOR) Bump minor to v0.39.0
- PR: #4175
- #4053: Get rid of FW labels for silicon runner targets
- PR: #4169
- #3752: update ttnn tutorials and make them more descriptive
- PR: #4178
- #3994: Add bfloat16 dtype to sweep tests
- PR: #4090
- #0: update ownership for SFPU ops profiler, and Backward ops code
- PR: #4179
- #3420: move init erisc info to clear l1 call
- PR: #4166
- #3918: Add falcon caching support
- PR: #4185
- #4125: Refactor tests for backward ops
- PR: #4180
- Perf bloom
- PR: #4095
- #4121: Unset TT_METAL_SLOW_DISPATCH_MODE when empty string in yaml. R…
- PR: #4182
- #4079: Remove dprints from op kernels
- PR: #4191
- #4176: uplift umd to include create-eth-map fixes
- PR: #4195
- #4017: Replace static device APIs to query num available devices and num available pcie devices with standalone host APIs
- PR: #4190
- Fixup some error messages
- PR: #4209
- Rework build system
- PR: #4192
- #4228: Revert umd change to see if seg faults go away
- PR: #4229
- #4003: use if-else instead of try-except in ttnn.reshape and ttnn.permute
- PR: #4235
- #4003: updated ttnn.model_preprocessing to keep the structure of the model weights
- PR: #4196
- #0: Changing name for major places from Metal to Metalium
- PR: #4239
- #4186: Move all assets except for setup_hugepages.py to internal workflows
- PR: #4189
- #4003: run test_performance_of_bloom_for_question_answering using L1 Config and assuming fused softmax
- PR: #4238
- #3003: updated ttnn tests
- PR: #4242
v0.38.0
📦 Uncategorized
- #3820: Trunc fallback op
- PR: #3822
- #3703: Support power with non integer exponent: tt_lib.tensor.power_fp
- PR: #3821
- #308: Add a new test for coverage of previous issue with dprinting float consts from ncrisc
- PR: #3818
- #0: Update UMD submodule and add cluster wrapper for get_pcie_base_addr_from_device
- PR: #3688
- ttnn - added Bert
- PR: #3660
- Remove asserts and enable lto for release builds
- PR: #3806
- #2220: Use new UMD apis to get PCIe address ranges
- PR: #3836
- #3814: Use UMD fast write path to update the CQ write pointer, clean up the names of the write/read core APIs so they do not reference DRAM
- PR: #3833
- #0: Fix the repeat interleave doc
- PR: #3817
- #3003: use log_debug instead of log_info for logging operations
- PR: #3845
- Revert "#2220: Use new UMD apis to get PCIe address ranges"
- PR: #3855
- Update get_started.rst
- PR: #3861
- #0: Remove kkwong from CODEOWNERS
- PR: #3864
- #0: Fix scatter op
- PR: #3802
- #3829: Add new void* enqueue apis
- PR: #3860
- #2516: Remove datacopy into uint32_t vector now that we have void* apis
- PR: #3866
- #3640: eltwise binary op perf optimization
- PR: #3871
- #0: Fix microbenchmark csv artifact path
- PR: #3837
- #3568: Move weights dtype from bfloat16 to bfp8 in mistral model
- PR: #3775
- Fix SPDX headers to be machine readable
- PR: #3865
- #3804: Split device perf job into separate workflow from E2E perf
- PR: #3879
- #0: Update untilizewithunpad to support some cases of unpadding width in width sharding
- PR: #3878
- #2498: Upload syseng assets as part of release
- PR: #3876
- #0: (MINOR) Update to v0.38.0
- PR: #3883
- #2498: Revert "#2498: REVERT ME - test out release pipeline without r…
- PR: #3884
- Update llama-2 version
- PR: #3840
- #3566: support mistral model for generic batch size
- PR: #3848
- #3718: Link multicasts that use the same path to avoid multiple path reservations in a row
- PR: #3842
- remove UpdateRuntimeArg
- PR: #3877
- #3704: Increase size of trisc1 code hole for now
- PR: #3858
- Doc update for EnqueueReadBuffer
- PR: #3912
- Env variable cleanup
- PR: #3906
- Documenting Compute Kernels API Sprint
- PR: #3653
- #3647: Add fix for test for polyval coeffs generation
- PR: #3923
- #0: mistral code refactor and reuse variables
- PR: #3916
- Codeowners update
- PR: #3907
- #3914: Apply scatter for mistral model
- PR: #3922
- Rewrote ttnn_optimized_multi_head_attention using only ttnn operations
- PR: #3911
- Update models' landing page
- PR: #3940
- #3904: First docs changes for Project Grayskull
- PR: #3919
- Adding compute kernel api docs for untilize, tilize, unpack, tile_move_copy and reg_api
- PR: #3941
- document compute_kernel_api/matmul.h, compute_kernel_api/pack.h, and compute_kernel_api/bcasth.h
- PR: #3937
- #3887: repeat operator implementation
- PR: #3920
- restrict my ownership to host API docs only
- PR: #3944
- #0: update profiling for unary ops
- PR: #3956
- #2220: Redo use new UMD apis to get PCIe address ranges
- PR: #3925
- Merge latest resnet optimizations
- PR: #3935
- Add support for eth kernels full stack
- PR: #3773
- #0: Update docs on device side profiler
- PR: #3958
- #3913: Update mem config for the mistral modules
- PR: #3921
- #3003: updated links to steps 3 and 4 of getting started
- PR: #3964
- #3830: Fix CB failures in perf pipelines
- PR: #3938
- #0: enable test for wormhole, use eps from device
- PR: #3963
- #3003: Adding ttnn_functional_bloom
- PR: #3872
- #3926: refactored run_device_operation to commonize the logic of runn…
- PR: #3966
- #0: add --tile-factor, --use-L1, --use-DRAM, or --help options
- PR: #3967
- Moreh Matmul Op
- PR: #3851
v0.37.0
Metal
API Changes
- Top-level API to create a Program:
  Program CreateProgram();
- GetRuntimeArgs now returns a reference to the underlying runtime args to allow in-place updates. This results in noticeably better performance for host-bound workloads:
  std::vector<uint32_t>& GetRuntimeArgs(const Program &program, KernelID kernel_id, const CoreCoord &logical_core);
- Two other variants for updating runtime arguments that result in better host-side performance in certain situations:
  void UpdateRuntimeArg(const Program &program, KernelID kernel, const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec, size_t offset, uint32_t value);
  void SetRuntimeArgs(const Program &program, KernelID kernel, const std::vector< CoreCoord > & core_spec, const std::vector< std::vector<uint32_t> > &runtime_args);
  (NOTE: UpdateRuntimeArg will be removed in the next release, as its use has been superseded by the other functions.)
- GetCircularBufferConfig now returns a const reference:
  const CircularBufferConfig &GetCircularBufferConfig(Program &program, CircularBufferID cb_handle);
- Circular buffer config parameters are now updated through three separate functions:
  void UpdateCircularBufferTotalSize(Program &program, CircularBufferID cb_handle, uint32_t total_size);
  void UpdateCircularBufferPageSize(Program &program, CircularBufferID cb_handle, uint8_t buffer_index, uint32_t page_size);
  void UpdateDynamicCircularBufferAddress(Program &program, CircularBufferID cb_handle, const Buffer &buffer);
- Moved slow/host dispatch APIs to the detail namespace:
  void LaunchProgram(Device *device, Program &program);
  void ReadFromBuffer(const Buffer &buffer, std::vector<uint32_t> &host_buffer);
  void WriteToBuffer(const Buffer &buffer, const std::vector<uint32_t> &host_buffer);
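The benefit of returning a reference from GetRuntimeArgs can be illustrated with a minimal, self-contained sketch. It does not use the tt-metal headers: ToyProgram, ToyGetRuntimeArgs and PatchOneArg are hypothetical stand-ins, and the storage layout is an assumption made only for the example. It shows the general pattern of patching one value in place instead of rebuilding and re-submitting the whole argument vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not the tt-metal types.
using KernelID = std::uint32_t;
struct CoreCoord { std::size_t x; std::size_t y; };

struct ToyProgram {
    // Runtime args keyed by (kernel, core.x, core.y). Assumed layout, not the real one.
    std::map<std::tuple<KernelID, std::size_t, std::size_t>, std::vector<std::uint32_t>> args;
};

// Returns a mutable reference, so the caller edits the stored vector directly.
std::vector<std::uint32_t> &ToyGetRuntimeArgs(ToyProgram &program, KernelID kernel, const CoreCoord &core) {
    return program.args[std::make_tuple(kernel, core.x, core.y)];
}

// Patches a single argument in place and returns the value now held in storage.
std::uint32_t PatchOneArg(ToyProgram &program, KernelID kernel, const CoreCoord &core,
                          std::size_t offset, std::uint32_t value) {
    std::vector<std::uint32_t> &args = ToyGetRuntimeArgs(program, kernel, core);
    if (args.size() <= offset) {
        args.resize(offset + 1, 0);  // grow with zeros if the slot does not exist yet
    }
    args[offset] = value;  // only one element changes; the vector is not copied
    return program.args.at(std::make_tuple(kernel, core.x, core.y))[offset];
}
```

Because the accessor hands back a reference, a host loop that changes one argument per iteration touches a single element rather than copying the vector each time.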
Tools - Profiler
- Updating the path for all profiler artifacts to be under generated/profiler folder
ttNN
Infrastructure
- Introduced ttnn.embedding to facilitate word embeddings
- Added preprocess_parameters for generic conversion of torch parameters with caching
- Added ttnn.experimental.gelu
- Added ttnn.experimental.layer_norm
- Updated program hash to be std::size_t and significantly sped up its computation
Operations
- Splitting a tensor of shape [W, Z, Y, X] into two is now supported along Y in addition to the existing X.
- Added trunc as a fallback op, equivalent to torch.trunc
- Added support for the power function with a non-integral exponent: tt_lib.tensor.power_fp()
- Added support for the reshape operator on host for ROW_MAJOR layout
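As a reference for the numerical behaviour of the trunc and fractional-power entries above, here is a small host-side sketch in standard C++. It says nothing about how the device ops are implemented; TruncTowardZero and PowerFractional are hypothetical helper names, and the exp/log identity is just one common way to evaluate a non-integral power for a positive base.

```cpp
#include <cassert>
#include <cmath>

// Truncation drops the fractional part, i.e. rounds toward zero,
// which is the behaviour torch.trunc documents.
double TruncTowardZero(double x) {
    return std::trunc(x);
}

// For base > 0, x^p can be written as exp(p * ln(x)), which is defined
// for any real exponent p, not only integers.
double PowerFractional(double base, double exponent) {
    return std::exp(exponent * std::log(base));
}
```

Note that truncation differs from floor for negative inputs: -2.7 truncates to -2, while floor gives -3.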
Models
Notes not available.
v0.36.1
Metal
Wormhole Bringup
- Added some APIs to query device ethernet connectivity.
- Added first phase of ethernet data movement support, basic unit tests passing on N300.
API Changes
Notes not available.
Tools - Profiler
- Device only and host only profiling options for profile_this.py script
- Examples for fast dispatch device program profiling
Tools - Watcher
- Added kernel names/paths to watcher log file
Extra features
Notes not available.
Eager/ttNN
Infrastructure
- Added initial implementation of TTNN APIs
- Added functions to interface with torch: from_torch, to_torch
- Added functions to move tensor to/from device: to_device, from_device
- Added functions to change the layout of the tensor: to_layout
- Added matmul, add, sub, mul, reshape, permute and softmax operations
- Implemented Multi-Head-Attention using TTNN APIs
- Added 3 tutorials to showcase TTNN
- Updated Documentation to describe TTNN and its APIs
Operations
The following on-device operators were added to the tt_lib.tensor module:
- interleave repeat
- triu
- tril
- rmsnorm
- groupnorm
- silu (update to be first-class unary operator)
Models
- For BERT demo, added loading of cached pre-processed weights (stored as TT tensors) to avoid conversion from Torch to TT tensors.
- Added demo for ResNet that executes on TT hardware. Demo takes images from ImageNet and processes them in batches of 8.
v0.35.0
Metal
Wormhole Bringup
- Extended gtests to run on all available devices in Wormhole systems.
- Single device tests passing on remote chips.
API Changes
- These 2 functions:
  uint32_t CreateSemaphore(Program &program, const CoreRange &core_range, uint32_t initial_value)
  uint32_t CreateSemaphore(Program &program, const CoreRangeSet &core_range_set, uint32_t initial_value)
  have been replaced by:
  uint32_t CreateSemaphore(Program &program, const std::variant<CoreRange,CoreRangeSet> &core_spec, uint32_t initial_value)
- These 3 functions:
  void SetRuntimeArgs(const Program &program, KernelID kernel, const CoreCoord &logical_core, const std::vector<uint32_t> &runtime_args)
  void SetRuntimeArgs(const Program &program, KernelID kernel, const CoreRange &core_range, const std::vector<uint32_t> &runtime_args)
  void SetRuntimeArgs(const Program &program, KernelID kernel, const CoreRangeSet &core_range_set, const std::vector<uint32_t> &runtime_args)
  have been replaced by:
  void SetRuntimeArgs(const Program &program, KernelID kernel, const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec, const std::vector<uint32_t> &runtime_args)
- These 2 functions:
  KernelID CreateDataMovementKernel(Program &program, const std::string &file_name, const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec, const std::optional<DataMovementConfig> &config = {})
  KernelID CreateComputeKernel(Program &program, const std::string &file_name, const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec, const std::optional<ComputeConfig> &config = {})
  have been replaced by:
  KernelID CreateKernel(Program &program, const std::string &file_name, const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec, const std::variant<DataMovementConfig,ComputeConfig> & config)
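The consolidation above replaces several overloads with one function that takes a std::variant. The sketch below shows, with stand-in types, how a single entry point might dispatch on such a core_spec using std::visit. The struct definitions, the inclusive-corner convention for CoreRange, and the CountCores helper are assumptions made for the example and are not taken from the tt-metal sources. It requires C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <variant>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not the tt-metal types.
struct CoreCoord { std::size_t x; std::size_t y; };
struct CoreRange { CoreCoord start; CoreCoord end; };  // assumed inclusive corners
struct CoreRangeSet { std::vector<CoreRange> ranges; };

using CoreSpec = std::variant<CoreCoord, CoreRange, CoreRangeSet>;

std::size_t CountCoresInRange(const CoreRange &r) {
    return (r.end.x - r.start.x + 1) * (r.end.y - r.start.y + 1);
}

// One function accepts all three spellings of "which cores".
// Ranges in a set are assumed not to overlap.
std::size_t CountCores(const CoreSpec &core_spec) {
    return std::visit(
        [](const auto &spec) -> std::size_t {
            using T = std::decay_t<decltype(spec)>;
            if constexpr (std::is_same_v<T, CoreCoord>) {
                return 1;
            } else if constexpr (std::is_same_v<T, CoreRange>) {
                return CountCoresInRange(spec);
            } else {
                std::size_t total = 0;
                for (const CoreRange &r : spec.ranges) {
                    total += CountCoresInRange(r);
                }
                return total;
            }
        },
        core_spec);
}
```

Callers can pass a CoreCoord, CoreRange or CoreRangeSet directly, since each converts implicitly to the variant parameter.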
Tools - Profiler
- Improved profile_this.py log management strategy to avoid conservative log folder checks from profiling
Extra features
- Runtime Compute Args: Arguments can be sent to Compute Kernels at runtime in the same way as to DataMovement Kernels. The kernel uses the same get_arg_val<type>(<index>) to retrieve them. The host uses the same tt_metal::SetRuntimeArgs(Program program, KernelID kernel, const std::variant<CoreCoord, CoreRange, CoreRangeSet> & core_spec, const std::vector<uint32_t> &runtime_args) call that it uses to communicate with DataMovement Kernels.
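The host/kernel contract described for runtime args (the host supplies a vector of uint32_t, the kernel reads a typed value by index) can be modelled on the host with a short sketch. ToyArgSlots, pack_float_arg and toy_get_arg_val are hypothetical names that imitate the shape of get_arg_val<type>(<index>); the real kernel-side function reads from device memory and is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical host-side model of the per-core argument slots (one 32-bit word each).
struct ToyArgSlots {
    std::vector<std::uint32_t> words;
};

// Packs a float into a 32-bit slot by copying its bit pattern.
std::uint32_t pack_float_arg(float value) {
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof(bits));
    return bits;
}

// Reads slot `index` reinterpreted as T. Restricted to 4-byte trivially copyable types.
template <typename T>
T toy_get_arg_val(const ToyArgSlots &slots, std::size_t index) {
    static_assert(sizeof(T) == sizeof(std::uint32_t), "each argument occupies one 32-bit slot");
    static_assert(std::is_trivially_copyable<T>::value, "argument type must be trivially copyable");
    T value;
    std::memcpy(&value, &slots.words.at(index), sizeof(T));  // at() throws on a bad index
    return value;
}
```

In this model the same slot vector serves any kernel type, which mirrors the note that compute kernels receive arguments the same way data movement kernels do.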
Eager (Ops)
There have been no notable changes to communicate in this release.
Models
- Moved code that implements and tests models from tests/models to top level models folder. In the models folder, models are separated into demos (working models with end2end demo code) and experimental (models that are under development).
- Added implementation of Falcon7B for GS and PyTorch demos for nanoGPT and T5
- Added BERT Large end2end demo on GS (set up for question answering)