Model Updates

Note

Please refer to the front-page README for the latest verified release for each model.

January 13, 2025

  • Integrated Llama3 models (1B/3B/8B/11B/70B) into the vLLM fork for all compatible Tenstorrent devices (N150/N300/T3000/Galaxy).
  • Enabled prefill with the maximum context length (131072) when running the Llama3 text models on smaller devices (N150/N300) via chunked prefill.
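
To illustrate the idea behind chunked prefill, the sketch below splits a long prompt into fixed-size chunks that would be prefilled one after another. It is a generic illustration, not the code used in this repository: the helper name `chunk_ranges` and the chunk size of 4096 tokens are assumptions chosen for the example.

```python
from typing import Iterator, Tuple


def chunk_ranges(seq_len: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) token ranges covering seq_len in order.

    Hypothetical helper: each range would be prefilled separately, so only
    one chunk of activations has to fit on the device at a time.
    """
    if seq_len < 0 or chunk_size <= 0:
        raise ValueError("seq_len must be >= 0 and chunk_size must be > 0")
    for start in range(0, seq_len, chunk_size):
        yield start, min(start + chunk_size, seq_len)


# Assumed chunk size of 4096 tokens, for illustration only.
chunks = list(chunk_ranges(131072, 4096))
print(len(chunks), chunks[0], chunks[-1])
```

The last chunk is simply shorter when the prompt length is not a multiple of the chunk size.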

December 16, 2024

  • Added support for batch size 32 and the maximum context length (131072 tokens).
  • Added full hardware compatibility for the 1B/3B/8B/11B/70B models. All models now run on N150, N300, QuietBox, and Galaxy, except 70B, which is supported only on QuietBox and Galaxy due to its size.

December 2, 2024

  • Improved the decode performance of the 1B/3B/8B/11B text models (for 8B, from ~23 t/s/u to ~28 t/s/u, where t/s/u is tokens per second per user) by using BFP4 weights (instead of BFP8) for FF1 and FF3 in the MLP.
  • Added the option to specify custom model configurations, with two defaults for performance and accuracy already provided.

November 18, 2024

  • Created a new shared codebase for the Llama3 family of models, with newly added support for Llama3.2-1B/3B/11B.
  • Added support for the ttnn.experimental.rotary_embedding_llama op in decode mode, eliminating unnecessary device transfers of rotation matrices.

October 21, 2024

  • Enabled prefill workloads to pad to multiples of 1024 instead of powers of 2, improving overall performance for longer sequences
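
The saving from this change is plain arithmetic, sketched below. The two helper functions are hypothetical and written only for this comparison; they are not taken from the codebase.

```python
def pad_to_multiple(seq_len: int, multiple: int = 1024) -> int:
    """Round seq_len (>= 1) up to the nearest multiple of `multiple`."""
    return ((seq_len + multiple - 1) // multiple) * multiple


def pad_to_power_of_two(seq_len: int) -> int:
    """Round seq_len (>= 1) up to the nearest power of two."""
    return 1 << (seq_len - 1).bit_length()


# Compare the padded lengths for a few example prompt sizes.
for n in (1000, 5000, 40000):
    print(n, pad_to_power_of_two(n), pad_to_multiple(n))
# 1000 1024 1024
# 5000 8192 5120
# 40000 65536 40960
```

For a 40000-token prompt, padding to a power of two processes 65536 positions, while padding to a multiple of 1024 processes 40960, so the wasted work shrinks as sequences grow.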

October 7, 2024

  • Added support for continuous batching
  • Added paged caching support for PagedAttention
  • Added a demo which runs with TT-NN tracing (23 t/s/u decode on main)

September 23, 2024

  • Added support for 128K context length using PagedAttention
  • Added a continuous batching demo for running multiple batches of users consecutively
  • Added the option to enable TT-NN tracing

September 9, 2024

Note: This feature is available as of release v0.52.0-rc1

  • Added support for any user prompt size up to a maximum of 32k tokens

August 26, 2024

  • Added data parallel demo for a single Galaxy (32 chips)
  • Refactored all modules and tests to use ttnn multi-device tensors

Note: This feature is available as of release v0.51.0-rc33

  • Added multi-batching support to the demo for running multiple batches of users consecutively
  • Improved end-to-end performance through optimizations to the attention mask in flash decoding

August 12, 2024

  • Added support for flash decoding
  • Updated the demo to support multiple batches of users
  • Updated the demo to use the full prefill graph instead of processing a single token of the prompt at a time using decode
  • Added support for decode with 32K context length using flash decoding
  • Fused mixture of experts into a single operation using ttnn.moe

July 29, 2024

  • Added support for LLaMA 3.1 - 8B
  • Runs fast prefill for sequence lengths of up to 512 tokens
  • Supports a maximum context length of 8K tokens
  • Added support for LLaMA 3.1 70B (new scaled rotary position embeddings)
  • Prefill and decode now support 8K context length with batch size 16
  • Added prefill support for 4K context length, using scaled dot product attention