
Llama3 model family - list of required ops for blackhole #16013

Open
mtairum opened this issue Dec 13, 2024 · 30 comments


mtairum commented Dec 13, 2024

This issue lists the ops required for the Llama3-8B model (and the rest of the Llama3 model family).

Looking at the current list of supported Blackhole ops, the following seem to be the ops we'll require to properly support the Llama3 family on Blackhole:

Prefill only ops:

  • ttnn.transformer.scaled_dot_product_attention (if not using chunks)
  • ttnn.transformer.chunked_scaled_dot_product_attention (if using chunks; not in the traces. This is the same op as the previous one, but with a page table and a chunk start index. A conceptual sketch follows this list.)
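
Concretely, the non-chunked version corresponds to a generic causal SDPA over the prompt; a minimal PyTorch reference (this is not the ttnn kernel, and the shapes below are illustrative rather than taken from the traces):

import torch
import torch.nn.functional as F

# Illustrative prefill shapes: [batch, n_heads, seq_len, head_dim] (not taken from the traces).
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# Causal self-attention over the whole prompt, which is what the prefill SDPA op computes.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 128, 64])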

Below are the graph trace and the perf trace with extra info on the ops (including memory configs and shapes).

Updated traces [14 Jan 2025]

Please use these new traces for the 1B, 8B, and 70B Llama3 models. These include both prefill and decode and were taken by running the demo.py script with 1L (a single layer) for 10 iterations.

[OLD] Graph Trace

The list of ops was generated with the ttnn graph trace:

import ttnn

ttnn.graph.begin_graph_capture(ttnn.graph.RunMode.NORMAL)

# (...) Llama3-8B model run

captured_graph = ttnn.graph.end_graph_capture()  # End capturing the graph
ttnn.graph.pretty_print(captured_graph)
ttnn.graph.visualize(captured_graph, file_name="graph.svg")

llama8b-1L-op_graph.txt

Ops Perf report

Generated with tracy, the ops perf report includes the memory configs and input shapes of the required ops.
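
For quick inspection, the report can be loaded with pandas; a minimal sketch below (the file name, the "OP CODE" column, and the op names match the snippet and value counts further down in this thread):

import pandas as pd

# Load the tracy ops perf report attached above.
df = pd.read_csv("llama8B-1L-model-ops-perf.csv")

# Ops of particular interest for Blackhole bring-up, named as they appear in the report.
ops_of_interest = [
    "ScaledDotProductAttentionDecode",
    "RotaryEmbeddingLlama",
    "PagedUpdateCacheDeviceOperation",
    "LayerNorm",
]
print(df[df["OP CODE"].isin(ops_of_interest)]["OP CODE"].value_counts())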

Llama3-70B

We'll also want support for the Llama3-70B ops, which are mostly the same but with different input sizes.
In this section I list the new ops separately and provide the ops perf report.

Additional ops:

  • ttnn.all_gather
  • ttnn.reduce_scatter

Ops Perf report

mtairum added the "bug: Something isn't working" label Dec 13, 2024
mtairum self-assigned this Dec 13, 2024
mtairum added the "blackhole" label and removed the "bug: Something isn't working" label Dec 13, 2024
mtairum removed their assignment Dec 13, 2024
mtairum changed the title from "Llama3-8b - blackhole ops" to "Llama3-8b - list of required ops for blackhole" Dec 13, 2024

mtairum commented Dec 13, 2024

@prajaramanTT Are you the right person to tag on this issue?

In the model team we want to understand what's the current op support in blackhole and what's missing for us to support Llama3.

For now, this issue is listing Llama3-8B, which will run on a single device. We want to provide the list of ops + shapes required so those can be added to the ttnn op sweep tests soon.

Let me know of next steps and please tag other relevant people on this 🙇

abhullar-tt added this to the BHLD milestone Dec 13, 2024

mtairum commented Dec 16, 2024

FYI @uaydonat

mtairum changed the title from "Llama3-8b - list of required ops for blackhole" to "Llama3 model family - list of required ops for blackhole" Dec 16, 2024

mtairum commented Dec 16, 2024

Added ops for Llama3-70B as well.
It's mostly the same as 8B, but since it runs exclusively multichip, it also needs the CCL ops: all-gather and reduce-scatter.
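
For reference, the semantics of the two collectives as a small NumPy sketch (just the math they implement, not the ttnn multichip API):

import numpy as np

# One local tensor per device (4 "devices" here, purely illustrative).
per_device = [np.full((2, 4), fill_value=d, dtype=np.float32) for d in range(4)]

# all_gather: every device ends up with the concatenation of all local tensors along a dim.
all_gather_out = np.concatenate(per_device, axis=-1)    # shape (2, 16) on every device

# reduce_scatter: element-wise reduce (sum) across devices, then each device keeps one slice.
reduced = np.sum(np.stack(per_device, axis=0), axis=0)  # shape (2, 4)
reduce_scatter_out = np.split(reduced, 4, axis=-1)      # one (2, 1) slice per device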

ntarafdar commented:

@yugi957 has done great work: most TMs are accounted for and work on BH (embedding, slice, transpose, sharded_to_interleaved). He will test the remaining ones tomorrow (interleaved_to_sharded, concat).

ntarafdar commented:

@yugi957 has confirmed that all the TMs that worked on WH for Llama also work on BH.


mtairum commented Jan 8, 2025

@ntarafdar that's great to hear.

What about the other ops, such as the experimental ones (rotary embedding, paged scaled dot product attention, etc.)? Any chance of adding these to the BH sweeps and testing them there?

These will be crucial for supporting transformer-based LLMs on BH.

bbradelTT commented:

@vsureshTT will look at

  • ttnn.layer_norm
  • ttnn.arg_max

I created #16525 to track that effort.


uaydonat commented Jan 8, 2025

@cmaryanTT mentioned that someone will be assigned to custom ops (e.g. nlp_create_qkv_heads_decode, rotary_embedding_llama, paged_scaled_dot_product_attention_decode, etc.).

cmaryanTT commented:

@ntarafdar will be assigning someone in his group to look at the custom ops.

cmaryanTT commented:

@yugi957 and @amorrisonTT will be looking at the custom transformer ops. ETA Monday.

  • ttnn.experimental.nlp_create_qkv_heads_decode
  • ttnn.experimental.rotary_embedding_llama (a generic RoPE reference sketch follows this list)
  • ttnn.experimental.paged_update_cache
  • ttnn.transformer.paged_scaled_dot_product_attention_decode
  • ttnn.experimental.nlp_concat_heads_decode
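
For reference, a minimal PyTorch sketch of what a generic RoPE computes; this is not the ttnn kernel, and the head_dim, theta, and interleaved-pair convention below are illustrative assumptions (the llama op may use the half-split convention instead):

import torch

def apply_rope(x, theta=10000.0):
    # x: [batch, n_heads, seq_len, head_dim]; rotate interleaved channel pairs by position-dependent angles.
    _, _, s, d = x.shape
    inv_freq = 1.0 / (theta ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = torch.outer(torch.arange(s, dtype=torch.float32), inv_freq)  # [seq_len, head_dim // 2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(1, 8, 32, 64)
q_rot = apply_rope(q)  # same shape, [1, 8, 32, 64]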

amorrisonTT commented:

These traces don't appear to contain:

  • ttnn.transformer.paged_scaled_dot_product_attention_decode (although there is scaled_dot_product_attention_decode)
  • ttnn.add
  • ttnn.mul

Is this expected?

cmaryanTT commented:

@amorrisonTT add and mult show up as "BinaryDeviceOperation" with the MULT or ADD type.

cmaryanTT commented:

I think "paged" in the other op is just a typo

amorrisonTT commented:

@amorrisonTT add and mult show up as "BinaryDeviceOperation" with the MULT or ADD type.

Thanks!

bbradelTT commented:

@mtairum the only reference to argmax is on row 1960 of llama-70b-1L-ops-pers.csv; it's for torch argmax and does not specify any inputs.

What do we need to check for ttnn.argmax?

uaydonat commented:

@amorrisonTT I think paged_scaled_dot_product_attention_decode is the right op. It is contained in 8B trace, not sure why it is not in 70B. @mtairum maybe you did not have Salar's vllm changes?

Also, we run argmax on device for batch=1, and on host if batch>1. I am guessing the traces are for batch=32, which is why you only see torch.argmax. It would be good to verify ttnn.argmax since we will ultimately extend it to batch=32 as well, but it is lower priority.


mtairum commented Jan 10, 2025

@amorrisonTT add and mult show up as "BinaryDeviceOperation" with the MULT or ADD type.

This is correct.

the only reference to argmax is on row 1960 of llama-70b-1L-ops-pers.csv and it's for torch argmax and does not specify any inputs.

Re: argmax, it's what Utku mentioned.
We use ttnn.argmax with multicore if batch_size==1 (otherwise it's single core), and the inputs are of shape [1, 1, 32, 128256] for either model.

tt_out_tok = ttnn.argmax(tt_out_rm, dim=3, use_multicore=False if batch_size > 1 else True, output_tensor=tt_out_tok)

And from a new tracy trace I just generated, this is the line for argmax:

ArgMax,tt_dnn_device,18972,1,{'dim': '3'; 'output_dtype': 'DataType::UINT32'; 'output_mem_config': 'MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED;buffer_type=BufferType::L1;shard_spec=std::nullopt)'; 'use_multicore': 'true'},,47,,52725463868,53033342829,307878961,656089241697,656089887765,303135332,646068,644470,,644470,,,,,,,1,1,32,128256,ROW_MAJOR,BFLOAT16,DEV_1_L1_INTERLEAVED,1,1,1,32,ROW_MAJOR,UINT32,DEV_1_DRAM_INTERLEAVED,,,,,,,,,,,,,,,,,,,,,,1,1,1,32,ROW_MAJOR,UINT32,DEV_1_DRAM_INTERLEAVED,,,,,,,,,,,,,,,,,[],[],['ttnn/cpp/ttnn/operations/reduction/argmax/device/kernels/reader_argmax_interleaved_multicore.cpp'],['reader_argmax_interleaved_multicore/7829460339933776415/'],0,1344,0,0,0,0,1,1,1,[8208384.0],[128.0],307448723,3593
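
For completeness, a host-side reference for that call, with shapes taken from the trace line above (torch is only a stand-in here; the device op is ttnn.argmax):

import torch

# Decode logits for 32 users over the 128256-entry vocab, per the trace line above.
logits = torch.randn(1, 1, 32, 128256, dtype=torch.bfloat16)

# Reference for ttnn.argmax(..., dim=3): one token id per user.
# The device op emits [1, 1, 1, 32] UINT32; int32 is used here since torch uint32 support is limited.
ref_tokens = torch.argmax(logits, dim=3).reshape(1, 1, 1, 32).to(torch.int32)
print(ref_tokens.shape)  # torch.Size([1, 1, 1, 32])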

I think paged_scaled_dot_product_attention_decode is the right op. It is contained in 8B trace, not sure why it is not in 70B. @mtairum maybe you did not have Salar's vllm changes?

We care about both:

  • scaled_dot_product_attention_decode
  • paged_scaled_dot_product_attention_decode

It's not clear to me why the 70B trace above is not using the paged implementation; I'm pretty sure I ran the same config on both. In either case we support both versions, and both have the same input shapes.

amorrisonTT commented:

I think paged_scaled_dot_product_attention_decode is the right op. It is contained in 8B trace, not sure why it is not in 70B. @mtairum maybe you did not have Salar's vllm changes?

We care about both:

  • scaled_dot_product_attention_decode
  • paged_scaled_dot_product_attention_decode

It's not clear to me why the 70B trace above is not using the paged implementation; I'm pretty sure I ran the same config on both. In either case we support both versions, and both have the same input shapes.

I didn't see paged_scaled_dot_product_attention_decode in either attached trace:

import pandas as pd

PERF_FILES = ["llama8B-1L-model-ops-perf.csv", "llama-70b-1L-ops-pers.csv"]
df = pd.concat([pd.read_csv(f) for f in PERF_FILES])
df["OP CODE"].value_counts()

OP CODE
Matmul                                 249
InterleavedToShardedDeviceOperation    147
LayerNorm                              147
BinaryDeviceOperation                  147
ReshardDeviceOperation                 145
AllGather                              144
ShardedToInterleavedDeviceOperation    101
Embeddings                              98
Transpose                               98
SliceDeviceOperation                    98
RotaryEmbeddingLlama                    98
PagedUpdateCacheDeviceOperation         98
(torch) __getitem__                     55
NLPCreateHeadsDecodeDeviceOperation     49
ScaledDotProductAttentionDecode         49
NLPConcatHeadsDecodeDeviceOperation     49
ReduceScatter                           48
AllGatherMatmul                         48
(torch) cat                             36
(torch) reshape                         24
(torch) abs                             18
(torch) transpose                       12
(torch) item                            12
(torch) max                             12
(torch) sub                             12
(torch) __get__                          7
(torch) squeeze                          7
(torch) div                              6
(torch) allclose                         6
(torch) permute                          6
ConcatDeviceOperation                    1
(torch) argmax                           1
(torch) embedding                        1
(torch) tolist                           1
Name: count, dtype: int64

uaydonat commented:

Hmm, for 8B, the op graph has the paged_scaled_dot_product_attention_decode but not the trace.

It might be some wrong naming, because ScaledDotProductAttentionDecode has paged_attention: true in its arguments for both 8B and 70B.


mtairum commented Jan 13, 2025

Good point. @uaydonat is right.

I double-checked the op kernel and the paged version of the op is indeed set by an argument:
it executes ScaledDotProductAttentionDecode with paged_attention=True.
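
For whoever picks this up, a rough PyTorch sketch of what the paged decode attention does conceptually: gather the user's KV blocks through a page table, then run single-token attention. Block size, head counts, and cache layout below are illustrative assumptions, not the ttnn kernel's actual layout:

import torch
import torch.nn.functional as F

n_heads, head_dim, block_size, n_blocks = 8, 64, 32, 16

# Paged KV cache: [n_blocks, n_heads, block_size, head_dim], plus a page table mapping
# this user's logical blocks to physical blocks (illustrative layout).
k_cache = torch.randn(n_blocks, n_heads, block_size, head_dim)
v_cache = torch.randn(n_blocks, n_heads, block_size, head_dim)
page_table = torch.tensor([3, 7, 1])  # this user's sequence occupies 3 physical blocks
seq_len = 70                          # current length of this user's sequence

# Gather the user's KV from the paged cache and trim to the real sequence length.
k = k_cache[page_table].permute(1, 0, 2, 3).reshape(n_heads, -1, head_dim)[:, :seq_len]
v = v_cache[page_table].permute(1, 0, 2, 3).reshape(n_heads, -1, head_dim)[:, :seq_len]

# Decode: a single query token attends over the gathered cache (no causal mask needed).
q = torch.randn(n_heads, 1, head_dim)
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([8, 1, 64])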


amorrisonTT commented Jan 13, 2025

ttnn.experimental.nlp_create_qkv_heads_decode is failing with:

PCC value: Max ATOL Delta: 0.9921875, Max RTOL Delta: 79360.0, PCC: 0.49738503615123153, PCC check failed
2025-01-10 23:28:15.657 | INFO     | tests.tt_eager.python_api_testing.unit_testing.misc.test_nlp_create_qkv_heads_decode:run_test_create_head_interleaved:66 - PCC value: Max ATOL Delta: 0.98828125, Max RTOL Delta: 18944.0, PCC: 0.5012214626471699, PCC check failed
2025-01-10 23:28:15.660 | INFO     | tests.tt_eager.python_api_testing.unit_testing.misc.test_nlp_create_qkv_heads_decode:run_test_create_head_interleaved:70 - PCC value: Max ATOL Delta: 0.9921875, Max RTOL Delta: 7264.0, PCC: 0.4905412229394511, PCC check failed

See #16667


amorrisonTT commented Jan 13, 2025

ttnn.experimental.paged_update_cache consistently causes the machine to hang. See #16674

cmaryanTT commented:

Per @eyonland, ADD works; MULT has a non-determinism issue (#16662).

cmaryanTT commented:

@amorrisonTT can you please open issues for the problems you found?


ntarafdar commented Jan 13, 2025

@amorrisonTT when you create the update_cache and create_qkv_heads_decode issues, please assign them to @cglagovich.


amorrisonTT commented Jan 13, 2025

ttnn.transformer.scaled_dot_product_attention_decode (with and without the paged_attention flag) is failing, see #16673.


vsureshTT commented Jan 13, 2025

@mtairum
The shapes required for layer norm are:

  • input 0: [1,1,32,8192], input 1: [1,1,256,32]
  • input 0: [1,1,32,4096], input 1: [1,1,128,32]

I also linked a truncated version of the spreadsheet below with the layer norm rows isolated. (A PyTorch reference for these shapes is sketched after the attachment.)

LayerNorms1.csv
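
A quick functional reference for those shapes, assuming input 0 is the activation and input 1 is the 8192-element gamma packed as [1,1,256,32] (that packing, and the flattening order below, are my reading of the shapes rather than something confirmed in this thread):

import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 32, 8192)            # input 0: activations
gamma_packed = torch.randn(1, 1, 256, 32)  # input 1: weight, 256 * 32 == 8192 elements

# Reference layer norm over the hidden dim; the packed weight is flattened back to [8192]
# (row-major flattening is an assumption about the packing).
ref = F.layer_norm(x, normalized_shape=(8192,), weight=gamma_packed.reshape(8192))
print(ref.shape)  # torch.Size([1, 1, 32, 8192])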


mtairum commented Jan 14, 2025

I've added new traces and two new prefill only ops:

  • ttnn.transformer.scaled_dot_product_attention (if not using chunks)
  • ttnn.transformer.chunked_scaled_dot_product_attention (if using chunks; not in the traces. This is the same op as the previous one, but with a page table and a chunk start index)

The chunked version was not used in the trace, but it's basically a variation of the main op. This op is separate from the scaled dot product decode that's already being tested. A rough sketch of what the chunked variant computes is below.
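
Roughly, the chunked variant computes the same causal attention, but only for the queries of one prompt chunk starting at a chunk start index, attending to all keys up to each query's absolute position (the page table and the exact argument names are omitted here; shapes are illustrative):

import torch
import torch.nn.functional as F

n_heads, head_dim, seq_len, chunk, chunk_start = 8, 64, 1024, 256, 512

k = torch.randn(1, n_heads, seq_len, head_dim)  # full prompt K/V (fetched via the page table in the real op)
v = torch.randn(1, n_heads, seq_len, head_dim)
q = torch.randn(1, n_heads, chunk, head_dim)    # queries for one prefill chunk only

# A query at absolute position chunk_start + i may attend to keys 0 .. chunk_start + i.
q_pos = chunk_start + torch.arange(chunk)
k_pos = torch.arange(seq_len)
mask = k_pos[None, :] <= q_pos[:, None]         # [chunk, seq_len] boolean causal mask

out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out.shape)  # torch.Size([1, 8, 256, 64])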

cglagovichTT commented:

^ Note that the traces above also include these prefill-specific ops (a generic sketch of the two head ops follows the list):

  • ttnn.experimental.nlp_create_qkv_heads
  • ttnn.experimental.paged_fill_cache
  • ttnn.experimental.nlp_concat_heads
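
For context, the two head ops are essentially layout transforms; a generic sketch with illustrative sizes and a simple fused-QKV layout assumption (not the ttnn kernels themselves):

import torch

batch, seq, n_q_heads, n_kv_heads, head_dim = 1, 128, 32, 8, 128

# A fused QKV projection output: Q heads followed by K and V heads along the last dim (layout assumed).
fused = torch.randn(batch, seq, (n_q_heads + 2 * n_kv_heads) * head_dim)

# nlp_create_qkv_heads (conceptually): split the fused tensor and lay it out per head.
q, k, v = fused.split([n_q_heads * head_dim, n_kv_heads * head_dim, n_kv_heads * head_dim], dim=-1)
q = q.reshape(batch, seq, n_q_heads, head_dim).transpose(1, 2)   # [batch, n_q_heads, seq, head_dim]
k = k.reshape(batch, seq, n_kv_heads, head_dim).transpose(1, 2)
v = v.reshape(batch, seq, n_kv_heads, head_dim).transpose(1, 2)

# nlp_concat_heads (conceptually): undo the head split after attention.
attn_out = torch.randn(batch, n_q_heads, seq, head_dim)
concat = attn_out.transpose(1, 2).reshape(batch, seq, n_q_heads * head_dim)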

bbradelTT commented:

@vsureshTT ran the tests. The test scenarios failed.

I'll update the description with issue numbers.

I tried on WH and it seems that the behaviour is the same as on BH, which means that this may not block the model.
