Llama3 model family - list of required ops for blackhole #16013
Comments
@prajaramanTT Are you the right person to tag on this issue? In the model team we want to understand the current op support in Blackhole and what's missing for us to support Llama3. For now, this issue lists Llama3-8B, which will run on a single device. We want to provide the list of ops + shapes required so those can be added to the ttnn op sweep tests soon. Let me know the next steps and please tag other relevant people on this 🙇
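To make the ask concrete, here is a hypothetical example of the kind of op + shape listing we have in mind. The op names and shapes below are illustrative placeholders, not the actual required list, and the sweep framework may expect a different schema:

```python
# Hypothetical op/shape listing for illustration only; shapes are placeholders
# and the ttnn sweep framework may expect a different schema.
LLAMA3_8B_DECODE_OPS = {
    "ttnn.linear": [((1, 1, 32, 4096), (1, 1, 4096, 4096))],
    "ttnn.rms_norm": [((1, 1, 32, 4096),)],
    "ttnn.argmax": [((1, 1, 32, 128256),)],
}

for op_name, shape_sets in LLAMA3_8B_DECODE_OPS.items():
    print(op_name, shape_sets)
```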
FYI @uaydonat
Added ops for Llama3-70B as well.
@yugi957 has done great work: most TMs are accounted for and work on BH (embedding, slice, transpose, sharded_to_interleaved). He will test the remaining ones tomorrow (interleaved_to_sharded, concat).
@yugi957 has confirmed that all TMs that worked on WH for Llama also work on BH.
@ntarafdar that's great to hear. What about the other ops, such as the experimental ones (rotary embedding, paged scaled dot product attention, etc.)? Any chance these could be added to the BH sweeps and tested there? These will be crucial for supporting transformer-based LLMs on BH.
@vsureshTT will look at
I created #16525 to track that effort.
@cmaryanTT mentioned that someone will be assigned to custom ops (e.g. nlp_create_qkv_heads_decode, rotary_embedding_llama, paged_scaled_dot_product_attention_decode, etc.).
@ntarafdar will be assigning someone in his group to look at the custom ops.
@yugi957 and @amorrisonTT will be looking at the custom transformer ops. ETA Monday.
These traces don't appear to contain [...].
Is this expected?
@amorrisonTT add and mult show up as "BinaryDeviceOperation" with the MULT or ADD type.
I think "paged" in the other op is just a typo.
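For anyone grepping the perf CSVs, a minimal sketch of how to pull these rows out; the "OP CODE"/"ATTRIBUTES" column names are assumptions about the report layout and may differ between tt-metal versions:

```python
# Minimal sketch: find eltwise add/mul rows in the ops perf CSV.
# Column names ("OP CODE", "ATTRIBUTES") are assumed and may differ by version.
import pandas as pd

df = pd.read_csv("llama-70b-1L-ops-pers.csv")

# add/mul are reported under the shared BinaryDeviceOperation name;
# the concrete ADD/MULT variant is recorded in the op attributes.
binary_ops = df[df["OP CODE"].str.contains("BinaryDeviceOperation", na=False)]
print(binary_ops["ATTRIBUTES"].unique())
```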
Thanks!
@mtairum the only reference to argmax is on row 1960 of llama-70b-1L-ops-pers.csv; it's for torch argmax and does not specify any inputs. What do we need to check for ttnn.argmax?
@amorrisonTT I think [...]. Also, we run [...]
This is correct.
Re: argmax, it's what Utku mentioned.
And from a new tracy trace I just generated, this is the line for argmax:
We care about both:
It's weird to me that the 70B trace above is not using the paged implementation; I'm pretty sure I ran the same config on both. In either case we support both versions, and both have the same input shapes.
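For reference, a minimal sketch of the kind of ttnn.argmax call we'd want covered. The shape below is an illustrative placeholder for decode logits, not taken from the trace, and the exact keyword names may vary by ttnn version:

```python
# Illustrative only: argmax over the vocab dimension of decode logits.
# The [1, 1, 32, 128256] shape is a placeholder, not the traced shape.
import torch
import ttnn

device = ttnn.open_device(device_id=0)

logits = ttnn.from_torch(
    torch.randn(1, 1, 32, 128256),
    dtype=ttnn.bfloat16,
    layout=ttnn.TILE_LAYOUT,
    device=device,
)

# Pick the next token per user by reducing over the last (vocab) dimension.
next_tokens = ttnn.argmax(logits, dim=-1)
print(ttnn.to_torch(next_tokens).shape)

ttnn.close_device(device)
```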
I didn't see
Hmm, for 8B, the op graph has the [...]. It might be some wrong naming, because [...]
Good point. @uaydonat is right. I double-checked the op kernel and the paged version of the op is indeed set by an argument.
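As a purely illustrative sketch of that pattern (a hypothetical wrapper, not the actual model code or the real kernel signatures), the paged path is chosen by whether a page table is supplied:

```python
# Hypothetical dispatch pattern; the real ttnn/experimental op names and
# signatures may differ. paged_op/non_paged_op stand in for the real kernels.
def sdpa_decode(q, k_cache, v_cache, cur_pos, paged_op, non_paged_op,
                page_table=None):
    if page_table is not None:
        # Paged path: the K/V caches are addressed through the page table.
        return paged_op(q, k_cache, v_cache, cur_pos, page_table)
    # Non-paged path: contiguous K/V caches, same logical input shapes.
    return non_paged_op(q, k_cache, v_cache, cur_pos)
```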
See #16667.
@amorrisonTT can you please open issues for the problems you found?
@amorrisonTT when you make the update_cache and create_qkv_heads_decode issues, please assign them to @cglagovich.
@mtairum input 0: [1, 1, 32, 8192], input 1: [1, 1, 256, 32]. I also linked a truncated version of the spreadsheet below with the layernorm isolated.
I've added new traces and two new prefill-only ops:
The chunked version was not used in the trace, but it's basically a variation of the main op. This op is separate from the scaled dot product decode that's already being tested.
^ Note that the above traces also include these prefill-specific ops.
@vsureshTT ran the tests and the test scenarios failed. I'll update the description with the issue numbers. I tried on WH and the behaviour seems to be the same as on BH, which means this may not block the model.
This issue lists the ops required for the Llama3-8B model (and the rest of the Llama3 model family).
Looking at the current list of supported Blackhole ops, the following seem to be the ops required to properly support the Llama3 family on Blackhole:
Prefill-only ops:
Below are the graph trace and the perf trace with extra info on the ops (including memory configs and shapes).
Updated traces [14 Jan 2025]
Please use these new traces for the 1B, 8B and 70B Llama3 models. They include both prefill and decode and were taken by running the demo.py script with 1L, for 10 iterations.
Llama3-1B: llama-1b-1L-demo-ops.csv
Llama3-8B: llama-8b-1L-demo-ops.csv
Llama3-70B: llama-70b-1L-demo-ops.csv
[OLD] Graph Trace
The list of ops was generated with ttnn graph trace:
llama8b-1L-op_graph.txt
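For completeness, a rough sketch of how such a graph trace can be captured, assuming the ttnn.graph capture API; the exact entry points and run modes may differ between tt-metal versions:

```python
# Rough sketch of capturing an op graph similar to llama8b-1L-op_graph.txt.
# Assumes ttnn.graph.begin_graph_capture/end_graph_capture; names may differ.
import json
import ttnn

device = ttnn.open_device(device_id=0)

ttnn.graph.begin_graph_capture(ttnn.graph.RunMode.NORMAL)
# ... run one decode (or prefill) iteration of the model here ...
captured_graph = ttnn.graph.end_graph_capture()

with open("llama8b-1L-op_graph.txt", "w") as f:
    f.write(json.dumps(captured_graph, default=str, indent=2))

ttnn.close_device(device)
```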
Ops Perf report
Generated with tracy, this is the ops perf report, which includes the memory configs and input shapes of the required ops.
Llama3-70B
Additionally, we'll want support for the Llama3-70B ops, which are mostly the same but with different input sizes.
In this section I'll list any new ops separately, and provide the ops perf report.
Additional ops:
Ops Perf report