There's a useless DQ node in matmul_model_quant_io.onnx.

I also have some questions:
1. The model has 2 inputs and 1 output, all with large data sizes, which means a huge IO cost for the NPU. You could try something different, e.g. make the 2nd input an initializer, or change the inputs to [1, 6, 256, 1500] * [1, 6, 1500, 256] so the output is [1, 6, 256, 256] (see the first sketch after this list).
2. In your benchmark script, the measured time includes the 1st inference run. Normally we would skip the 1st inference run as warmup (see the second sketch after this list).
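If it helps, here is a minimal sketch of turning the 2nd input into an initializer with the onnx Python API. The output file name, the input index, the shape, and the random weight values are illustrative assumptions, not taken from the actual model (which is quantized, so its real weight tensor may be int8 feeding the DQ node):

```python
import numpy as np
import onnx
from onnx import numpy_helper

model = onnx.load("matmul_model_quant_io.onnx")

# Assume the 2nd graph input is the weight-like tensor we want to freeze.
weight_input = model.graph.input[1]
weight_name = weight_input.name

# Bake a constant tensor into the graph as an initializer so the runtime
# no longer has to transfer it over the NPU's IO path on every run.
# Shape and values are placeholders; for a benchmark the values don't matter.
weights = np.random.rand(1, 6, 1500, 256).astype(np.float32)
model.graph.initializer.append(numpy_helper.from_array(weights, name=weight_name))

# Remove the now-redundant graph input; the initializer supplies it instead.
del model.graph.input[1]

onnx.save(model, "matmul_model_init.onnx")
```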
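And a minimal sketch of the warmup pattern with onnxruntime; the input names, shapes, and float32 dtype are assumptions for illustration (the quantized model's real inputs may differ):

```python
import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("matmul_model_quant_io.onnx")

# Placeholder input names and shapes; match them to the actual model.
feeds = {
    "A": np.random.rand(1, 6, 256, 1500).astype(np.float32),
    "B": np.random.rand(1, 6, 1500, 256).astype(np.float32),
}

# Warmup run: the first inference pays one-time costs (memory allocation,
# graph optimization, NPU compilation), so run it once outside the timer.
session.run(None, feeds)

# Timed runs then measure only steady-state latency.
n_runs = 100
start = time.perf_counter()
for _ in range(n_runs):
    session.run(None, feeds)
elapsed = time.perf_counter() - start
print(f"average latency: {elapsed / n_runs * 1e3:.2f} ms")
```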
This benchmark is designed to resemble some real-world models we depend on.
Regarding #2, Whisper (and most other models) doesn't run the same matrix multiplication over and over again. Instead, it runs a bunch of different (large) multiplications in a row. This tends to push weights out of cache, and as such I'd argue that cold-cache performance for a single layer's operations is, if anything, more important than warm-cache performance.
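To make the cache argument concrete, here's a rough NumPy-only sketch contrasting reuse of one weight matrix (warm cache) with rotating through many (cold cache); the matrix sizes, pool size, and run counts are made up for illustration:

```python
import time
import numpy as np

# One activation matrix, and a pool of weight matrices large enough in
# total (~144 MB of float32) to evict each other from the cache hierarchy.
x = np.random.rand(256, 1500).astype(np.float32)
weights = [np.random.rand(1500, 1500).astype(np.float32) for _ in range(16)]

def bench(ws, runs=64):
    start = time.perf_counter()
    for i in range(runs):
        _ = x @ ws[i % len(ws)]  # rotate through the given weight pool
    return (time.perf_counter() - start) / runs

warm = bench(weights[:1])  # same weights every run: stays cached
cold = bench(weights)      # different weights each run: cache misses
print(f"warm: {warm * 1e3:.2f} ms, cold: {cold * 1e3:.2f} ms")
```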
Do your real-world models have the same IO size? It doesn't make sense to extract just part of the model and test it separately; it makes more sense to test the full model instead.