Just found some interesting behavior when OPENBLAS_NUM_THREADS is not set at all. I expected the CPU version of Caffe to use OpenMP and multiple threads with the default build, but that doesn't seem to be the case, or the OpenMP strategy is not optimal. We just noticed that on a multi-core ARM-based machine running Ubuntu, the performance difference between leaving the env var "OPENBLAS_NUM_THREADS" unset and setting it to 1 can be 6x! Changing this var further, i.e. forcing different numbers of threads, improves performance further but not as dramatically: we can find an optimal value that gives about 2x more performance.
We need to understand the default behavior of this var in Caffe to decide how to add it to the CK workflow (i.e. should we explicitly add this var to meta.json?). I expected that, when the var is not defined, OpenBLAS would turn on an adaptive OpenMP strategy (i.e. dynamically select the number of threads depending on the current system load), but that doesn't seem to be the case.
In contrast, on multi-core x86_64 the performance difference between leaving this var unset and setting it to 1 is small (around 20%), and we can improve performance further, to around 50%, by autotuning the number of threads.
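To make this kind of comparison easier to repeat on other machines, here is a minimal sketch of a thread-count sweep. It is not part of CK or Caffe: the helper names (`build_env`, `time_command`, `sweep`, `best_setting`) are hypothetical, and it assumes the workload is an external command that picks up OPENBLAS_NUM_THREADS from its environment at startup. A setting of `None` means the var is left unset.

```python
import os
import subprocess
import time


def build_env(num_threads, base_env=None):
    """Copy the environment and set OPENBLAS_NUM_THREADS.

    num_threads=None removes the variable, reproducing the "not set at all" case.
    """
    env = dict(os.environ if base_env is None else base_env)
    if num_threads is None:
        env.pop("OPENBLAS_NUM_THREADS", None)
    else:
        env["OPENBLAS_NUM_THREADS"] = str(num_threads)
    return env


def time_command(cmd, num_threads, repeats=3):
    """Run cmd `repeats` times under the given setting; return the best wall time in seconds."""
    env = build_env(num_threads)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            env=env,
            check=True,  # fail loudly if the benchmark itself fails
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        best = min(best, time.perf_counter() - start)
    return best


def sweep(cmd, candidates, repeats=3):
    """Time cmd for every candidate setting; returns {setting: seconds}."""
    return {n: time_command(cmd, n, repeats) for n in candidates}


def best_setting(results):
    """Return the setting with the lowest measured time."""
    return min(results, key=results.get)
```

A typical call would pass the benchmark command as a list of arguments and use `[None] + list(range(1, (os.cpu_count() or 1) + 1))` as the candidates, so that the unset case is measured next to every explicit thread count. Wall time of the whole process includes startup and model loading, so for short runs the per-iteration timing reported by the benchmark itself is likely a better metric.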
Once we understand the behavior of this OpenBLAS parameter, we should add it to the CK workflow for crowd-tuning and reproducibility ...
Extra note: just found on Google that someone had similar issues:
So, in the future, we should add this parameter to our crowd-tuner and share the best results for different machines in cKnowledge.org/repo. I added it to our big ToDo list ;) ...
(Moving ticket from mlcommons/ck#76)