Error when running MLPerf inference with --device rocm = main.py: error: argument --device: invalid choice: 'rocm' (choose from 'cpu', 'cuda:0') #649

Open
altairBASIC opened this issue Jan 15, 2025 · 0 comments


I am running the MLPerf inference benchmark for the Llama2-70b-99 model on a cluster with 6 MI210 GPUs. Below is the command I am using with CM:

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r5.0-dev --model=llama2-70b-99 --implementation=reference --framework=pytorch --category=datacenter --scenario=Offline --execution_mode=test --device=rocm --quiet --test_query_count=10 --env.LLAMA2_CHECKPOINT_PATH=/home/intern01/Llama-2-70b-chat-hf

When I run the script with the `--device rocm` option, I get the error in the title: `rocm` is not recognized as a valid device, since the script only accepts `cpu` or `cuda:0`. Here is the full output:

```
CM script::benchmark-program/run.sh

Run Directory: /home/intern01/CM/repos/local/cache/12bee67ce1d840d4/inference/language/llama2-70b

CMD: /home/intern01/CM/repos/local/cache/def32291fe4247de/mlperf/bin/python3 main.py  --scenario Offline --dataset-path /home/intern01/CM/repos/local/cache/b4603ed8799641d8/open_orca/open_orca_gpt4_tokenized_llama.sampled_24576.pkl.gz --device rocm   --total-sample-count 10 --user-conf '/home/intern01/CM/repos/mlcommons@mlperf-automations/script/generate-mlperf-inference-user-conf/tmp/8b4fd7479b754685ab7d620e3a9af93e.conf' --output-log-dir /home/intern01/CM/repos/local/cache/0b04afd372744cef/test_results/gn005-reference-rocm-pytorch-v2.6.0.dev20241122-scc24-base/llama2-70b-99/offline/performance/run_1 --dtype float16 --model-path /home/intern01/Llama-2-70b-chat-hf 2>&1 | tee '/home/intern01/CM/repos/local/cache/0b04afd372744cef/test_results/gn005-reference-rocm-pytorch-v2.6.0.dev20241122-scc24-base/llama2-70b-99/offline/performance/run_1/console.out'; echo \${PIPESTATUS[0]} > exitstatus

INFO:root:         ! cd /home/intern01/CM/repos/local/cache/dd75d90466a24ac1
INFO:root:         ! call /home/intern01/CM/repos/mlcommons@mlperf-automations/script/benchmark-program/run.sh from tmp-run.sh

/home/intern01/CM/repos/local/cache/def32291fe4247de/mlperf/bin/python3 main.py  --scenario Offline --dataset-path /home/intern01/CM/repos/local/cache/b4603ed8799641d8/open_orca/open_orca_gpt4_tokenized_llama.sampled_24576.pkl.gz --device rocm   --total-sample-count 10 --user-conf '/home/intern01/CM/repos/mlcommons@mlperf-automations/script/generate-mlperf-inference-user-conf/tmp/8b4fd7479b754685ab7d620e3a9af93e.conf' --output-log-dir /home/intern01/CM/repos/local/cache/0b04afd372744cef/test_results/gn005-reference-rocm-pytorch-v2.6.0.dev20241122-scc24-base/llama2-70b-99/offline/performance/run_1 --dtype float16 --model-path /home/intern01/Llama-2-70b-chat-hf 2>&1 | tee '/home/intern01/CM/repos/local/cache/0b04afd372744cef/test_results/gn005-reference-rocm-pytorch-v2.6.0.dev20241122-scc24-base/llama2-70b-99/offline/performance/run_1/console.out'; echo ${PIPESTATUS[0]} > exitstatus
usage: main.py [-h] [--scenario {Offline,Server}] [--model-path MODEL_PATH]
               [--dataset-path DATASET_PATH] [--accuracy] [--dtype DTYPE]
               [--device {cpu,cuda:0}] [--audit-conf AUDIT_CONF]
               [--user-conf USER_CONF]
               [--total-sample-count TOTAL_SAMPLE_COUNT]
               [--batch-size BATCH_SIZE] [--output-log-dir OUTPUT_LOG_DIR]
               [--enable-log-trace] [--num-workers NUM_WORKERS] [--vllm]
               [--api-model-name API_MODEL_NAME] [--api-server API_SERVER]
main.py: error: argument --device: invalid choice: 'rocm' (choose from 'cpu', 'cuda:0')

CM error: Portable CM script failed (name = benchmark-program, return code = 512)
```

Could you please advise on how to enable or fix ROCm support for this benchmark? Thanks!
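For context, the failure comes from the `--device` argument parser in the reference `main.py`, whose `choices` list only contains `cpu` and `cuda:0`. As a possible local workaround (a sketch only, not an upstream fix; the exact argument definition in `main.py` may differ), one could extend the choices and translate `rocm` to the `cuda` device string, since ROCm builds of PyTorch expose AMD GPUs through the CUDA device namespace:

```python
import argparse

# Sketch of a local patch to the --device argument in main.py.
# The "rocm" choice and the translation below are assumptions, not
# upstream behavior: PyTorch built for ROCm addresses AMD GPUs via
# the "cuda" device string, so "rocm" can be mapped to "cuda:0".
parser = argparse.ArgumentParser()
parser.add_argument(
    "--device",
    type=str,
    default="cpu",
    choices=["cpu", "cuda:0", "rocm"],  # "rocm" added locally
    help="device to run the benchmark on",
)

args = parser.parse_args(["--device", "rocm"])
if args.device == "rocm":
    # Reuse the CUDA code path on ROCm-enabled PyTorch.
    args.device = "cuda:0"

print(args.device)  # -> cuda:0
```

Whether the rest of the reference implementation then works on MI210 GPUs depends on the installed PyTorch build actually being a ROCm build, which the `-rocm-pytorch-` tag in the results directory suggests is the case here.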
