Frontier: Increase PMPI_Init timeout, allow NVME and use Python 3.10. #335

Merged
joaander merged 5 commits into trunk from frontier-nvme on Oct 18, 2023


joaander
Member

For Frontier:

  • Increase PMPI_Init timeout (fails after 180 seconds by default).
  • Add scripts and documentation to support Python environments in node-local NVME storage.
  • Unrelated: Update to Python 3.10 as it is now available on Frontier.

When launching 64-node jobs with hoomd-validation on Frontier, Python takes ~500 seconds to import flow and other packages from /ccs/proj. This exceeds the 180-second PMPI_Init timeout and causes jobs to fail with:

Wed Oct 11 15:15:14 2023: [PE_1295]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=40, pes_this_node=56, timeout=180 secs
Wed Oct 11 15:15:14 2023: [PE_1318]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=40, pes_this_node=56, timeout=180 secs
Wed Oct 11 15:15:14 2023: [PE_1318]:_pmi_mmap_init:Failed to setup PMI mmap.
Wed Oct 11 15:15:14 2023: [PE_1318]:globals_init:_pmi_mmap_init returned -1
MPICH ERROR [Rank 0] [job id unknown] [Wed Oct 11 15:15:14 2023] [frontier03965] - Abort(1091855) (rank 0 in comm 0): Fatal error in PMPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(170):
MPID_Init(441).......:
MPIR_pmi_init(110)...: PMI_Init returned 1
...
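
For reference, a minimal sketch of what raising the timeout in a job script could look like. This description does not name the setting, so the variable below is an assumption: `PMI_MMAP_SYNC_WAIT_TIME` is understood to be the Cray PMI environment variable behind the `timeout=180 secs` message, and the value of 600 is only an illustrative choice that exceeds the ~500 second import time. Verify both against the Cray PMI documentation before relying on them.

```shell
# Assumption: PMI_MMAP_SYNC_WAIT_TIME is the Cray PMI variable that sets the
# bootstrap barrier timeout reported as "timeout=180 secs" in the log above.
# Neither the variable name nor the value is taken from this pull request.
# 600 seconds is an illustrative value larger than the ~500 s import time.
export PMI_MMAP_SYNC_WAIT_TIME=600
```

The variable would need to be exported before the MPI launch command so that every rank inherits it.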

Increasing the timeout allows jobs to run, but each job still wastes many minutes importing packages at startup. The new instructions direct the user to store the environment in a tar file on Orion and unpack it to NVME at the start of the job. This reduces the import time in hoomd-validation to ~20 seconds, including the time it takes to unpack the tar file (typically 1-2 seconds).
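
The unpack step can be sketched as a small shell function. This is not the script added by this pull request; the function name `unpack_env` and its arguments are hypothetical, and the tar file location and NVME mount point are left to the caller because they are site-specific.

```shell
#!/bin/bash
# Sketch only: unpack a tarred Python environment into node-local storage.
# The function name and arguments are hypothetical, not part of this PR.

# unpack_env TARFILE DEST
#   TARFILE: tar file holding the environment (e.g. stored on Orion)
#   DEST:    target directory on node-local storage (e.g. under the NVME mount)
# Returns non-zero if the tar file is missing or extraction fails.
unpack_env() {
    local tarfile="$1"
    local dest="$2"

    if [ ! -f "$tarfile" ]; then
        echo "unpack_env: tar file not found: $tarfile" >&2
        return 1
    fi

    # Create the destination if needed, then extract the archive into it.
    mkdir -p "$dest" || return 1
    tar -xf "$tarfile" -C "$dest"
}
```

In a multi-node job, a step like this would have to run once on every node before Python starts, for example with `srun --ntasks-per-node=1` invoking a script that contains it, since node-local storage is not shared between nodes.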

Using the tar file is optional: source /ccs/proj/.../environment.sh continues to function, but without the performance benefit.

@joaander joaander requested a review from a team October 18, 2023 15:49
@joaander joaander merged commit 76d44c8 into trunk Oct 18, 2023
7 checks passed
@joaander joaander deleted the frontier-nvme branch October 18, 2023 16:50