v1.30.0 - Cloud HPC Toolkit A3 VM + NeMo Framework Solution
What's Changed
Key New Features 🎉
- Introduction of the Cloud HPC Toolkit A3 VM family blueprint, featuring:
  - A Slurm cluster composed of A3 VMs, each with 8 NVIDIA H100 GPUs
  - An example for running the NVIDIA NeMo framework
  - An example for running the common nccl-tests benchmark
Module Improvements 🔨
- Add support for startup script per nodeset by @mr0re1 in #2296
- Allocate IP for vm-instance by @lemaitre-aneo in #2219
- Add random hex to batch job id and specify id when submitting by @nick-stroud in #2259
- HTCondor: support user-managed secret replication by @tpdownes in #2340
- Support setting system default web proxy in Windows startup by @tpdownes in #2351
Improvements 🛠
- Add TPU v4 blueprint and tutorial to demonstrate running TPU workload by @harshthakkar01 in #2287
- Update parameters for TPU nodeset module, add precondition checks, and bump TPU to v3 by @harshthakkar01 in #2293
- Add Slurm v6 version for image builder blueprint by @harshthakkar01 in #2297
- Allow `ghpc deploy blueprint.yaml` by @mr0re1 in #2323
- Slurm GCP version update; will cooldown before deleting orphan nodes by @nick-stroud in #2322
- Add SlurmGCP v6 example of slurm compatible with startup scripts and integration test by @harshthakkar01 in #2346
Version Updates ⏫
Bug fixes 🐞
- Added enable_devel for packer build to fix issue with bp by @cdunbar13 in #2334
New Contributors
- @lemaitre-aneo made their first contribution in #2219
Full Changelog: v1.29.0...v1.30.0