
Releases: GoogleCloudPlatform/cluster-toolkit

v1.44.2: Fix for Slurm autoscaler support for future reservations

09 Jan 00:21
484da6e

What's Changed

Bug fixes 🐞

  • Hotfix: Slurm autoscaler support for future reservations by @tpdownes in #3508

Full Changelog: v1.44.1...v1.44.2

Release v1.44.1: Support for a3-ultragpu-8g VMs and GKE, Slurm clusters

30 Dec 23:36
346d015

Release notes v1.44.1

This release adds Toolkit support for the new A3 Ultra machine type from Google Cloud. The machine type includes 8 NVIDIA H200 GPUs, each with dedicated CX-7 networking that supports RDMA via RoCE.

The release includes 4 blueprints that maximize performance for the machine type:

  1. A simple Slurm blueprint provisioning A3 Ultra compute nodes with a shared Filestore /home
  2. A GKE blueprint that provisions an A3 Ultra compute node pool
  3. An advanced Slurm blueprint that additionally mounts a GCS bucket with performance-optimized caching settings for I/O and checkpointing
  4. A blueprint that provisions A3 Ultra compute nodes as VM instances (no scheduler) with RDMA networking

Example solutions using NCCL are provided for blueprints running under a scheduler.

v1.44.0: Future Reservations in Slurm, Topology Aware GKE, Expanded GPU RDMA Support

19 Dec 22:55
6a19416

What's Changed

Key New Features 🎉

Module Improvements 🔨

Improvements 🛠

Version Updates ⏫

Bug fixes 🐞

Full Changelog: v1.43.1...v1.44.0

v1.43.1: Patch version bump in OFE

12 Dec 20:02
0a8385b

What's Changed

Version Updates ⏫

  • Bump django from 4.2.16 to 4.2.17 in /community/front-end/ofe by @dependabot in #3358

Full Changelog: v1.43.0...v1.43.1

v1.43.0: GKE and networking enhancements

05 Dec 06:57
7ca11fc

What's Changed

Key New Features 🎉

Module Improvements 🔨

Improvements 🛠

Bug fixes 🐞

  • Revert "update a3 machines local ssd to use nvme instead of scsi for better performance" by @chengcongdu in #3272
  • remove GKE reservation validation for local ssd NVMe/SCSI interface by @chengcongdu in #3281

New Contributors

Full Changelog: v1.42.0...v1.43.0

v1.42.0: Filestore deletion protection, GCP maintenance as Slurm job, Docker daemon configuration

20 Nov 19:27
1a1e22a

What's Changed

Key New Features 🎉

Module Improvements 🔨

Improvements 🛠

Deprecations 💤

Version Updates ⏫

Bug fixes 🐞

  • Refactor mount/mode setting for local SSD RAID by @tpdownes in #3214
  • Fix a bug where try was hiding extraction of gpu driver version by @ankitkinra in #3257
  • Fix the gpu_installation_config default for case where no customer input by @ankitkinra in #3259
  • SlurmGCP. Fix bug that prevents resourcePolicies clean up. by @mr0re1 in #3266

New Contributors

Full Changelog: v1.41.0...v1.42.0

v1.41.0: Adoption of Slurm 24.05 and Improvements to GKE Support

25 Oct 16:58
26fafe0

What's Changed

Key New Features 🎉

New Modules 🧱

Module Improvements 🔨

Improvements 🛠

  • Create and use non-default service accounts in GKE by @annuay-google in #3123
  • Added documentation on cloud-ops-agent installation and stackdriver removal by @jrossthomson in #3029
  • Ensure local SSD filesystem is assembled into a RAID even upon power off/on cycles by @tpdownes in #3129

Deprecations 💤

Version Updates ⏫

Bug fixes 🐞

  • Fixed the exact number constraint problem for additional vpcs in gpu_direct checks by @sharabiani in #3078
  • Provide explicit project information by @wiktorn in #3060
  • Chrome Remote Desktop: increase resilience of apt operations by @tpdownes in #3093
  • Add mount parallelstore service to mount parallelstore for every reboot by @harshthakkar01 in #3125

New Contributors

Full Changelog: v1.40.1...v1.41.0

v1.40.1: Fix issue that affected GKE blueprints due to dynamic provisioning

10 Oct 01:20
eb00254

What's Changed

Other changes

  • Revert PR#3046 and add more line breaks for readability by @ankitkinra in #3115

Full Changelog: v1.40.0...v1.40.1

v1.40.0: A3 Mega and A3 High families supported in GKE

03 Oct 21:13
f9f9256

What's Changed

Important

All HPC VM images based on CentOS 7 have been deprecated. As a result,
referring to the "hpc-centos-7" family in the "cloud-hpc-image-public"
project will fail. We recommend migrating to the "hpc-rocky-linux-8" family,
which is now the default throughout the Toolkit. If CentOS 7 is truly needed,
the final HPC CentOS 7 image can be referenced by name: "hpc-centos-7-v20240712".
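
As an illustration of the migration described above, here is a minimal sketch of a blueprint fragment. It assumes a module that accepts an instance_image setting with family, name, and project keys; the module id and source path are hypothetical placeholders, so check the documentation of the module you actually use before copying it.

```yaml
# Hypothetical blueprint fragment: the module id and source are placeholders,
# and the instance_image keys are an assumption about the module's inputs.
deployment_groups:
- group: primary
  modules:
  - id: example_vm                      # hypothetical module id
    source: modules/compute/vm-instance # assumed module path
    settings:
      instance_image:
        # Recommended: the Rocky Linux 8 HPC image family.
        family: hpc-rocky-linux-8
        project: cloud-hpc-image-public
        # Only if CentOS 7 is truly needed, reference the final image by
        # name instead of by family (remove the family line above):
        # name: hpc-centos-7-v20240712
```

Referencing the image by name rather than by family pins the deployment to that one image, so it will not pick up any later images.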

Key New Features 🎉

New Modules 🧱

Module Improvements 🔨

Improvements 🛠

Deprecations 💤

Version Updates ⏫

Bug fixes 🐞

Other changes

  • NeMo readme instructions for preloading gpt2 tokenizer by @koallison in #3075

New Contributors

Full Changelog: v1.39.0...v1.40.0

v1.39.0: Slurm reservations during maintenance windows, Improved GKE Support, removed CentOS 7 references

12 Sep 19:38
7699f5d

What's Changed

Key New Features 🎉

Module Improvements 🔨

Improvements 🛠

Bug fixes 🐞

  • Add slurmgcp-managed infix to resource policy name by @mr0re1 in #2892
  • Move pytest and other package installation to make by @annuay-google in #2890
  • Prevent use of google provider 6.0 where breaking changes are in use by @tpdownes in #2978
  • Fix local_ssd_config issue that forces node-pool recreation by @sharabiani in #2968
  • kubernetes provider added to gke-cluster module by @sharabiani in #2985
  • Fix for cleanup script. The last input is optional by @cdunbar13 in #2993
  • Catch "None" fields in slurm job datetime data for BigQuery by @fdmalone in #2992

Other changes

New Contributors

Full Changelog: v1.38.0...v1.39.0