Releases: GoogleCloudPlatform/cluster-toolkit
v1.44.2: Fix for Slurm autoscaler support for future reservations
Release v1.44.1: Support for a3-ultragpu-8g VMs and GKE, Slurm clusters
Release notes v1.44.1
This release announces Toolkit support for the new A3 Ultra machine type from Google Cloud. This machine type includes 8 NVIDIA H200 GPUs, each with dedicated CX-7 networking that supports RDMA via RoCE.
The release includes 4 blueprints that maximize performance for the machine type:
- A simple Slurm blueprint provisioning A3 Ultra compute nodes with a shared Filestore /home
- A GKE blueprint that provisions an A3 Ultra compute node pool
- An advanced Slurm blueprint that additionally mounts a GCS bucket with performance-optimized caching settings for I/O and checkpointing
- A blueprint that provisions A3 Ultra compute nodes as VM instances (no scheduler) with RDMA networking
Example solutions using NCCL are provided for blueprints running under a scheduler.
v1.44.0: Future Reservations in Slurm, Topology Aware GKE, Expanded GPU RDMA Support
What's Changed
Key New Features 🎉
- update terraform provider to 6.12.0 by @ighosh98 in #3356
- Add future reservation support by @abbas1902 in #3227
- Update terraform provider to 6.13.0 by @alyssa-sm in #3367
- GKE clusters can now be duplicated by changing only the deployment name by @annuay-google in #3322
- GPU-VPC module by @cdunbar13 in #3391
Module Improvements 🔨
- Add dynamic setup of gpu_limit in gke-job-template module by @mohitchaurasia91 in #3319
- Add reservations to vm-instance by @cdunbar13 in #3327
- Revert "integrate tas plugin bug fixes" by @ighosh98 in #3344
- Add change to conditionally perform pip install based on gke_node_pool machine_type by @mohitchaurasia91 in #3341
- Update gke-cluster module addon-config for enabling parallelstore csi driver by @mohitchaurasia91 in #3357
- make upgrade settings configurable by @ighosh98 in #3359
- gke v1.31 added to acceptable list for a3-mega by @sharabiani in #3388
- An option added to disable/enable workload script execution by @sharabiani in #3389
- Adding network_profile to the VPC modules by @cdunbar13 in #3387
- TopologyAwareScheduling enabled by default for Kueue v0.9.1 by @sharabiani in #3396
Improvements 🛠
- Implemented kueue tests by @ighosh98 in #3315
- add support for kueue v0.9.1 by @ighosh98 in #3321
- Bump jobset version to 0.7.1 by @annuay-google in #3318
- Update custom TAS scripts to support A3U by @ighosh98 in #3295
- integrate tas plugin bug fixes by @ighosh98 in #3339
- Add multi-mount parallelstore support by @harshthakkar01 in #3256
- Update image-builder.yaml link in README by @ighosh98 in #3373
- Update OpenFOAM tutorial by @wkharold in #3342
- [Cherry Pick] add reservations for kueue integration tests by @ighosh98 in #3431
- [Cherry Pick] Update README and related test setup for GKE managed parallelstore blueprint by @mohitchaurasia91 in #3437
Version Updates ⏫
- Promote the new nic-types in vm-instance by @cdunbar13 in #3288
Bug fixes 🐞
- Fix default ssd config to ephemeral storage by @ankitkinra in #3317
- Set 'enable_private_endpoint' to false by @pawloch00 in #3364
Full Changelog: v1.43.1...v1.44.0
v1.43.1: Patch version bump in OFE
What's Changed
Version Updates ⏫
- Bump django from 4.2.16 to 4.2.17 in /community/front-end/ofe by @dependabot in #3358
Full Changelog: v1.43.0...v1.43.1
v1.43.0: GKE and networking enhancements
What's Changed
Key New Features 🎉
- add support for kueue v0.9.0 to enable Topology Aware Scheduling by @ighosh98 in #3277
- RDMA networking and multi-zone cluster support for GKE A3-Ultra by @annuay-google in #3299
- Improved reservations support for GKE A3-Ultra by @annuay-google in #3298
Module Improvements 🔨
- Add cloud rdma drivers into startup script module by @abbas1902 in #3289
- add support for enable DCGM monitoring in GKE by @chengcongdu in #3279
- add GKE support for node local dns by @chengcongdu in #3280
- Update topology-scheduler-scripts.yaml by @thisSIDEofRANDOM in #3286
Improvements 🛠
- add firewall to allow tcp traffic for parallelstore by @chengcongdu in #3262
- Allow specifying GKE's system node pool disk properties by @ankitkinra in #3268
- Adds Cluster Toolkit Dockerfile for backend integration with XPK by @RachaelSTamakloe in #3237
- Add cluster and hostname as cloud ops labels by @abbas1902 in #3163
- Print job detail for gke-storage-parallelstore integration test by @mohitchaurasia91 in #3264
- Add max_pods_per_node for GKE cluster and nodepool by @pawloch00 in #3197
- Fix image-building.md link by @jemish-google in #3287
- Updating Cluster Toolkit Dockerfile README by @RachaelSTamakloe in #3290
- Updating Go version in Cluster Toolkit Dockerfile by @RachaelSTamakloe in #3301
- Add Integration Test to Cluster Toolkit Dockerfile by @RachaelSTamakloe in #3302
- XPK blueprint updates to make it more compatible with the tool by @ankitkinra in #3192
- expanding subnetwork_cidr_suffix by @ighosh98 in #3304
- Updating Slurm-GCP to 6.8.6 by @cdunbar13 in #3336
Bug fixes 🐞
- Revert "update a3 machines local ssd to use nvme instead of scsi for better performance" by @chengcongdu in #3272
- remove GKE reservation validation for local ssd NVMe/SCSI interface by @chengcongdu in #3281
New Contributors
- @pawloch00 made their first contribution in #3197
- @thisSIDEofRANDOM made their first contribution in #3286
- @jemish-google made their first contribution in #3287
Full Changelog: v1.42.0...v1.43.0
v1.42.0: Filestore deletion protection, GCP maintenance as Slurm job, Docker daemon configuration
What's Changed
Key New Features 🎉
- Add support for custom Docker daemon configuration by @tpdownes in #3201
- Adopt local SSD storage for A3 docker images by @tpdownes in #3206
- Adopt google Terraform plugin v6.10.0 and drop support for 5.x by @tpdownes in #3189
- Add support to perform GCP maintenance as slurm job by @harshthakkar01 in #3152
- Add support for Filestore deletion protection by @tpdownes in #3183
Module Improvements 🔨
- Updating notebook module to use workbench_instance by @jrossthomson in #3139
- Initial commit for new logging output by @cdunbar13 in #3150
- SlurmGCP. "All or nothing" bulk insert on requests with placements by @mr0re1 in #3157
- Remove redundant provisioner for printing image name by @cdunbar13 in #3151
- Add direct Terraform support for Slurm SchedulerParameters and PrivateData by @tpdownes in #3164
- Add `use_job_duration` option by @abbas1902 in #3142
- Improvements for CloudSQL by @wiktorn in #3147
- Improve Error Message with Reservation Validation by @arajmane-g in #3174
Improvements 🛠
- Use local paths to embedded modules throughout Toolkit by @tpdownes in #3102
- Update default value for subnetwork_project to null by @alyssa-sm in #3193
- Gke update default taints for user node pools by @ankitkinra in #3200
- Update MTU for a3 mega for GKE based on best practices by @ankitkinra in #3175
- add training example for gke parallelstore blueprint by @chengcongdu in #3181
- Update maintenance.py to support additional format by @alyssa-sm in #3208
- Allow latest Terraform google plugin by @tpdownes in #3213
- update a3 machines local ssd to use nvme instead of scsi for better performance by @chengcongdu in #3232
- Improve fetching and caching job details by @harshthakkar01 in #3194
- SlurmGCP. Add `set -e` to prolog mux by @mr0re1 in #3215
- add gpu health check in prolog and epilog by @NinaCai in #3134
Deprecations 💤
- Delete the new-project module to support adoption of TPG v6 by @RachaelSTamakloe in #3171
- Delete Daos Example Blueprints to support adoption of TPG v6 by @RachaelSTamakloe in #3172
Version Updates ⏫
- Bump integration test to support Go 1.23 by @mohitchaurasia91 in #3154
- Bump go version 1.21 -> 1.22 by @mohitchaurasia91 in #3156
- Update bucket module within Slurm controller module by @tpdownes in #3161
- update vm-instance module to support TPG v6 by @RachaelSTamakloe in #3166
- Update IP address module within VPC module by @tpdownes in #3160
- update Batch module to be compatible with TPG v6 by @RachaelSTamakloe in #3187
- update HTCondor modules to be compatible with TPG v6 by @RachaelSTamakloe in #3186
- Update Slurm-GCP v5 to 5.12.1 by @tpdownes in #3185
- Update workload-identity submodule from v29 to v34 by @RachaelSTamakloe in #3196
- Update ml-slurm examples to use recent copies of pytorch and tensorflow by @tpdownes in #3226
- Make gke-node-pool compatible with TPG 6.x by @tpdownes in #3230
Bug fixes 🐞
- Refactor mount/mode setting for local SSD RAID by @tpdownes in #3214
- Fix a bug where try was hiding extraction of gpu driver version by @ankitkinra in #3257
- Fix the gpu_installation_config default for case where no customer input by @ankitkinra in #3259
- SlurmGCP. Fix bug that prevents resourcePolicies clean up. by @mr0re1 in #3266
New Contributors
- @linsword13 made their first contribution in #3211
- @NinaCai made their first contribution in #3134
Full Changelog: v1.41.0...v1.42.0
v1.41.0: Adoption of Slurm 24.05 and Improvements to GKE Support
What's Changed
Key New Features 🎉
New Modules 🧱
- resource-policy module implemented by @sharabiani in #3066
- gke-topology-scheduler module implemented by @sharabiani in #3080
- add GKE support for parallelstore through gke-storage module by @chengcongdu in #3120
Module Improvements 🔨
- Added compatibility check for GPUDirect and GKE version by @sharabiani in #3079
- Support template file for kueue configuration in kubectl-apply module by @sharabiani in #3111
- Implement xpk-gke-a3-megagpu blueprint by @sharabiani in #3108
- Use sackd for the login nodes by @mr0re1 in #3126
- gke-node-pool default name conflict fixed by @sharabiani in #3127
- improve dws_flex ux by @abbas1902 in #3122
- Include deployment name in Spack and Ramble bucket names (like startup-script) by @rohitramu in #3136
Improvements 🛠
- Create and use non-default service accounts in GKE by @annuay-google in #3123
- Added documentation on cloud-ops-agent installation and stackdriver removal by @jrossthomson in #3029
- Ensure local SSD filesystem is assembled into a RAID even upon power off/on cycles by @tpdownes in #3129
Deprecations 💤
- Freeze slurm-gcp v5 hybrid blueprints with the latest cluster toolkit version support by @harshthakkar01 in #3117
- Update Slurm-gcp v5 deprecation details by @harshthakkar01 in #3118
- Update badge for slurm-gcp v5 and slurm-gcp v6 by @harshthakkar01 in #3116
Version Updates ⏫
- Update A3-High NeMo to 24.07 and NCCL solution to latest recommended values by @akiki-liang0 in #3130
- Update Slurm-GCP to 6.8.2 by @tpdownes in #3132
Bug fixes 🐞
- Fixed the exact number constraint problem for additional vpcs in gpu_direct checks by @sharabiani in #3078
- Provide explicit project information by @wiktorn in #3060
- Chrome Remote Desktop: increase resilience of apt operations by @tpdownes in #3093
- Add mount parallelstore service to mount parallelstore for every reboot by @harshthakkar01 in #3125
New Contributors
- @akiki-liang0 made their first contribution in #3130
- @ighosh98 made their first contribution in #3124
Full Changelog: v1.40.1...v1.41.0
v1.40.1: Fix issue that affected GKE blueprints due to dynamic provisioning
What's Changed
Other changes
- Revert PR#3046 and add more line breaks for readability by @ankitkinra in #3115
Full Changelog: v1.40.0...v1.40.1
v1.40.0: A3 Mega and A3 High families supported in GKE
What's Changed
Important
All HPC VM images based upon CentOS 7 have been deprecated. This means that
referring to the "hpc-centos-7" family in the "cloud-hpc-image-public"
project will fail. We recommend migrating to the "hpc-rocky-linux-8" family
that is the new default throughout the Toolkit. If CentOS 7 is truly needed,
the final HPC CentOS 7 image can be used by its name: "hpc-centos-7-v20240712".
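The migration described above can be sketched as a small check over an image reference. This is an illustrative helper only, not part of the Toolkit: the function name `migrate_instance_image` and the dictionary shape (keys `project`, `family`, `name`) are assumptions made for the example. The family, project, and image names are the ones quoted in the notice.

```python
# Hypothetical helper, not part of the Toolkit. The dict shape
# (project / family / name keys) is an assumption for illustration.

IMAGE_PROJECT = "cloud-hpc-image-public"
DEPRECATED_FAMILY = "hpc-centos-7"
RECOMMENDED_FAMILY = "hpc-rocky-linux-8"
FINAL_CENTOS7_IMAGE = "hpc-centos-7-v20240712"


def migrate_instance_image(image: dict, keep_centos7: bool = False) -> dict:
    """Return a copy of an image reference that avoids the deprecated family.

    References that do not point at the deprecated CentOS 7 family are
    returned unchanged (as a copy).
    """
    uses_deprecated_family = (
        image.get("project") == IMAGE_PROJECT
        and image.get("family") == DEPRECATED_FAMILY
    )
    if not uses_deprecated_family:
        return dict(image)
    if keep_centos7:
        # Pin the final CentOS 7 image by name instead of by family.
        return {"project": IMAGE_PROJECT, "name": FINAL_CENTOS7_IMAGE}
    # Default: move to the recommended Rocky Linux 8 family.
    return {"project": IMAGE_PROJECT, "family": RECOMMENDED_FAMILY}


if __name__ == "__main__":
    old = {"project": IMAGE_PROJECT, "family": DEPRECATED_FAMILY}
    print(migrate_instance_image(old))
    print(migrate_instance_image(old, keep_centos7=True))
```

Pinning by name is only meant for the case where CentOS 7 is truly needed; the default path follows the recommended family.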
Key New Features 🎉
- GKE A3 High blueprint and GKE A3 Mega blueprint with automated GPU networking performance enhancements
- Add enable-maintenance-reservation flag in slurm to control reservation for scheduled maintenance by @harshthakkar01 in #2987
- adding documentation for versioned blueprint feature by @RachaelSTamakloe in #3055
- adding unit test for version blueprint caching mechanism by @RachaelSTamakloe in #3052
New Modules 🧱
- implement kubectl-apply module by @sharabiani in #2980
Module Improvements 🔨
- Default to zonal bulkInsert by @mr0re1 in #3005
- Add machine type availability checks by @annuay-google in #3003
- add support for enabling tcpx/o in a3 and a3mega vm, provide script for injecting rxdm sidecar and other required components into user workload by @chengcongdu in #3012
- support ghpc_stage function in kubectl-apply module by @sharabiani in #3036
- Validate Reservations in GKE Blueprints by @arajmane-g in #3024
- Fix multivpc missing region by @wiktorn in #3046
- Add initial_node_count support to gke-node-pool by @sharabiani in #3068
Improvements 🛠
- Update gVNIC driver in A3 Mega solution by @tpdownes in #2957
- Implement udev-based approach to mounting aperture devices by @tpdownes in #2955
- Update Debian 12 image in A3 Mega solution by @tpdownes in #2958
- adding module cache to prevent repeated module downloads during modul… by @RachaelSTamakloe in #3010
- add additional vpc validation for a3/a3mega machine by @chengcongdu in #3049
- Adds option to allow Kueue/Jobset to be installed on a GKE cluster via blueprints by @ankitkinra in #3017
- update readme for gpudirect by @chengcongdu in #3059
Deprecations 💤
- SlurmGCP V6. Remove CentOS7 image support. by @mr0re1 in #3038
- removing deprecated spack setup variables by @RachaelSTamakloe in #3040
- removing deprecated ramble setup variables by @RachaelSTamakloe in #3041
Version Updates ⏫
- Update NeMo 23.11 to 24.07 by @akiki-liang0 in #3090
Bug fixes 🐞
- Retry mounting daos container by @harshthakkar01 in #3045
- add argparse dependency to cloud build by @chengcongdu in #3057
- Allow users to provide a commit hash instead of git tag for Spack and Ramble installations by @rohitramu in #3073
- resolving error when var.initial_node_count is null by @RachaelSTamakloe in #3081
- A3 High blueprint prolog solution updates by @tpdownes in #3088
Other changes
- NeMo readme instructions for preloading gpt2 tokenizer by @koallison in #3075
New Contributors
- @koallison made their first contribution in #3075
- @akiki-liang0 made their first contribution in #3090
Full Changelog: v1.39.0...v1.40.0
v1.39.0: Slurm reservations during maintenance windows, Improved GKE Support, removed CentOS 7 references
What's Changed
Key New Features 🎉
- Add reservation support in slurm sync for scheduled maintenance by @harshthakkar01 in #2880
- Support multivpc with GKE by @sharabiani in #2797
- adding optional fields to redirect use of embedded modules to pull fr… by @RachaelSTamakloe in #2945
Module Improvements 🔨
- Make CloudSQL secret replication configurable by @dgouju in #2828
- GKE Blueprints to support reservations by @arajmane-g in #2891
- Expose maintenance interval as a blueprint setting for node pools in GKE by @annuay-google in #2971
- Support named placements in GKE node pools by @arajmane-g in #2969
- Add machine type availability checks to slurm-gcp-v6-nodeset by @annuay-google in #2962
- Revisit the Reservation Interface for GKE Blueprints by @arajmane-g in #2997
Improvements 🛠
- Add `sort_nodes.py` by @mr0re1 in #2853
- replacing centos7 with rocky8 in vm-instance modules by @RachaelSTamakloe in #2900
- replacing centos7 with rocky8 in nfs-server modules by @RachaelSTamakloe in #2901
- replacing centos7 with rocky8 in packer modules by @RachaelSTamakloe in #2899
- Update batch image to hpc-rocky-linux-8 by @ankitkinra in #2884
- OFE - various updates and fixes by @scott-nag in #2921
- Don't set `automaticRestart: false` by @mr0re1 in #2981
Bug fixes 🐞
- Add `slurmgcp-managed` infix to resource policy name by @mr0re1 in #2892
- Move pytest and other package installation to make by @annuay-google in #2890
- Prevent use of google provider 6.0 where breaking changes are in use by @tpdownes in #2978
- Fix local_ssd_config issue that forces node-pool recreation by @sharabiani in #2968
- kubernetes provider added to gke-cluster module by @sharabiani in #2985
- Fix for cleanup script. The last input is optional by @cdunbar13 in #2993
- Catch "None" fields in slurm job datetime data for BigQuery by @fdmalone in #2992
Other changes
- Use local-ssd for enroot temp space. by @samskillman in #3011
New Contributors
- @scott-nag made their first contribution in #2921
- @abbas1902 made their first contribution in #2956
- @fdmalone made their first contribution in #2992
Full Changelog: v1.38.0...v1.39.0