This repository has been archived by the owner on May 14, 2024. It is now read-only.

Deprecation notice

As of May 14, 2024, we have deprecated this solution in favor of github.com/crusoecloud/slurm. This solution will remain available in a read-only mode; however, we are not able to provide ongoing maintenance for this solution.

Create an Autoscaling-enabled SLURM Cluster on Crusoe

This is a reference implementation of SLURM on Crusoe Cloud. It supports multiple partitions and specific node groups within those partitions. The cluster also includes an autoscaler that provisions instances on Crusoe based on demand. The Terraform script main.tf is the main entry point: it provisions only the headnode, which then uses the SLURM Power Plugin to start additional compute nodes based on the jobs submitted to it.
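To make the autoscaling idea concrete, here is a minimal sketch of the kind of slurm.conf settings involved, assuming the "SLURM Power Plugin" refers to Slurm's power saving mechanism. The option names (ResumeProgram, SuspendProgram, SuspendTime, ResumeTimeout, State=CLOUD) are standard slurm.conf parameters; every path, node name, partition name, and timing value is a placeholder and is not taken from this repository.

```shell
#!/usr/bin/env bash
# Print a hypothetical slurm.conf fragment for power-saving based autoscaling.
# All paths, node names, and values are placeholders for illustration only.
print_power_save_fragment() {
  cat <<'EOF'
# Programs slurmctld calls to create and destroy cloud instances (placeholder paths)
ResumeProgram=/nfs/slurm/bin/resume.sh
SuspendProgram=/nfs/slurm/bin/suspend.sh
# Seconds a node may sit idle before it is suspended
SuspendTime=600
# Seconds allowed for a resumed node to boot and register
ResumeTimeout=900
# Cloud nodes are defined up front but only exist while resumed
NodeName=gpu-[1-4] State=CLOUD
PartitionName=gpu Nodes=gpu-[1-4] Default=YES State=UP
EOF
}

print_power_save_fragment
```

With settings of this shape, a job submitted to the partition causes slurmctld to call the resume program for the nodes it needs, and idle nodes are handed to the suspend program after SuspendTime seconds.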

Description of the Architecture

The Terraform script provisions only a headnode. The headnode-bootstrap.sh script then performs the following:

  1. Scans for ephemeral drives. If more than one drive is found, they are combined as RAID0 and mounted at /raid0; on instances with a single local NVMe ephemeral drive, that drive is mounted at /nvme. The scratch directory is placed inside whichever path is used.
  2. Sets up an NFS server at /nfs/slurm, which provides the SLURM binaries, libraries, and helper code to the ephemeral compute nodes.
  3. Downloads and installs the SLURM source tree. The SLURM version is pinned by the bootstrap script to ensure it is supported on Crusoe. Changing the version in the repo is NOT supported unless it has been validated by Crusoe.
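The drive-handling rule in step 1 can be sketched as follows. This is not the code from headnode-bootstrap.sh; it is an illustration that only prints the commands it would run. The device names, the /dev/md0 array name, and the ext4 filesystem are assumptions.

```shell
#!/usr/bin/env bash
# Pick the mount point from the number of ephemeral NVMe drives.
choose_mount_point() {
  local drive_count="$1"
  if [ "$drive_count" -gt 1 ]; then
    echo "/raid0"
  elif [ "$drive_count" -eq 1 ]; then
    echo "/nvme"
  else
    echo "no ephemeral drives found" >&2
    return 1
  fi
}

# Print (do not run) the commands that would prepare the given devices.
plan_ephemeral_storage() {
  local mount_point
  mount_point="$(choose_mount_point "$#")" || return 1
  local target="$1"
  if [ "$#" -gt 1 ]; then
    target="/dev/md0"   # assumed array name
    echo "mdadm --create $target --level=0 --raid-devices=$# $*"
  fi
  echo "mkfs.ext4 $target"   # filesystem type is an assumption
  echo "mkdir -p $mount_point"
  echo "mount $target $mount_point"
}

# Illustrative device names
plan_ephemeral_storage /dev/nvme0n1 /dev/nvme1n1
```

Keeping the decision in a small function that takes only the drive count makes the rule easy to check without touching real disks.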

Support for NVIDIA Enroot/Pyxis

The deployment includes support for enroot and Pyxis, which are purpose-built for native container orchestration within SLURM and allow container images to run across the cluster. All enroot images are stored in the /scratch directory of each node in the cluster. To add credentials for container registries, edit the $HOME/enroot/.credentials file.
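As a rough illustration of editing the credentials file, the helper below appends one entry in the netrc-style format that enroot reads (machine, login, password). The function name is hypothetical, and the registry and token shown in the commented example are placeholders.

```shell
#!/usr/bin/env bash
# Append a netrc-style entry to an enroot credentials file.
# Usage: add_enroot_credential <file> <registry> <login> <password>
add_enroot_credential() {
  local file="$1" registry="$2" login="$3" password="$4"
  mkdir -p "$(dirname "$file")"
  touch "$file"
  chmod 600 "$file"   # the file holds secrets, so keep it private
  printf 'machine %s login %s password %s\n' \
    "$registry" "$login" "$password" >> "$file"
}

# Example with placeholder values (left commented out):
# add_enroot_credential "$HOME/enroot/.credentials" nvcr.io '$oauthtoken' "<NGC_API_KEY>"
# srun --container-image=nvcr.io#nvidia/pytorch:<TAG> nvidia-smi
```

The commented srun line uses the Pyxis --container-image option with enroot's REGISTRY#IMAGE:TAG image syntax; the image name and tag are left as placeholders.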

Monitoring

The headnode hosts a Telegraf-Prometheus-Grafana (TPG) stack. Each worker runs Telegraf and exposes a /metrics endpoint, which Prometheus on the headnode scrapes.

[Images: heatmap, metrics]
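A quick way to confirm that a worker is exposing metrics is to request its endpoint directly. The sketch below assumes Telegraf's default port for its prometheus_client output (9273); this deployment may be configured with a different port, and the hostname is hypothetical.

```shell
#!/usr/bin/env bash
# Build the metrics URL for a worker. Port 9273 is Telegraf's default for the
# prometheus_client output; the actual port in this deployment is an assumption.
metrics_url() {
  local host="$1" port="${2:-9273}"
  echo "http://${host}:${port}/metrics"
}

# Succeed only if the worker answers on its metrics endpoint within 5 seconds.
check_worker_metrics() {
  curl --silent --fail --max-time 5 "$(metrics_url "$@")" > /dev/null
}

metrics_url worker-1   # hypothetical hostname
```

Running check_worker_metrics from the headnode against each worker is a simple first check when a node is missing from the Grafana dashboards.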

Deployment

Step 1. Install Terraform. On the client machine from which you will deploy the headnode of the cluster, install Terraform following the instructions here.

Step 2. Install the Crusoe Cloud CLI. Install the Crusoe Cloud CLI following these instructions, then set up authentication by creating SSH keys and API tokens.
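Before continuing, it can help to confirm that the tools from Steps 1 and 2 are on your PATH. The helper below is a small sketch; the binary name crusoe for the Crusoe Cloud CLI is an assumption, so adjust it if your installation differs.

```shell
#!/usr/bin/env bash
# Succeed if the named command is available on PATH.
require_cmd() {
  command -v "$1" > /dev/null 2>&1 || {
    echo "missing required command: $1" >&2
    return 1
  }
}

# Check several commands and report every missing one before failing.
preflight() {
  local missing=0 cmd
  for cmd in "$@"; do
    require_cmd "$cmd" || missing=1
  done
  return "$missing"
}

# Example (binary names are assumptions):
# preflight terraform crusoe ssh-keygen
```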

Step 3. Clone the repo and create a variables.tf file

git clone https://github.com/crusoecloud/crusoe-hpc-slurm.git
cd crusoe-hpc-slurm

Your variables.tf should contain the following:

variable "access_key" {
  description = "Crusoe API Access Key"
  type        = string
  default     = "<ACCESS_KEY>"
}
variable "secret_key" {
  description = "Crusoe API Secret Key"
  type        = string
  default     = "<SECRET_KEY>"
}
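If you would rather not keep real keys in a file, Terraform also reads any environment variable named TF_VAR_<name> as the value of the input variable <name>, and that value takes precedence over the default in variables.tf. The sketch below is one possible approach, not part of this repository; CRUSOE_ACCESS_KEY and CRUSOE_SECRET_KEY are hypothetical variable names chosen for the example.

```shell
#!/usr/bin/env bash
# Export Terraform input variables from the caller's environment.
# CRUSOE_ACCESS_KEY and CRUSOE_SECRET_KEY are hypothetical names used here.
export_tf_credentials() {
  if [ -z "${CRUSOE_ACCESS_KEY:-}" ] || [ -z "${CRUSOE_SECRET_KEY:-}" ]; then
    echo "set CRUSOE_ACCESS_KEY and CRUSOE_SECRET_KEY first" >&2
    return 1
  fi
  # Terraform maps TF_VAR_access_key to variable "access_key", and so on.
  export TF_VAR_access_key="$CRUSOE_ACCESS_KEY"
  export TF_VAR_secret_key="$CRUSOE_SECRET_KEY"
}
```

Call export_tf_credentials in the same shell session before running Terraform so the exported values are visible to it.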

Step 4. In the main.tf file, replace the local values: provide the path to your private SSH key and the string of your public key, and choose an instance type for the headnode.

locals {
  my_ssh_privkey_path="/Users/amrragab/.ssh/id_ed25519"
  my_ssh_pubkey="ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIIdc3Aaj8RP7ru1oSxUuehTRkpYfvxTxpvyJEZqlqyze <user@host>"
  headnode_instance_type="a100-80gb.1x"
}
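To avoid copy and paste mistakes with the public key string, the locals block can be generated from the key files. The helper below is a sketch with a hypothetical function name; it assumes the public key sits next to the private key with a .pub suffix, which is where ssh-keygen writes it by default.

```shell
#!/usr/bin/env bash
# Print a locals block for main.tf from a private key path.
# Assumes the public key is stored at <private key path>.pub.
print_locals() {
  local privkey="$1" instance_type="${2:-a100-80gb.1x}"
  local pubkey
  pubkey="$(cat "${privkey}.pub")" || return 1
  cat <<EOF
locals {
  my_ssh_privkey_path="${privkey}"
  my_ssh_pubkey="${pubkey}"
  headnode_instance_type="${instance_type}"
}
EOF
}

# Example: print_locals "$HOME/.ssh/id_ed25519"
```

The second argument overrides the headnode instance type; when omitted, the value from the example above is used.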

Step 5. Run the Terraform script

terraform init
terraform plan
terraform apply
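A slightly more careful variant of the same workflow saves the plan to a file and applies exactly that plan, so what you reviewed is what gets created. The wrapper below is a sketch; the tfplan file name and the DRY_RUN switch are conventions introduced for this example.

```shell
#!/usr/bin/env bash
# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Initialize, write the plan to a file, then apply that saved plan.
deploy_headnode() {
  run terraform init &&
  run terraform plan -out=tfplan &&
  run terraform apply tfplan
}

# Print the commands without executing them
DRY_RUN=1 deploy_headnode
```

Calling deploy_headnode without DRY_RUN set runs the three commands for real and stops at the first one that fails.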