SLURM observability cookbook #3
base: main
#### 2. *Create Users and Directories*

_On all nodes_
For commands that must be run on all compute nodes, can we recommend a tool like pssh or clush ( https://clustershell.readthedocs.io/en/latest/tools/clush.html ) to make it easier to get started with large cluster management? Perhaps this warrants another mini section on setting up a hostfile for parallel VM management.
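A rough sketch of what such a mini section could show, assuming a plain-text hostfile with one hostname per line. The file name `hosts.txt`, the node names, and the helper names `list_hosts` and `run_on_all` are hypothetical placeholders, not part of the cookbook. It uses clush when it is installed and falls back to a sequential ssh loop otherwise.

```shell
# Sketch only: hostfile name, node names, and helper names are hypothetical.

# Print usable hostnames from a hostfile, skipping blank lines and '#' comments.
list_hosts() {
  grep -Ev '^[[:space:]]*(#|$)' "$1"
}

# Run one command on every host listed in the hostfile.
# Usage: run_on_all HOSTFILE COMMAND...
run_on_all() {
  roa_hostfile="$1"
  shift
  if command -v clush >/dev/null 2>&1; then
    # -w takes a comma-separated node list; -b gathers identical output.
    clush -w "$(list_hosts "$roa_hostfile" | paste -sd, -)" -b "$@"
  else
    # Fallback: plain ssh, one node at a time (-n keeps ssh off stdin).
    list_hosts "$roa_hostfile" | while read -r roa_host; do
      ssh -n "$roa_host" "$@"
    done
  fi
}

# Example (not executed here):
#   printf 'node01\nnode02\n' > hosts.txt
#   run_on_all hosts.txt 'hostname'
```

Both paths assume passwordless SSH from the machine running the script to every node in the hostfile.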
After internal discussion, we suggest using Ansible for this. In fact, the SLURM observability role has been submitted as a PR (I'm going to take it out of draft mode and mark it ready for review today, after final testing).
We propose this because Ansible is already widely used at Crusoe and is also present in the GPUd cookbook. To avoid overloading the main SLURM repo, we can also extract it into its own playbook.
- Operating System: Ubuntu 20.04 or newer
- Firewall: Ability to configure firewall rules for required ports

### Security Considerations
Are these inbound rules required only for the head node, or for all VMs in the VPC? We allow all egress traffic by default, but for security we should expose only the necessary inbound ports.
Only the head node. I'll add a note about that.
--gpus all \
-p 9400:9400 \
--name dcgm-exporter \
nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04
This version of dcgm-exporter crash-loops with the following error:
time="2025-01-13T22:13:40Z" level=info msg="DCGM successfully initialized!"
time="2025-01-13T22:13:40Z" level=info msg="Collecting DCP Metrics"
time="2025-01-13T22:13:40Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/default-counters.csv'"
time="2025-01-13T22:13:40Z" level=info msg="Initializing system entities of type: GPU"
time="2025-01-13T22:13:40Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2025-01-13T22:13:40Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2025-01-13T22:13:40Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2025-01-13T22:13:40Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2025-01-13T22:13:40Z" level=fatal msg="Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"
Instead, I tested with the latest available image version, and it gives me GPU metrics on the /metrics endpoint. However, dcgmi discovery fails with the following error when run inside the container. I think this is expected: dcgmi discovery works properly on the host/compute node and lists all the GPUs correctly, so I think we can ignore this?
Error: unable to establish a connection to the specified host: localhost
Error: Unable to connect to host engine. Host engine connection invalid/disconnected.
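For reference, a minimal sketch of the kind of check described above, namely confirming that the exporter serves GPU metrics on /metrics. It assumes the exporter is reachable at `http://localhost:9400/metrics` (from the `-p 9400:9400` mapping) and that GPU sample lines start with the `DCGM_` prefix. The helper names `count_dcgm_samples` and `check_dcgm_exporter` are hypothetical.

```shell
# Sketch only: helper names are hypothetical; the URL assumes -p 9400:9400.

# Count metric sample lines starting with DCGM_ in Prometheus text on stdin.
# '# HELP' and '# TYPE' comment lines are not counted.
count_dcgm_samples() {
  grep -c '^DCGM_' || true
}

# Fetch the metrics endpoint and report whether any DCGM samples are present.
# Usage: check_dcgm_exporter [URL]
check_dcgm_exporter() {
  cde_url="${1:-http://localhost:9400/metrics}"
  cde_count=$(curl -fsS "$cde_url" | count_dcgm_samples)
  if [ "${cde_count:-0}" -gt 0 ]; then
    echo "ok: $cde_count DCGM samples at $cde_url"
  else
    echo "no DCGM samples at $cde_url" >&2
    return 1
  fi
}

# Example (not executed here): check_dcgm_exporter
```

A non-zero exit status means the endpoint was unreachable or returned no GPU samples, which makes the function usable as a simple post-deploy check on each compute node.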
Summary
This PR contains a cookbook for a SLURM observability stack.
For new content
When contributing new content, read through our contribution guidelines, and mark the following action items as completed:
We will rate each of these areas on a scale from 1 to 4, and will only accept contributions that score 3 or higher on all areas. Refer to our contribution guidelines for more details.