SLURM observability cookbook #3

piotrrojek · 2024-12-19T14:28:06Z

Summary

This PR adds a cookbook for the SLURM observability stack.

For new content

When contributing new content, read through our contribution guidelines, and mark the following action items as completed:

I have added a new entry in registry.yaml (and, optionally, in authors.yaml) so that my content renders on the cookbook website.
I have conducted a self-review of my content based on the contribution guidelines:
- Relevance: This content is related to building with Crusoe Cloud and is useful to others.
- Uniqueness: I have searched for related examples in the Crusoe Cookbook, and verified that my content offers new insights or unique information compared to existing documentation.
- Spelling and Grammar: I have checked for spelling or grammatical mistakes.
- Clarity: I have done a final read-through and verified that my submission is well-organized and easy to understand.
- Correctness: The information I include is correct and all of my code executes successfully.
- Completeness: I have explained everything fully, including all necessary references and citations.

We will rate each of these areas on a scale from 1 to 4, and will only accept contributions that score 3 or higher on all areas. Refer to our contribution guidelines for more details.

martin-cala1 · 2024-12-31T22:13:51Z

content/observability-slurm/README.md

+
+#### 2. *Create Users and Directories*
+
+_On all nodes_


For commands that must be run on all compute nodes, can we recommend a tool like pssh or clush ( https://clustershell.readthedocs.io/en/latest/tools/clush.html ) to make it easier to get started with managing large clusters? Perhaps this warrants another mini section on setting up a hostfile for parallel VM management.

piotrrojek · 2025-01-08

After internal discussion, we suggest using Ansible for this. In fact, the SLURM observability role has been submitted as a PR (I'm going to move it from draft to ready for review today, after final testing).
We propose this because Ansible is already widely used at Crusoe, and it's also used in the GPUd cookbook. To avoid overloading the main SLURM repo, we can also extract it into its own playbook.
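
To make the hostfile idea from this thread concrete, here is a minimal sketch, not the cookbook's or the Ansible role's method. It assumes a hypothetical hostfile with one hostname per line and `#` comments, and it only builds the `clush` command line (`-w` for the node list, `-b` to group identical output) without running it. The helper names and the `compute-0N` hostnames are made up for illustration.

```python
import shlex
from typing import Iterable, List


def parse_hostfile(lines: Iterable[str]) -> List[str]:
    """Return unique hostnames in order, skipping blank lines and # comments."""
    hosts: List[str] = []
    for line in lines:
        entry = line.split("#", 1)[0].strip()
        if entry and entry not in hosts:
            hosts.append(entry)
    return hosts


def build_clush_command(hosts: List[str], remote_command: str) -> List[str]:
    """Build an argv list for clush targeting every host in the list."""
    if not hosts:
        raise ValueError("hostfile contains no hosts")
    # -w takes a comma-separated node list; -b gathers identical output together.
    return ["clush", "-b", "-w", ",".join(hosts), remote_command]


# Hypothetical hostfile contents, inlined so the sketch is self-contained.
sample_hostfile = [
    "# compute nodes (hypothetical names)",
    "compute-01",
    "compute-02",
    "",
    "compute-03  # spare",
]
argv = build_clush_command(parse_hostfile(sample_hostfile), "hostname")
print(shlex.join(argv))
```

Returning an argv list rather than a single string keeps the remote command as one argument, so it can be passed to `subprocess.run` without extra quoting.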

martin-cala1 · 2024-12-31T22:14:51Z

content/observability-slurm/README.md

+- Operating System: Ubuntu 20.04 or newer
+- Firewall: Ability to configure firewall rules for required ports
+
+### Security Considerations


Are these inbound rules required only for the head node, or for all VMs in the VPC? We allow all egress traffic by default, but for security we should expose only the necessary inbound ports.

piotrrojek · 2025-01-08

Only the head node. I'll add a note about that.

ApekshaKhilari · 2025-01-13T22:38:02Z

content/observability-slurm/README.md

+  --gpus all \
+  -p 9400:9400 \
+  --name dcgm-exporter \
+  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04


This version of dcgm-exporter crash-loops with the following error:

time="2025-01-13T22:13:40Z" level=info msg="DCGM successfully initialized!"
time="2025-01-13T22:13:40Z" level=info msg="Collecting DCP Metrics"
time="2025-01-13T22:13:40Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/default-counters.csv'"
time="2025-01-13T22:13:40Z" level=info msg="Initializing system entities of type: GPU"
time="2025-01-13T22:13:40Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2025-01-13T22:13:40Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2025-01-13T22:13:40Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2025-01-13T22:13:40Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2025-01-13T22:13:40Z" level=fatal msg="Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"

Instead, I tested with the latest available image version, and it serves GPU metrics on the /metrics endpoint. However, dcgmi discovery fails with the following error when run from inside the container. I think this is expected: dcgmi discovery works properly on the host/compute node and lists all the GPUs correctly, so I think we can ignore this?

Error: unable to establish a connection to the specified host: localhost
Error: Unable to connect to host engine. Host engine connection invalid/disconnected.
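
One way to confirm that the exporter is serving GPU metrics, independently of dcgmi inside the container, is to scrape the endpoint and count the GPUs that appear. This is a minimal sketch under stated assumptions: the exporter is reachable at `http://localhost:9400/metrics` (the port comes from the `-p 9400:9400` mapping in the diff above), GPU series names start with `DCGM_FI_`, and each sample carries a `gpu="N"` label. The helper names are hypothetical.

```python
import re
import urllib.request
from typing import Set

# Matches the gpu label inside a Prometheus label set, e.g. {gpu="0",UUID="..."}.
GPU_LABEL = re.compile(r'[{,]\s*gpu="([^"]+)"')


def gpus_in_metrics(text: str) -> Set[str]:
    """Return the distinct gpu label values found on DCGM_FI_* sample lines."""
    gpus: Set[str] = set()
    for line in text.splitlines():
        if not line.startswith("DCGM_FI_"):
            continue  # skips "# HELP" / "# TYPE" comments and non-DCGM series
        match = GPU_LABEL.search(line)
        if match:
            gpus.add(match.group(1))
    return gpus


def fetch_metrics(url: str = "http://localhost:9400/metrics", timeout: float = 5.0) -> str:
    """Fetch the raw metrics text; the default URL is an assumption (see above)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

On a compute node, `sorted(gpus_in_metrics(fetch_metrics()))` should then list the same GPU indices that dcgmi discovery reports on the host. An empty result would suggest the exporter is up but not collecting GPU fields.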

piotrrojek added 2 commits December 19, 2024 15:27

SLURM observability cookbook

24a67d6

Add some security considerations, troubleshooting, log rotation

7730ca9

piotrrojek marked this pull request as ready for review December 23, 2024 14:25

martin-cala1 reviewed Dec 31, 2024


Add note about Grafana access on the head node

49e0b16

ApekshaKhilari reviewed Jan 13, 2025
