Multinode GPU Fragmentation
This alert identifies multinode GPU jobs that use more nodes than necessary by allocating too few GPUs per node.
Consider a cluster with 4 GPUs per node. A user can run a job with 8 GPUs by either (1) allocating 4 GPUs on each of 2 nodes or (2) allocating 1 GPU on each of 8 nodes. The former is, in general, strongly preferred. This alert catches jobs doing the latter, i.e., multinode jobs that allocate fewer GPUs per node than are available (e.g., 1 GPU on each of 8 nodes).
Jobs with a GPU at 0% utilization are ignored since they will be captured by either Cancel 0% GPU Jobs or GPU-hours at 0% Utilization.
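To make the criterion concrete, here is a minimal sketch of the check in Python (illustrative only, not the tool's actual implementation):

    def is_fragmented(nodes: int, gpus: int, gpus_per_node: int) -> bool:
        """Return True for a multinode job that under-fills its GPU nodes."""
        return nodes > 1 and gpus / nodes < gpus_per_node

    assert is_fragmented(nodes=8, gpus=8, gpus_per_node=4)      # 1 GPU per node
    assert not is_fragmented(nodes=2, gpus=8, gpus_per_node=4)  # 4 GPUs per node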
Configuration File
Below is an example entry for `config.yaml`:
multinode-gpu-fragmentation-1:
  cluster: della
  partitions:
    - llm
  gpus_per_node: 8   # count
  min_run_time: 61   # minutes
  email_file: "multinode_gpu_fragmentation.txt"
  admin_emails:
    - admin@institution.edu
The available settings are listed below:
- `cluster`: Specify the cluster name as it appears in the Slurm database.
- `partitions`: Specify one or more Slurm partitions.
- `gpus_per_node`: The number of GPUs per node.
- `email_file`: The text file to be used as the email message to users.
- `min_run_time`: (Optional) The number of minutes that a job must have run to be considered. This can be used to exclude test jobs and experimental jobs. Default: 0
- `include_running_jobs`: (Optional) If `True` then jobs in a state of `RUNNING` will be included in the calculation. The Prometheus server must be queried for each running job, which can be an expensive operation. Default: False
- `nodelist`: (Optional) Only apply this alert to jobs that ran on the specified nodes (see the example after this list).
- `excluded_users`: (Optional) List of usernames to exclude from receiving emails.
- `admin_emails`: (Optional) List of administrator email addresses that should receive copies of the emails that are sent to users.
- `email_subject`: (Optional) Subject of the email message to users.
- `report_title`: (Optional) Title of the report to system administrators.
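For instance, a hypothetical second entry that restricts the alert to specific nodes might look like the following sketch, which assumes `nodelist` accepts a list of node names (the partition and node names here are placeholders):

multinode-gpu-fragmentation-2:
  cluster: della
  partitions:
    - gpu
  gpus_per_node: 4
  email_file: "multinode_gpu_fragmentation.txt"
  nodelist:
    - della-l01g1
    - della-l01g2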
Report for System Administrators
Below is an example report for system administrators:
$ python job_defense_shield.py --multinode-gpu-fragmentation
Multinode GPU Jobs with Fragmentation
----------------------------------------------------------------
JobID User GPUs Nodes GPUs-per-Node Hours GPU-Eff Emails
----------------------------------------------------------------
62666282 u42994 8 2 4 7 98% 0
62666283 u42994 8 2 4 6 98% 0
62666284 u42994 8 2 4 6 99% 0
62666285 u42994 8 2 4 7 98% 0
62666286 u42994 8 2 4 6 98% 0
62666287 u42994 8 2 4 6 98% 0
62666288 u42994 8 2 4 7 98% 0
62666289 u42994 8 2 4 7 99% 0
62666290 u42994 8 2 4 7 98% 0
62666291 u42994 8 2 4 7 98% 0
62816261 u18375 4 2 2 1.1 10% 1 (5)
----------------------------------------------------------------
Cluster: della
Partitions: llm
Start: Mon Mar 12, 2025 at 11:18 AM
End: Wed Mar 19, 2025 at 11:18 AM
Email Message to Users
Below is an example email message to a user (see `email/multinode_gpu_fragmentation.txt`):
Hello Alan (aturing),
Below are jobs that ran on Della in the past 7 days that used only some of the
available GPUs on each node while running over multiple nodes:
JobID User GPUs Nodes GPUs-per-Node Hours State GPU-eff
60969293 aturing 4 2 2 24 TO 0%
The GPU nodes on Della have 8 GPUs per node. For future jobs, please try to
use as few nodes as possible by allocating more GPUs per node. This is done
by modifying the --gres Slurm directive.
Replying to this automated email will open a support ticket with Research
Computing.
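The fix suggested in the email amounts to packing the GPUs onto fewer nodes. For an 8-GPU job on nodes with 8 GPUs each, the relevant Slurm directives would look roughly like this (other directives omitted):

    #SBATCH --nodes=1
    #SBATCH --gres=gpu:8

rather than the fragmented request:

    #SBATCH --nodes=8
    #SBATCH --gres=gpu:1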
Placeholders
The following placeholders can be used in the email file:
- `<GREETING>`: Greeting generated by `greeting_method`.
- `<CLUSTER>`: Name of the cluster specified for the alert.
- `<PARTITIONS>`: A comma-separated list of partitions as defined for the alert in `config.yaml`.
- `<GPUS-PER-NODE>`: Number of GPUs per node (i.e., `gpus_per_node`).
- `<TABLE>`: Table of job data.
- `<DAYS>`: Number of days in the time window (default is 7).
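As an illustration, the example email above could be generated from a template along these lines (a sketch of what `email/multinode_gpu_fragmentation.txt` might contain, not necessarily the exact file shipped with the tool):

<GREETING>

Below are jobs that ran on <CLUSTER> in the past <DAYS> days that used fewer
GPUs per node than are available:

<TABLE>

The GPU nodes on <CLUSTER> have <GPUS-PER-NODE> GPUs per node. For future jobs,
please try to use as few nodes as possible by allocating more GPUs per node.
This is done by modifying the --gres Slurm directive.

Replying to this automated email will open a support ticket with Research
Computing.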
Usage
Generate a report for system administrators:
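$ python job_defense_shield.py --multinode-gpu-fragmentation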
Send emails to the offending users:
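(assuming the tool's standard `--email` flag is used to actually send the messages)

$ python job_defense_shield.py --multinode-gpu-fragmentation --email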
See which users have received emails and when:
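(a sketch, assuming your installation provides a `--check` flag for this purpose)

$ python job_defense_shield.py --multinode-gpu-fragmentation --check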
cron
Below is an example entry for `crontab`:
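A possible entry, assuming a weekly run on Monday mornings with emails enabled (the schedule, paths, and log file are placeholders to adapt to your installation):

0 9 * * 1 /path/to/python /path/to/job_defense_shield.py --multinode-gpu-fragmentation --email > /path/to/multinode_gpu_fragmentation.log 2>&1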