GPU Model Too Powerful
This alert identifies jobs that ran on GPUs that were more powerful than necessary. For example, it can find jobs that ran on an NVIDIA H100 GPU but could have used the less powerful L40S GPU or MIG. The GPU utilization, CPU/GPU memory usage, and number of allocated CPU-cores are taken into account when identifying jobs.
Info
Currently only jobs that allocate 1 GPU are considered.
Configuration File
Here is an example entry for config.yaml:
gpu-model-too-powerful:
  cluster: della
  partitions:
    - gpu
  min_run_time: 61          # minutes
  gpu_hours_threshold: 24   # gpu-hours
  num_cores_threshold: 1    # count
  gpu_util_threshold: 15    # percent
  gpu_mem_threshold: 10     # GB
  cpu_mem_threshold: 32     # GB
  gpu_util_target: 50       # percent
  email_file: "gpu_model_too_powerful.txt"
  admin_emails:
    - admin@institution.edu
The available settings are listed below:
- cluster: Specify the cluster name as it appears in the Slurm database.
- partitions: Specify one or more Slurm partitions.
- gpu_hours_threshold: Minimum number of GPU-hours (summed over the jobs) for the user to be included.
- num_cores_threshold: Only jobs that allocate a number of CPU-cores equal to or less than this value will be included.
- gpu_util_threshold: Jobs with a mean GPU utilization less than or equal to this value will be included.
- gpu_mem_threshold: Jobs that used less than this amount of GPU memory (in units of GB) will be considered.
- cpu_mem_threshold: Jobs that used less than this amount of CPU memory (in units of GB) will be considered.
- gpu_util_target: The minimum acceptable GPU utilization for a user.
- email_file: The text file to be used for the email message.
- min_run_time: (Optional) The minimum run time of a job for it to be considered. Jobs that did not run longer than this limit will be ignored. Default: 0
- include_running_jobs: (Optional) If True then jobs in a state of RUNNING will be included in the calculation. The Prometheus server must be queried for each running job, which can be an expensive operation. Default: False
- nodelist: (Optional) Only apply this alert to jobs that ran on the specified nodes. See the example below.
- excluded_users: (Optional) List of users to exclude from receiving emails.
- admin_emails: (Optional) List of administrator email addresses that should receive copies of the emails that are sent to users.
- email_subject: (Optional) Subject of the email message to users.
- report_title: (Optional) Title of the report to system administrators.
Nodelist
Be aware that a nodelist can be specified. This makes it possible to isolate jobs that ran on certain nodes within a partition.
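For instance, an entry like the one below limits the alert to specific nodes (the node names are placeholders for illustration; substitute the names from your own cluster):
gpu-model-too-powerful:
  cluster: della
  partitions:
    - gpu
  nodelist:                 # placeholder node names
    - della-gpu-01
    - della-gpu-02
  # remaining settings as in the example above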
Report for System Administrators
$ python job_defense_shield.py --gpu-model-too-powerful
GPU Model Too Powerful
-----------------------------------------------------------
User       GPU-Hours  Jobs  JobID                         Emails
-----------------------------------------------------------
u23157           321     5  62567707,62567708,62567709+     0
u55404           108     5  62520246,62520247,62861050+     1 (2)
u89790            55     2  62560705,62669923               0
-----------------------------------------------------------
Cluster: della
Partitions: gpu
Start: Tue Mar 11, 2025 at 10:50 PM
End: Tue Mar 18, 2025 at 10:50 PM
Email Message to Users
Below is an example email message (see email/gpu_model_too_powerful.txt):
Hello Alan (u23157),
Below are jobs that ran on an A100 GPU on Della in the past 7 days:
   JobID  GPU-Util  GPU-Mem-Used  CPU-Mem-Used  Hours
60984405        9%          2 GB          3 GB    3.4
60984542        8%          2 GB          3 GB    3.0
60989559        8%          2 GB          3 GB    2.8
The jobs above have a low GPU utilization and they use less than 10 GB of GPU
memory and less than 32 GB of CPU memory. Such jobs could be run on the MIG
GPUs. A MIG GPU has 1/7th the performance and memory of an A100. To run on a
MIG GPU, add the "partition" directive to your Slurm script:
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
#SBATCH --partition=mig
For interactive sessions use, for example:
$ salloc --nodes=1 --ntasks=1 --time=1:00:00 --gres=gpu:1 --partition=mig
Replying to this automated email will open a support ticket with Research
Computing.
Placeholders
The following placeholders can be used in the email file:
- <GREETING>: The greeting generated by greeting-method.
- <CLUSTER>: The cluster specified for the alert.
- <PARTITIONS>: The partitions listed for the alert.
- <TARGET>: The minimum acceptable GPU utilization (i.e., gpu_util_target).
- <DAYS>: Number of days in the time window (default is 7).
- <NUM-JOBS>: Number of jobs that are using GPUs that are too powerful.
- <TABLE>: Table of job data.
- <JOBSTATS>: The jobstats command for the first job of the user.
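As a rough sketch, the body of an email file might combine these placeholders as follows (the wording is illustrative, not the shipped template):
<GREETING>
Below are jobs that ran on an A100 GPU on <CLUSTER> in the past <DAYS> days:
<TABLE>
The jobs above have a low GPU utilization. For more information about a given job:
  <JOBSTATS>
Replying to this automated email will open a support ticket with Research
Computing.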
Usage
Generate a report of the users that are using GPUs that are more powerful than necessary:
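$ python job_defense_shield.py --gpu-model-too-powerful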
Send emails to the offending users:
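For example, assuming the same --email flag used by the other Job Defense Shield alerts:
$ python job_defense_shield.py --gpu-model-too-powerful --email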
See which users have received emails and when:
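For example, assuming the --check flag, which reports which users were emailed and when:
$ python job_defense_shield.py --gpu-model-too-powerful --check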
cron
Below is an example entry for crontab:
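The schedule and paths below are placeholders; adapt them to your installation:
0 9 * * 1-5 /path/to/python /path/to/job_defense_shield.py --gpu-model-too-powerful --email > /path/to/gpu_model_too_powerful.log 2>&1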