GPU Model Too Powerful

This alert identifies jobs that ran on GPUs that were more powerful than necessary. For example, it can find jobs that ran on an NVIDIA H100 GPU but could have used the less powerful L40S GPU or MIG. The GPU utilization, CPU/GPU memory usage, and number of allocated CPU-cores are taken into account when identifying jobs.

Info

Currently only jobs that allocate 1 GPU are considered.

Configuration File

Here is an example entry for config.yaml:

gpu-model-too-powerful:
  cluster: della
  partitions:
    - gpu
  min_run_time:        61 # minutes
  gpu_hours_threshold: 24 # gpu-hours
  num_cores_threshold:  1 # count
  gpu_util_threshold:  15 # percent
  gpu_mem_threshold:   10 # GB
  cpu_mem_threshold:   32 # GB
  gpu_util_target:     50 # percent
  email_file: "gpu_model_too_powerful.txt"
  admin_emails:
    - admin@institution.edu

The available settings are listed below:

  • cluster: Specify the cluster name as it appears in the Slurm database.

  • partitions: Specify one or more Slurm partitions.

  • gpu_hours_threshold: Minimum number of GPU-hours (summed over the jobs) for the user to be included.

  • num_cores_threshold: Only jobs that allocate this many CPU-cores or fewer will be included.

  • gpu_util_threshold: Jobs with a mean GPU utilization at or below this value will be included.

  • gpu_mem_threshold: Jobs that used less than this amount of GPU memory (in GB) will be included.

  • cpu_mem_threshold: Jobs that used less than this amount of CPU memory (in GB) will be included.

  • gpu_util_target: The minimum acceptable GPU utilization for a user.

  • email_file: The text file to be used for the email message.

  • min_run_time: (Optional) The minimum run time of a job, in minutes, for it to be considered. Jobs that ran for less than this time are ignored. Default: 0

  • include_running_jobs: (Optional) If True then jobs in a state of RUNNING will be included in the calculation. The Prometheus server must be queried for each running job, which can be an expensive operation. Default: False

  • nodelist: (Optional) Only apply this alert to jobs that ran on the specified nodes. See example.

  • excluded_users: (Optional) List of users to exclude from receiving emails.

  • admin_emails: (Optional) List of administrator email addresses that should receive copies of the emails that are sent to users.

  • email_subject: (Optional) Subject of the email message to users.

  • report_title: (Optional) Title of the report to system administrators.
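
Putting the per-job thresholds together, the selection logic can be sketched as follows. This is a simplified illustration, not the tool's actual code; the job field names are assumptions:

```python
def job_qualifies(job, cfg):
    """Rough sketch: does a single-GPU job trip the alert?

    `job` holds per-job metrics; the keys used here are illustrative
    assumptions, not the tool's internal names.
    """
    return (
        job["gpus"] == 1                                      # only 1-GPU jobs are considered
        and job["run_time_minutes"] >= cfg["min_run_time"]    # ignore short jobs
        and job["cores"] <= cfg["num_cores_threshold"]        # few CPU-cores allocated
        and job["gpu_util"] <= cfg["gpu_util_threshold"]      # low mean GPU utilization
        and job["gpu_mem_gb"] < cfg["gpu_mem_threshold"]      # low GPU memory usage
        and job["cpu_mem_gb"] < cfg["cpu_mem_threshold"]      # low CPU memory usage
    )

cfg = {
    "min_run_time": 61,
    "num_cores_threshold": 1,
    "gpu_util_threshold": 15,
    "gpu_mem_threshold": 10,
    "cpu_mem_threshold": 32,
}
job = {"gpus": 1, "run_time_minutes": 200, "cores": 1,
       "gpu_util": 9, "gpu_mem_gb": 2, "cpu_mem_gb": 3}
print(job_qualifies(job, cfg))  # True
```

A user is then emailed only if the GPU-hours summed over their qualifying jobs meet gpu_hours_threshold.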

Nodelist

A nodelist can be specified to restrict the alert to jobs that ran on certain nodes within a partition.
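
For example, the entry below limits the alert to two nodes (the node names here are placeholders; use the names from your cluster):

```yaml
gpu-model-too-powerful:
  cluster: della
  partitions:
    - gpu
  nodelist:
    - della-l01g1
    - della-l01g2
  email_file: "gpu_model_too_powerful.txt"
```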

Report for System Administrators

$ python job_defense_shield.py --gpu-model-too-powerful

                       GPU Model Too Powerful                       
-----------------------------------------------------------
  User   GPU-Hours  Jobs            JobID            Emails
-----------------------------------------------------------
 u23157    321      5   62567707,62567708,62567709+   0   
 u55404    108      5   62520246,62520247,62861050+   1 (2)
 u89790     55      2            62560705,62669923    0   
-----------------------------------------------------------
   Cluster: della
Partitions: gpu
     Start: Tue Mar 11, 2025 at 10:50 PM
       End: Tue Mar 18, 2025 at 10:50 PM

Email Message to Users

Below is an example email message (see email/gpu_model_too_powerful.txt):

Hello Alan (u23157),

Below are jobs that ran on an A100 GPU on Della in the past 7 days:

   JobID    GPU-Util GPU-Mem-Used CPU-Mem-Used  Hours
  60984405     9%       2 GB         3 GB        3.4  
  60984542     8%       2 GB         3 GB        3.0  
  60989559     8%       2 GB         3 GB        2.8  

The jobs above have a low GPU utilization and they use less than 10 GB of GPU
memory and less than 32 GB of CPU memory. Such jobs could be run on the MIG
GPUs. A MIG GPU has 1/7th the performance and memory of an A100. To run on a
MIG GPU, add the "partition" directive to your Slurm script:

  #SBATCH --nodes=1
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=1
  #SBATCH --gres=gpu:1
  #SBATCH --partition=mig

For interactive sessions use, for example:

  $ salloc --nodes=1 --ntasks=1 --time=1:00:00 --gres=gpu:1 --partition=mig

Replying to this automated email will open a support ticket with Research
Computing.

Placeholders

The following placeholders can be used in the email file:

  • <GREETING>: The greeting generated by greeting-method.
  • <CLUSTER>: The cluster specified for the alert.
  • <PARTITIONS>: The partitions listed for the alert.
  • <TARGET>: The minimum acceptable GPU utilization (i.e., gpu_util_target).
  • <DAYS>: Number of days in the time window (default is 7).
  • <NUM-JOBS>: Number of jobs that are using GPUs that are too powerful.
  • <TABLE>: Table of job data.
  • <JOBSTATS>: The jobstats command for the first job of the user.

Usage

Generate a report of the users that are using GPUs that are more powerful than necessary:

$ python job_defense_shield.py --gpu-model-too-powerful

Send emails to the offending users:

$ python job_defense_shield.py --gpu-model-too-powerful --email

See which users have received emails and when:

$ python job_defense_shield.py --gpu-model-too-powerful --check

cron

Below is an example entry for crontab:

0 9 * * * /path/to/python /path/to/job_defense_shield.py --gpu-model-too-powerful --email > /path/to/log/gpu_model_too_powerful.log 2>&1