
Low GPU Utilization

This alert identifies users with low GPU efficiency.

Configuration File

Below is an example entry for config.yaml:

low-gpu-efficiency-1:
  cluster: della
  partitions:
    - gpu
  min_run_time: 30          # minutes
  eff_thres_pct: 25         # percent
  eff_target_pct: 50        # percent
  absolute_thres_hours: 50  # gpu-hours
  proportion_thres_pct: 2   # percent
  num_top_users: 15         # count
  email_file: "low_gpu_efficiency.txt"
  admin_emails:
    - admin@institution.edu

The available settings are listed below:

  • cluster: Specify the cluster name as it appears in the Slurm database.

  • partitions: Specify one or more Slurm partitions. The number of GPU-hours is summed over all partitions.

  • eff_thres_pct: Efficiency threshold percentage. Users with a GPU efficiency less than or equal to this value are considered for an email.

  • eff_target_pct: The minimum acceptable GPU utilization for a user. This quantity is not used in any calculations but it can be referenced using <TARGET> in the email file.

  • absolute_thres_hours: Absolute threshold hours. A user must have allocated more than this number of GPU-hours to receive an email.

  • email_file: The text file to be used as the email message to users.

  • num_top_users: (Optional) After sorting all users by GPU-hours, only consider the top num_top_users for all remaining calculations and emails. This is used to limit the number of users that receive emails and appear in reports. Default: 15

  • min_run_time: (Optional) The minimum number of minutes that a job must have run to be considered. Default: 0

  • proportion_thres_pct: (Optional) Proportion threshold percentage. A user must be using at least this proportion of the total GPU-hours (as a percentage) in order to be sent an email. For example, setting this to 2 will exclude all users that are using less than 2% of the total GPU-hours. Default: 0

  • excluded_users: (Optional) List of usernames to exclude from receiving emails.

  • admin_emails: (Optional) List of administrator email addresses that should receive copies of the emails that are sent to users.

  • email_subject: (Optional) Subject of the email message to users.

  • report_title: (Optional) Title of the report to system administrators.
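
The settings above can be read together as a sequence of filters. The Python sketch below is an illustration only, not the actual implementation in job_defense_shield.py. The job fields (user, minutes, gpus, eff_pct), the function name select_users, the weighting of the mean efficiency by GPU-hours, and the choice to compute the proportion against the total over all users are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical settings mirroring the example config.yaml entry above.
settings = {
    "min_run_time": 30,          # minutes
    "eff_thres_pct": 25,         # percent
    "absolute_thres_hours": 50,  # gpu-hours
    "proportion_thres_pct": 2,   # percent
    "num_top_users": 15,         # count
    "excluded_users": [],
}


def select_users(jobs, cfg):
    """Return (user, gpu_hours, proportion_pct, eff_pct) rows passing every threshold.

    Each job is a hypothetical dict with the keys: user, minutes, gpus, eff_pct.
    """
    gpu_hours = defaultdict(float)
    weighted_eff = defaultdict(float)
    for job in jobs:
        # min_run_time: ignore jobs that ran for less than this many minutes.
        if job["minutes"] < cfg.get("min_run_time", 0):
            continue
        hours = job["gpus"] * job["minutes"] / 60
        if hours <= 0:
            continue
        gpu_hours[job["user"]] += hours
        # Assumption: the mean efficiency is weighted by GPU-hours.
        weighted_eff[job["user"]] += hours * job["eff_pct"]

    # Assumption: proportions are relative to the GPU-hours of all users.
    total = sum(gpu_hours.values())
    if total == 0:
        return []

    # num_top_users: keep only the largest consumers of GPU-hours.
    ranked = sorted(gpu_hours.items(), key=lambda kv: kv[1], reverse=True)
    top = ranked[: cfg.get("num_top_users", 15)]

    rows = []
    for user, hours in top:
        eff = weighted_eff[user] / hours
        proportion = 100 * hours / total
        if (
            eff <= cfg["eff_thres_pct"]                              # at or below threshold
            and hours > cfg["absolute_thres_hours"]                  # more than N GPU-hours
            and proportion >= cfg.get("proportion_thres_pct", 0)     # at least N percent
            and user not in cfg.get("excluded_users", [])
        ):
            rows.append((user, round(hours), round(proportion), round(eff)))
    return rows
```

Calling select_users(jobs, settings) with a list of such job records returns only the users who would be candidates for an email under these assumptions.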

Report for System Administrators

Here is an example of the report:

$ python job_defense_shield.py --low-gpu-efficiency

                    Low GPU Efficiencies                                      
------------------------------------------------------------
 User   GPU-Hours  Proportion(%)  GPU-Eff(%)  Jobs  AvgCores
------------------------------------------------------------
u76174    3791         20             19       58     1.2  
u64732    3201         17             15       43     1.0 
u13301    2281         12              9       35     8.0
------------------------------------------------------------
   Cluster: della
Partitions: gpu
     Start: Wed Mar 12, 2025 at 09:56 AM
       End: Wed Mar 19, 2025 at 09:56 AM

Email Message to Users

Below is an example email (see email/low_gpu_efficiencies.txt):

Hello Alan (u12345),

Over the last 7 days you have used the 3rd most GPU-hours on della (pli-c) but
your mean GPU efficiency is only 29%:

     User  Partition(s)  Jobs  GPU-hours GPU-rank Efficiency
    u12345    pli-c       65     2345      3/53      29%    

A good target value for "Efficiency" is 75% and above. Please investigate the
reason for the low efficiency.

Replying to this automated email will open a support ticket with Research
Computing.

Placeholders

The following placeholders can be used in the email file:

  • <GREETING>: The greeting generated by greeting-method.
  • <CLUSTER>: The name of the cluster as defined in config.yaml.
  • <PARTITIONS>: A comma-separated list of the partitions used by the user.
  • <EFFICIENCY>: The mean GPU efficiency of the user.
  • <RANK>: The GPU-rank of the user as determined by the number of allocated GPU-hours.
  • <DAYS>: Number of days in the time window (default is 7).
  • <TARGET>: The value of eff_target_pct from the alert settings.
  • <TABLE>: A table showing the mean efficiency and other values for the user.
  • <JOBSTATS>: The jobstats command for the first job of the user.
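
To illustrate how such placeholders behave, the Python sketch below fills a template using plain string replacement. It is a minimal example, not the code used by job_defense_shield.py. The function name fill_placeholders, the template text, and the sample values are hypothetical.

```python
def fill_placeholders(template: str, values: dict) -> str:
    """Replace each <NAME> placeholder in the template with its value."""
    for name, value in values.items():
        template = template.replace(f"<{name}>", str(value))
    return template


# Hypothetical email file contents using some of the placeholders listed above.
template = (
    "<GREETING>\n\n"
    "Over the last <DAYS> days your mean GPU efficiency on <CLUSTER> "
    "(<PARTITIONS>) was <EFFICIENCY>%.\n"
    "A good target value is <TARGET>% and above.\n"
)

# Sample values chosen for illustration only.
values = {
    "GREETING": "Hello Alan (u12345),",
    "DAYS": 7,
    "CLUSTER": "della",
    "PARTITIONS": "gpu",
    "EFFICIENCY": 19,
    "TARGET": 50,
}

message = fill_placeholders(template, values)
print(message)
```

Placeholders that are absent from the values dictionary are left unchanged in this sketch, which makes a missing value easy to spot in the output.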

Usage

Generate a report of the top users with low GPU efficiencies:

$ python job_defense_shield.py --low-gpu-efficiency

Send emails to users with low GPU efficiencies over the past 7 days:

$ python job_defense_shield.py --low-gpu-efficiency --email

See which users have received emails and when:

$ python job_defense_shield.py --low-gpu-efficiency --check

cron

Below is an example entry for crontab:

0 9 * * * /path/to/python /path/to/job_defense_shield.py --low-gpu-efficiency --email > /path/to/log/low_gpu_efficiency.log 2>&1