Low GPU Utilization
This alert identifies users with low GPU efficiency.
Configuration File
Below is an example entry for config.yaml:
low-gpu-efficiency-1:
  cluster: della
  partitions:
    - gpu
  min_run_time:          30  # minutes
  eff_thres_pct:         25  # percent
  eff_target_pct:        50  # percent
  absolute_thres_hours:  50  # gpu-hours
  proportion_thres_pct:   2  # percent
  num_top_users:         15  # count
  email_file: "low_gpu_efficiency.txt"
  admin_emails:
    - admin@institution.edu
The available settings are listed below:
- cluster: Specify the cluster name as it appears in the Slurm database.
- partitions: Specify one or more Slurm partitions. The number of GPU-hours is summed over all partitions.
- eff_thres_pct: Efficiency threshold percentage. Users with a GPU efficiency less than or equal to this value are candidates to receive an email.
- eff_target_pct: The minimum acceptable GPU utilization for a user. This quantity is not used in any calculations, but it can be referenced using <TARGET> in the email file.
- absolute_thres_hours: Absolute threshold hours. A user must have allocated more than this number of GPU-hours to receive an email.
- email_file: The text file to be used as the email message to users.
- num_top_users: (Optional) After sorting all users by GPU-hours, only consider the top num_top_users for all remaining calculations and emails. This is used to limit the number of users that receive emails and appear in reports. Default: 15
- min_run_time: (Optional) The number of minutes that a job must have run to be considered. Default: 0
- proportion_thres_pct: (Optional) Proportion threshold percentage. A user must be using at least this proportion of the total GPU-hours (as a percentage) to be sent an email. For example, setting this to 2 excludes all users that are using less than 2% of the total GPU-hours. Default: 0
- excluded_users: (Optional) List of usernames to exclude from receiving emails.
- admin_emails: (Optional) List of administrator email addresses that should receive copies of the emails that are sent to users.
- email_subject: (Optional) Subject of the email message to users.
- report_title: (Optional) Title of the report to system administrators.
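The way these thresholds combine can be illustrated with a minimal Python sketch. It is not the tool's actual implementation: the `UserUsage` record and the `select_users` function are hypothetical names, and the sketch assumes that the total GPU-hours are computed over all users before the `num_top_users` cut and that the filters are applied in the order shown. The job-level `min_run_time` filter is omitted.

```python
from dataclasses import dataclass


@dataclass
class UserUsage:
    """Hypothetical per-user summary over the time window."""
    user: str
    gpu_hours: float       # allocated GPU-hours summed over the partitions
    efficiency_pct: float  # mean GPU efficiency in percent


def select_users(usage, eff_thres_pct, absolute_thres_hours,
                 num_top_users=15, proportion_thres_pct=0,
                 excluded_users=()):
    """Return usernames that pass every threshold, ordered by GPU-hours."""
    # Assumption: proportions are relative to the usage of all users.
    total_hours = sum(u.gpu_hours for u in usage)
    # Sort by GPU-hours (descending) and keep only the top users.
    top = sorted(usage, key=lambda u: u.gpu_hours, reverse=True)[:num_top_users]
    selected = []
    for u in top:
        proportion_pct = 100 * u.gpu_hours / total_hours if total_hours else 0
        if (u.efficiency_pct <= eff_thres_pct          # low efficiency
                and u.gpu_hours > absolute_thres_hours  # enough absolute usage
                and proportion_pct >= proportion_thres_pct
                and u.user not in excluded_users):
            selected.append(u.user)
    return selected


# Made-up data for illustration only.
usage = [
    UserUsage("alice", 400, 20),
    UserUsage("bob", 300, 60),
    UserUsage("carol", 200, 25),
    UserUsage("dave", 40, 5),
    UserUsage("erin", 60, 10),
]
print(select_users(usage, eff_thres_pct=25, absolute_thres_hours=50,
                   proportion_thres_pct=2))
```

In this made-up data set, bob is skipped because his efficiency is above the threshold, and dave is skipped because he allocated fewer than 50 GPU-hours.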
Report for System Administrators
Here is an example of the report:
$ python job_defense_shield.py --low-gpu-efficiency

                    Low GPU Efficiencies
------------------------------------------------------------
User    GPU-Hours  Proportion(%)  GPU-Eff(%)  Jobs  AvgCores
------------------------------------------------------------
u76174       3791             20          19    58       1.2
u64732       3201             17          15    43       1.0
u13301       2281             12           9    35       8.0
------------------------------------------------------------
   Cluster: della
Partitions: gpu
     Start: Wed Mar 12, 2025 at 09:56 AM
       End: Wed Mar 19, 2025 at 09:56 AM
Email Message to Users
Below is an example email (see email/low_gpu_efficiencies.txt):
Hello Alan (u12345),
Over the last 7 days you have used the 3rd most GPU-hours on della (pli-c) but
your mean GPU efficiency is only 29%:
    User   Partition(s)  Jobs  GPU-hours  GPU-rank  Efficiency
   jg9945     pli-c       65     2345       3/53       29%
A good target value for "Efficiency" is 75% and above. Please investigate the
reason for the low efficiency.
Replying to this automated email will open a support ticket with Research
Computing.
Placeholders
The following placeholders can be used in the email file:
- <GREETING>: The greeting generated by greeting-method.
- <CLUSTER>: The name of the cluster as defined in config.yaml.
- <PARTITIONS>: A comma-separated list of the partitions used by the user.
- <EFFICIENCY>: The mean GPU efficiency of the user.
- <RANK>: The GPU-rank of the user as determined by the number of allocated GPU-hours.
- <DAYS>: Number of days in the time window (default is 7).
- <TARGET>: The value of eff_target_pct from the alert settings.
- <TABLE>: A table showing the mean efficiency and other values for the user.
- <JOBSTATS>: The jobstats command for the first job of the user.
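To make the substitution concrete, here is a minimal Python sketch of how placeholders of this form could be filled in. It is only an illustration, not the tool's own code: the `fill_template` function, the template text, and the values are hypothetical, and whether a value such as <EFFICIENCY> already includes a percent sign is an assumption.

```python
import re


def fill_template(template: str, values: dict) -> str:
    """Replace each <NAME> placeholder with values["NAME"] in a single pass.

    Placeholders without a matching key are left untouched.
    """
    def substitute(match):
        name = match.group(1)
        return str(values[name]) if name in values else match.group(0)

    return re.sub(r"<([A-Z]+)>", substitute, template)


# Hypothetical email file content and values, for illustration only.
template = (
    "<GREETING>\n"
    "Over the last <DAYS> days your mean GPU efficiency on <CLUSTER> "
    "(<PARTITIONS>) was <EFFICIENCY>%. "
    "A good target value is <TARGET>% and above.\n"
    "<JOBSTATS>"
)
values = {
    "GREETING": "Hello Alan (u12345),",
    "DAYS": 7,
    "CLUSTER": "della",
    "PARTITIONS": "gpu",
    "EFFICIENCY": 19,
    "TARGET": 50,
}
message = fill_template(template, values)
print(message)
```

Because <JOBSTATS> has no entry in `values`, it is left as-is in the output, which makes a missing value easy to spot when testing a new email file.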
Usage
Generate a report of the top users with low GPU efficiencies:
$ python job_defense_shield.py --low-gpu-efficiency
Send emails to users with low GPU efficiencies over the past 7 days:
See which users have received emails and when:
cron
Below is an example entry for crontab: