Excessive Run Time Limits for GPU Jobs
This alert identifies users that are using excessive run time limits for GPU jobs (e.g., requesting 3 days but only needing 3 hours).
Configuration File
Below is an example entry for config.yaml
:
excessive-time-gpu-1:
cluster: della
partitions:
- gpu
- llm
min_run_time: 61 # minutes
absolute_thres_hours: 1000 # unused gpu-hours
overall_ratio_threshold: 0.2 # [0.0, 1.0]
mean_ratio_threshold: 0.2 # [0.0, 1.0]
median_ratio_threshold: 0.2 # [0.0, 1.0]
num_top_users: 10 # count
num_jobs_display: 10 # count
email_file: "excessive_time.txt"
admin_emails:
- admin@institution.edu
The available settings are listed below:
-
cluster
: Specify the cluster name as it appears in the Slurm database. -
partitions
: Specify one or more Slurm partitions. -
absolute_thres_hours
: Minimum number of unused GPU-hours for the user to be considered to receive an email. -
overall_ratio_threshold
: Total used GPU-hours divided by total allocated GPU-hours. -
email_file
: The text file to be used for the email message to users. -
mean_ratio_threshold
: (Optional) Mean of the per job ratio of used GPU-hours to allocated GPU-hours. -
median_ratio_threshold
: (Optional) Same as above but for the median instead of the mean. -
min_run_time
: (Optional) Minimum run time in minutes for a job to be included in the calculation. For example, ifmin_run_time: 30
is used then jobs that ran for less than 30 minutes are ignored. Default: 0 -
num_top_users
: (Optional) Only consider the number of users equal tonum_top_users
after sorting by unused GPU-hours. Default: 10 -
num_jobs_display
: (Optional) Number of jobs to display in the email message to users. Default: 10 -
nodelist
: (Optional) Only apply this alert to jobs that ran on the specified nodes. See example. -
excluded_users
: (Optional) List of usernames to exclude from the alert. -
admin_emails
: (Optional) List of administrator email addresses that should receive copies of the emails that are sent to users. -
email_subject
: (Optional) Subject of the email message to users. -
report_title
: (Optional) Title of the report to system administrators.
Report for System Administrators
Below is an example report:
$ python job_defense_shield.py --excessive-time-gpu
Excessive Run Time Limits
-----------------------------------------------------------------------
User GPU-Hours GPU-Hours Ratio Ratio Ratio GPU-Rank Jobs Emails
(Unused) (Used) Overall Mean Median
-----------------------------------------------------------------------
u14480 4088 184 0.04 0.04 0.04 15 60 0
u72284 2055 105 0.05 0.05 0.04 23 45 2 (3)
-----------------------------------------------------------------------
Cluster: della
Partitions: gpu, llm
Start: Sat Mar 15, 2025 at 02:34 PM
End: Sat Mar 22, 2025 at 02:34 PM
Email Message to Users
Below is an example email message (see email/excessive_time.txt
):
Hello Alan (u14480),
Below are 5 of your 60 jobs that ran on della (gpu) in the past 7 days:
JobID Time-Used Time-Allocated Percent-Used GPUs
62776009 01:36:11 2-00:00:00 3% 4
62776014 01:32:11 2-00:00:00 3% 4
62776016 01:22:41 2-00:00:00 3% 4
62776019 01:17:48 2-00:00:00 3% 4
62776020 01:29:20 2-00:00:00 3% 4
It appears that you are requesting too much time for your jobs since you are
only using on average 3% of the allocated time (for the 60 jobs). This has
resulted in 4341 GPU-hours that you scheduled but did not use (it was made
available to other jobs, however).
Please request less time by modifying the --time Slurm directive. This will
lower your queue times and allow the Slurm job scheduler to work more
effectively for all users.
Time-Used is the time (wallclock) that the job needed. The total time allocated
for the job is Time-Allocated. The format is DD-HH:MM:SS where DD is days,
HH is hours, MM is minutes and SS is seconds. Percent-Used is Time-Used
divided by Time-Allocated.
Replying to this automated email will open a support ticket with Research
Computing.
Placeholders
The following placeholders can be used in the email file:
<GREETING>
: Greeting generated bygreeting-method
.<CLUSTER>
: The cluster specified for the alert.<PARTITIONS>
: The partitions listed for the alert.<MODE-UPPER>
: Equal to "GPU".<CASE>
: Helper text relating to jobs.<AVERAGE>
: Mean of per-job GPU-hours used divided by allocated.<DAYS>
: Number of days in the time window (default is 7).<NUM-JOBS>
: Total number of jobs.<NUM-JOBS-DISPLAY>
: Number of jobs to list in the table.<TABLE>
: Table of job data.<UNUSED-HOURS>
: Total number of unused GPU-hours.
Usage
Generate report for system administrators:
Send emails to offending users:
See which users have received emails and when:
Cron
Below is an example crontab
entry: