Excessive Run Time Limits for GPU Jobs

This alert identifies users who request excessive run time limits for their GPU jobs (e.g., requesting 3 days when only 3 hours are needed).

Configuration File

Below is an example entry for config.yaml:

excessive-time-gpu-1:
  cluster: della
  partitions:
    - gpu
    - llm
  min_run_time:             61  # minutes
  absolute_thres_hours:   1000  # unused gpu-hours
  overall_ratio_threshold: 0.2  # [0.0, 1.0]
  mean_ratio_threshold:    0.2  # [0.0, 1.0]
  median_ratio_threshold:  0.2  # [0.0, 1.0]
  num_top_users:            10  # count
  num_jobs_display:         10  # count
  email_file: "excessive_time.txt"
  admin_emails:
    - admin@institution.edu

The available settings are listed below:

  • cluster: Specify the cluster name as it appears in the Slurm database.

  • partitions: Specify one or more Slurm partitions.

  • absolute_thres_hours: Minimum number of unused GPU-hours for a user to be considered for an email.

  • overall_ratio_threshold: Total used GPU-hours divided by total allocated GPU-hours (see the sketch after this list).

  • email_file: The text file to be used for the email message to users.

  • mean_ratio_threshold: (Optional) Mean of the per-job ratio of used GPU-hours to allocated GPU-hours.

  • median_ratio_threshold: (Optional) Same as above but for the median instead of the mean.

  • min_run_time: (Optional) Minimum run time in minutes for a job to be included in the calculation. For example, if min_run_time: 30 is used then jobs that ran for less than 30 minutes are ignored. Default: 0

  • num_top_users: (Optional) Only consider the top num_top_users users after sorting by unused GPU-hours. Default: 10

  • num_jobs_display: (Optional) Number of jobs to display in the email message to users. Default: 10

  • nodelist: (Optional) Only apply this alert to jobs that ran on the specified nodes. See example.

  • excluded_users: (Optional) List of usernames to exclude from the alert.

  • admin_emails: (Optional) List of administrator email addresses that should receive copies of the emails that are sent to users.

  • email_subject: (Optional) Subject of the email message to users.

  • report_title: (Optional) Title of the report to system administrators.
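
A minimal sketch of how the three ratio settings relate to job data is shown
below. It only illustrates the definitions above (overall, mean, and median
ratios of used to allocated GPU-hours); the function and variable names are
hypothetical and this is not the alert's actual implementation.

from statistics import mean, median

def gpu_hour_ratios(jobs):
    """Compute overall, mean, and median used/allocated GPU-hour ratios.

    jobs is a list of (used_gpu_hours, allocated_gpu_hours) pairs, one per job.
    """
    per_job = [used / alloc for used, alloc in jobs if alloc > 0]
    overall = sum(used for used, _ in jobs) / sum(alloc for _, alloc in jobs)
    return overall, mean(per_job), median(per_job)

# Example: three jobs that each used a small fraction of a 48 GPU-hour allocation.
jobs = [(1.5, 48.0), (1.6, 48.0), (1.4, 48.0)]
overall, mean_ratio, median_ratio = gpu_hour_ratios(jobs)
unused = sum(alloc - used for used, alloc in jobs)
# A user might be flagged when the unused GPU-hours exceed absolute_thres_hours
# and these ratios fall below the configured thresholds (e.g., 0.2).
print(f"overall={overall:.2f} mean={mean_ratio:.2f} "
      f"median={median_ratio:.2f} unused={unused:.1f}")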

Report for System Administrators

Below is an example report:

$ python job_defense_shield.py --excessive-time-gpu

                      Excessive Run Time Limits                        
-----------------------------------------------------------------------
 User   GPU-Hours  GPU-Hours  Ratio  Ratio   Ratio GPU-Rank Jobs Emails
         (Unused)    (Used)  Overall  Mean  Median                     
-----------------------------------------------------------------------
u14480    4088       184      0.04   0.04   0.04     15     60    0   
u72284    2055       105      0.05   0.05   0.04     23     45    2 (3)   
-----------------------------------------------------------------------
   Cluster: della
Partitions: gpu, llm
     Start: Sat Mar 15, 2025 at 02:34 PM
       End: Sat Mar 22, 2025 at 02:34 PM

Email Message to Users

Below is an example email message (see email/excessive_time.txt):

Hello Alan (u14480),

Below are 5 of your 60 jobs that ran on della (gpu) in the past 7 days:

     JobID    Time-Used  Time-Allocated  Percent-Used  GPUs
    62776009   01:36:11    2-00:00:00         3%        4 
    62776014   01:32:11    2-00:00:00         3%        4
    62776016   01:22:41    2-00:00:00         3%        4
    62776019   01:17:48    2-00:00:00         3%        4
    62776020   01:29:20    2-00:00:00         3%        4

It appears that you are requesting too much time for your jobs since you are
only using on average 3% of the allocated time (for the 60 jobs). This has
resulted in 4341 GPU-hours that you scheduled but did not use (it was made
available to other jobs, however).

Please request less time by modifying the --time Slurm directive. This will
lower your queue times and allow the Slurm job scheduler to work more
effectively for all users.

Time-Used is the time (wallclock) that the job needed. The total time allocated
for the job is Time-Allocated. The format is DD-HH:MM:SS where DD is days,
HH is hours, MM is minutes and SS is seconds. Percent-Used is Time-Used
divided by Time-Allocated.

Replying to this automated email will open a support ticket with Research
Computing.
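
The Percent-Used column in the table above is Time-Used divided by
Time-Allocated. Below is a minimal sketch of that calculation, assuming the
elapsed times are given in Slurm's D-HH:MM:SS or HH:MM:SS form; the helper
function is illustrative and not part of the tool.

def slurm_time_to_seconds(t):
    """Convert a Slurm elapsed time such as '2-00:00:00' or '01:36:11' to seconds."""
    days, _, hms = t.partition("-") if "-" in t else ("0", "", t)
    hours, minutes, seconds = (int(x) for x in hms.split(":"))
    return int(days) * 86400 + hours * 3600 + minutes * 60 + seconds

used = slurm_time_to_seconds("01:36:11")         # Time-Used for job 62776009
allocated = slurm_time_to_seconds("2-00:00:00")  # Time-Allocated (--time limit)
print(f"Percent-Used: {100 * used / allocated:.0f}%")  # prints "Percent-Used: 3%"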

Placeholders

The following placeholders can be used in the email file:

  • <GREETING>: Greeting generated by greeting-method.
  • <CLUSTER>: The cluster specified for the alert.
  • <PARTITIONS>: The partitions listed for the alert.
  • <MODE-UPPER>: Equal to "GPU".
  • <CASE>: Helper text relating to jobs.
  • <AVERAGE>: Mean of the per-job ratio of used to allocated GPU-hours.
  • <DAYS>: Number of days in the time window (default is 7).
  • <NUM-JOBS>: Total number of jobs.
  • <NUM-JOBS-DISPLAY>: Number of jobs to list in the table.
  • <TABLE>: Table of job data.
  • <UNUSED-HOURS>: Total number of unused GPU-hours.
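
As an illustration of how the placeholders are filled in, the sketch below
substitutes a few of the tags into a template string using plain string
replacement. The template fragment and values are taken from the example email
above; the actual substitution is performed internally by the alert, so this is
only a model of the behavior.

# Hypothetical stand-in for a fragment of email/excessive_time.txt.
template = (
    "<GREETING>\n\n"
    "Below are <NUM-JOBS-DISPLAY> of your <NUM-JOBS> jobs that ran on "
    "<CLUSTER> (<PARTITIONS>) in the past <DAYS> days:\n\n<TABLE>"
)
values = {
    "<GREETING>": "Hello Alan (u14480),",
    "<NUM-JOBS-DISPLAY>": "5",
    "<NUM-JOBS>": "60",
    "<CLUSTER>": "della",
    "<PARTITIONS>": "gpu",
    "<DAYS>": "7",
    "<TABLE>": "JobID    Time-Used  Time-Allocated  Percent-Used  GPUs ...",
}
message = template
for tag, value in values.items():
    message = message.replace(tag, value)
print(message)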

Usage

Generate report for system administrators:

$ python job_defense_shield.py --excessive-time-gpu

Send emails to offending users:

$ python job_defense_shield.py --excessive-time-gpu --email

See which users have received emails and when:

$ python job_defense_shield.py --excessive-time-gpu --check

Cron

Below is an example crontab entry:

0 9 * * 1-5 /path/to/python path/to/job_defense_shield.py --excessive-time-gpu --email -M della -r gpu,llm > /path/to/log/excessive_time_gpu.log 2>&1