Automatically Cancel GPU Jobs at 0% Utilization
This is one of the most popular features of Jobstats. This alert automatically cancels jobs whose allocated GPUs are at 0% utilization. Up to two warning emails can be sent before each job is cancelled.
Elevated Privileges
This alert is different from the others in that it must be run as a user with sufficient privileges to call `scancel`.
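In practice this usually means running the alert as root, e.g., from root's crontab as shown in the cron section below. Another possible arrangement, if a dedicated non-root service account is preferred, is to grant that account Slurm operator privileges so that it can cancel other users' jobs (the account name `jds` below is hypothetical):

```
$ sacctmgr modify user name=jds set adminlevel=Operator
```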
Configuration File
Below is an example entry for the configuration file:
```yaml
cancel-zero-gpu-jobs-1:
  cluster:
    - della
  partitions:
    - gpu
    - llm
  sampling_period_minutes: 15    # minutes
  first_warning_minutes: 60      # minutes
  second_warning_minutes: 105    # minutes
  cancel_minutes: 120            # minutes
  email_file_first_warning: "cancel_gpu_jobs_warning_1.txt"
  email_file_second_warning: "cancel_gpu_jobs_warning_2.txt"
  email_file_cancel: "cancel_gpu_jobs_scancel_3.txt"
  jobid_cache_path: /path/to/writable/directory/
  max_interactive_hours: 8
  max_interactive_gpus: 1
  do_not_cancel: False
  warnings_to_admin: True
  admin_emails:
    - admin@institution.edu
  excluded_users:
    - u12345
    - u23456
```
The settings are explained below:
- `cluster`: Specify the cluster name as it appears in the Slurm database. Multiple clusters can be specified per alert.
- `partitions`: Specify one or more Slurm partitions.
- `sampling_period_minutes`: Number of minutes between executions of this alert. This number must also be used as the interval in cron (see the cron section below) or whichever scheduler is used.
- `first_warning_minutes`: (Optional) Number of minutes that the job must run before the first warning email can be sent.
- `second_warning_minutes`: (Optional) Number of minutes that the job must run before the second warning email can be sent.
- `cancel_minutes`: (Required) Number of minutes that the job must run before it can be cancelled.
- `email_file_first_warning`: (Optional) File to be used for the first warning email.
- `email_file_second_warning`: (Optional) File to be used for the second warning email.
- `email_file_cancel`: (Required) File to be used for the cancellation email.
- `email_subject`: (Optional) Subject of the email message to users.
- `jobid_cache_path`: (Optional) Path to a writable directory where a cache file containing the `jobid` of each job known to be using the GPUs is stored. This is a binary file with the name `.jobid_cache.pkl`. Including this setting will eliminate redundant calls to the Prometheus server.
- `max_interactive_hours`: (Optional) An interactive job will only be cancelled if its run time limit is greater than `max_interactive_hours` and the number of allocated GPUs is less than or equal to `max_interactive_gpus`. Remove these lines if interactive jobs should not receive special treatment. An interactive job is one with a `jobname` that starts with either `interactive` or `sys/dashboard`.
- `max_interactive_gpus`: (Optional) See `max_interactive_hours` above.
- `gpu_frac_threshold`: (Optional) For a given job, let g be the ratio of the number of GPUs with non-zero utilization to the number of allocated GPUs. Jobs with g greater than or equal to `gpu_frac_threshold` will be excluded. For example, if a job uses 7 of its 8 allocated GPUs and `gpu_frac_threshold` is 0.8, then the job will be excluded from cancellation since 7/8 = 0.875 >= 0.8 (see the sketch after this list). Default: 1.0
- `nodelist`: (Optional) Only apply this alert to jobs that ran on the specified nodes (see the example under Example Configurations below).
- `excluded_users`: (Optional) List of usernames to exclude from this alert.
- `do_not_cancel`: (Optional) If `True` then `scancel` will not be called. This is useful for testing only. In this case, one should call the alert with `--email --no-emails-to-users`. Default: `False`
- `warnings_to_admin`: (Optional) If `False` then warning emails will not be sent to `admin_emails`; only cancellation emails will be sent. Default: `True`
- `admin_emails`: (Optional) List of administrator email addresses that should receive the warning and cancellation emails that are sent to users.
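The `gpu_frac_threshold` rule can be summarized in a few lines of Python. This is an illustrative sketch with a hypothetical function name, not the actual Jobstats code:

```python
def excluded_by_gpu_frac(gpus_allocated: int,
                         gpus_with_nonzero_util: int,
                         gpu_frac_threshold: float = 1.0) -> bool:
    """Return True if the job is excluded from cancellation."""
    # g is the fraction of allocated GPUs showing non-zero utilization
    g = gpus_with_nonzero_util / gpus_allocated
    return g >= gpu_frac_threshold

# A job using 7 of its 8 allocated GPUs, with gpu_frac_threshold = 0.8:
print(excluded_by_gpu_frac(8, 7, 0.8))  # True, since 7/8 = 0.875 >= 0.8
```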
Note that jobs are not cancelled after exactly `cancel_minutes` minutes, since the alert is only called every N minutes via cron or another scheduler. The same is true for warning emails.
In Jobstats, a GPU is said to have 0% utilization if all of the measurements made by the NVIDIA exporter are zero. Measurements are typically made every 30 seconds or so; for the actual value, see `SAMPLING_PERIOD` in `config.py` for Jobstats.
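Concretely, the zero-utilization test for a single GPU amounts to checking that every sample in the window is zero. Again an illustrative sketch with a hypothetical function name:

```python
def gpu_at_zero_utilization(samples: list[float]) -> bool:
    """True if the NVIDIA exporter reported 0% for every sample."""
    return bool(samples) and all(s == 0 for s in samples)

print(gpu_at_zero_utilization([0.0, 0.0, 0.0]))   # True
print(gpu_at_zero_utilization([0.0, 35.0, 0.0]))  # False
```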
Example Configurations
The example below will send one warning email and then cancel the jobs:
```yaml
cancel-zero-gpu-jobs-1:
  cluster:
    - della
  partitions:
    - gpu
    - llm
  sampling_period_minutes: 15  # minutes
  first_warning_minutes: 60    # minutes
  cancel_minutes: 120          # minutes
  email_file_first_warning: "cancel_gpu_jobs_warning_1.txt"
  email_file_cancel: "cancel_gpu_jobs_scancel_3.txt"
  jobid_cache_path: /path/to/writable/directory/
  admin_emails:
    - admin@institution.edu
```
The example below sends no warning emails; jobs are simply cancelled after 120 minutes:
```yaml
cancel-zero-gpu-jobs-1:
  cluster:
    - della
  partitions:
    - gpu
    - llm
  sampling_period_minutes: 15  # minutes
  cancel_minutes: 120          # minutes
  email_file_cancel: "cancel_gpu_jobs_scancel_3.txt"
  jobid_cache_path: /path/to/writable/directory/
  admin_emails:
    - admin@institution.edu
```
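The `nodelist` setting restricts the alert to jobs that ran on specific nodes, as mentioned in the settings list above. Below is a sketch with hypothetical node names (the remaining settings follow the examples above):

```yaml
cancel-zero-gpu-jobs-1:
  cluster:
    - della
  partitions:
    - gpu
  sampling_period_minutes: 15  # minutes
  cancel_minutes: 120          # minutes
  email_file_cancel: "cancel_gpu_jobs_scancel_3.txt"
  nodelist:
    - della-gpu-01
    - della-gpu-02
  admin_emails:
    - admin@institution.edu
```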
Testing
For testing, be sure to use:
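```yaml
do_not_cancel: True
```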
Additionally, add the `--no-emails-to-users` flag:
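For example, adapting the command shown in the cron section below:

```
$ /path/to/python /path/to/job_defense_shield.py --cancel-zero-gpu-jobs --email --no-emails-to-users -M della -r gpu
```

With `do_not_cancel: True` and `--no-emails-to-users`, no jobs are cancelled and no users are emailed; only `admin_emails` receive the messages.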
Learn more about email testing.
First Warning Email
Below is an example email for the first warning (see `email/cancel_gpu_jobs_warning_1.txt`):
```
Hi Alan (aturing),

You have GPU job(s) that have been running for nearly 1 hour but appear to not
be using the GPU(s):

     JobID  Cluster  Partition  GPUs-Allocated  GPUs-Unused  GPU-Util  Hours
  60131148    della        gpu               4            4        0%      1

Your jobs will be AUTOMATICALLY CANCELLED if they are found to not be using the
GPUs for 2 hours.

Please consider cancelling the job(s) listed above by using the "scancel"
command:

     $ scancel 60131148

Replying to this automated email will open a support ticket with Research
Computing.
```
Placeholders
The following placeholders can be used in the email file:
- `<GREETING>`: The greeting generated by `greeting-method`.
- `<CLUSTER>`: The cluster specified for the alert (i.e., `cluster`).
- `<PARTITIONS>`: The partitions listed for the alert (i.e., `partitions`).
- `<SAMPLING>`: The sampling period in minutes (`sampling_period_minutes`).
- `<MINUTES-1ST>`: Number of minutes before the first warning is sent (`first_warning_minutes`).
- `<HOURS-1ST>`: Number of hours before the first warning is sent.
- `<CANCEL-MIN>`: Number of minutes a job must run for before being cancelled (`cancel_minutes`).
- `<CANCEL-HRS>`: Number of hours a job must run for before being cancelled.
- `<TABLE>`: Table of job data.
- `<JOBSTATS>`: `jobstats` command for the first JobID (`$ jobstats 12345678`).
- `<SCANCEL>`: `scancel` command for the first JobID (`$ scancel 12345678`).
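For illustration, a first-warning template built from these placeholders could look like the following. This is a reconstruction based on the rendered email above, not necessarily the shipped `email/cancel_gpu_jobs_warning_1.txt`:

```
<GREETING>

You have GPU job(s) that have been running for nearly <HOURS-1ST> hour(s) but
appear to not be using the GPU(s):

<TABLE>

Your jobs will be AUTOMATICALLY CANCELLED if they are found to not be using the
GPUs for <CANCEL-HRS> hours.

Please consider cancelling the job(s) listed above by using the "scancel"
command:

     <SCANCEL>

Replying to this automated email will open a support ticket with Research
Computing.
```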
Second Warning Email
Below is an example email for the second warning (see `email/cancel_gpu_jobs_warning_2.txt`):
```
Hi Alan (aturing),

This is a second warning. The jobs below will be cancelled in about 15 minutes
unless GPU activity is detected:

     JobID  Cluster  Partition  GPUs-Allocated  GPUs-Unused  GPU-Util  Hours
  60131148    della        gpu               4            4        0%    1.6

Replying to this automated email will open a support ticket with Research
Computing.
```
Placeholders
The following placeholders can be used in the email file:
- `<GREETING>`: The greeting generated by `greeting-method`.
- `<CLUSTER>`: The cluster specified for the alert (i.e., `cluster`).
- `<PARTITIONS>`: The partitions listed for the alert (i.e., `partitions`).
- `<SAMPLING>`: The sampling period in minutes (`sampling_period_minutes`).
- `<MINUTES-1ST>`: Number of minutes before the first warning is sent (`first_warning_minutes`).
- `<MINUTES-2ND>`: Number of minutes before the second warning is sent (`second_warning_minutes`).
- `<CANCEL-MIN>`: Number of minutes a job must run for before being cancelled (`cancel_minutes`).
- `<CANCEL-HRS>`: Number of hours a job must run for before being cancelled.
- `<TABLE>`: Table of job data.
- `<JOBSTATS>`: `jobstats` command for the first JobID (`$ jobstats 12345678`).
- `<SCANCEL>`: `scancel` command for the first JobID (`$ scancel 12345678`).
Cancellation Email
Below is an example email (see `email/cancel_gpu_jobs_scancel_3.txt`):
```
Hi Alan (aturing),

The jobs below have been cancelled because they ran for more than 2 hours at
0% GPU utilization:

     JobID  Cluster  Partition      State  GPUs-Allocated  GPU-Util  Hours
  60131148    della        gpu  CANCELLED               4        0%    2.1

See our GPU Computing webpage for three common reasons for encountering zero GPU
utilization:

  https://your-institution.edu/knowledge-base/gpu-computing

Replying to this automated email will open a support ticket with Research
Computing.
```
Placeholders
The following placeholders can be used in the email file:
- `<GREETING>`: The greeting generated by `greeting-method`.
- `<CLUSTER>`: The cluster specified for the alert (i.e., `cluster`).
- `<PARTITIONS>`: The partitions listed for the alert (i.e., `partitions`).
- `<SAMPLING>`: The sampling period in minutes (`sampling_period_minutes`).
- `<CANCEL-MIN>`: Number of minutes a job must run for before being cancelled (`cancel_minutes`).
- `<CANCEL-HRS>`: Number of hours a job must run for before being cancelled.
- `<TABLE>`: Table of job data.
- `<JOBSTATS>`: `jobstats` command for the first JobID (`$ jobstats 12345678`).
- `<SCANCEL>`: `scancel` command for the first JobID (`$ scancel 12345678`).
cron
Below is an example crontab for this alert:
```
*/15 * * * * /path/to/python /path/to/job_defense_shield.py --cancel-zero-gpu-jobs --email -M della -r gpu > /path/to/log/zero_gpu_utilization.log 2>&1
```
Note that the alert is run every 15 minutes. This must also be the value of `sampling_period_minutes`.
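If your site schedules tasks with systemd timers instead of cron (an assumption; the unit names below are hypothetical), an equivalent setup might look like:

```
# /etc/systemd/system/cancel-zero-gpu-jobs.service
[Unit]
Description=Cancel GPU jobs at 0% utilization

[Service]
Type=oneshot
ExecStart=/path/to/python /path/to/job_defense_shield.py --cancel-zero-gpu-jobs --email -M della -r gpu
```

```
# /etc/systemd/system/cancel-zero-gpu-jobs.timer
[Unit]
Description=Run the zero-GPU-utilization alert every 15 minutes

[Timer]
OnCalendar=*:0/15

[Install]
WantedBy=timers.target
```

Enable with `systemctl enable --now cancel-zero-gpu-jobs.timer`.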
Report
There is no report for this alert. To find out which users have the most GPU-hours at 0% utilization, see this alert. If you are automatically cancelling GPU jobs then no users should be able to waste significant resources.
Other Projects
One can also automatically cancel GPU jobs using the HPC Dashboard by Arizona State University.