Automatically Cancel GPU Jobs at 0% Utilization

This alert automatically cancels GPU jobs that are found to be running at 0% GPU utilization.

Elevated Privileges

This alert differs from the others in that it must be run as a user with sufficient privileges to call scancel.

The software can be configured to cancel jobs based on GPU utilization during the first N minutes of a job (see cancel_minutes) and/or during the last N minutes (see sliding_cancel_minutes). Warning emails can be sent to the users before cancellation.

The ability to automatically cancel GPU jobs is one of the most popular features of Jobstats.

Configuration File

Below is an example entry for the configuration file:

cancel-zero-gpu-jobs-1:
  cluster: della
  partitions:
    - gpu
    - llm
  sampling_period_minutes: 15  # minutes
  first_warning_minutes:   60  # minutes
  second_warning_minutes: 105  # minutes
  cancel_minutes:         120  # minutes
  email_file_first_warning:  "cancel_gpu_jobs_warning_1.txt"
  email_file_second_warning: "cancel_gpu_jobs_warning_2.txt"
  email_file_cancel:         "cancel_gpu_jobs_scancel_3.txt"
  sliding_warning_minutes: 240  # minutes
  sliding_cancel_minutes:  300  # minutes
  email_file_sliding_warning: "cancel_gpu_jobs_sliding_warning.txt"
  email_file_sliding_cancel:  "cancel_gpu_jobs_sliding_cancel.txt"
  jobid_cache_path: /path/to/writable/directory/
  admin_emails:
    - admin@institution.edu
  excluded_users:
    - u12345
    - u23456

The settings are explained below:

  • cluster: Specify the cluster name as it appears in the Slurm database.

  • partitions: Specify one or more Slurm partitions.

  • sampling_period_minutes: Number of minutes between executions of this alert. This value must equal the interval between cron jobs for this alert (see the cron section below); 15 minutes is a reasonable choice.

  • first_warning_minutes: (Optional) Send a warning email for 0% GPU utilization after this number of minutes from the start of the job.

  • second_warning_minutes: (Optional) Send a second warning email for 0% GPU utilization after this number of minutes from the start of the job.

  • cancel_minutes: (Optional/Required) Cancel jobs with 0% GPU utilization after this number of minutes from the start of the job. One must set cancel_minutes and/or sliding_cancel_minutes.

  • email_file_first_warning: (Optional) File to be used for the first warning email when cancel_minutes is set.

  • email_file_second_warning: (Optional) File to be used for the second warning email when cancel_minutes is set.

  • email_file_cancel: (Optional/Required) File to be used for the cancellation email. If cancel_minutes is set then this file is required.

  • sliding_warning_minutes: (Optional/Required) Send a warning email for jobs found with 0% GPU utilization during a sliding time window of this number of minutes. This setting is required if sliding_cancel_minutes is used. If cancel_minutes and sliding_cancel_minutes are both used then a job must run for max(cancel_minutes + sampling_period_minutes, sliding_warning_minutes) before a warning can be sent using the sliding window approach. After the warning is sent, the job can be cancelled sliding_cancel_minutes minus sliding_warning_minutes later (see the worked example after this list). See warning_frac to learn how to send warnings and cancel jobs as soon as possible.

  • sliding_cancel_minutes: (Optional/Required) Cancel jobs found with 0% GPU utilization during a sliding time window of this number of minutes. This setting uses a sliding time window, whereas cancel_minutes uses a fixed window measured from the start of the job. If cancel_minutes is also set then a job must run for at least cancel_minutes plus sampling_period_minutes plus the difference between sliding_cancel_minutes and sliding_warning_minutes before it can be cancelled by the sliding window approach. Users are guaranteed to receive a warning email before cancellation. One must set cancel_minutes and/or sliding_cancel_minutes.

  • email_file_sliding_warning: (Optional/Required) File to be used for the warning email. If sliding_cancel_minutes is set then this setting and sliding_warning_minutes are required.

  • email_file_sliding_cancel: (Optional/Required) File to be used for the cancellation email. This setting is required if sliding_cancel_minutes is set.

  • email_subject: (Optional) Subject of the email message to users.

  • jobid_cache_path: (Optional/Required) Path to a writable directory to store hidden cache files. Caching decreases the load on the Prometheus server. This setting is required if sliding_cancel_minutes is set. Use ls -a to see the hidden files.

  • max_interactive_hours: (Optional) An interactive job will not be cancelled if the run time limit is less than or equal to max_interactive_hours and the number of allocated GPUs is less than or equal to max_interactive_gpus. Remove these lines if interactive jobs should not receive special treatment. An interactive job is one with a jobname that starts with either interactive or sys/dashboard. If max_interactive_hours is specified then max_interactive_gpus is required.

  • max_interactive_gpus: (Optional) See max_interactive_hours above.

  • gpu_frac_threshold: (Optional) For a given job, let g be the ratio of the number of GPUs with non-zero utilization to the number of allocated GPUs. Jobs with g greater than or equal to gpu_frac_threshold will be excluded. For example, if a job uses 7 of the 8 allocated GPUs and gpu_frac_threshold is 0.8 then it will be excluded from cancellation since 7/8 > 0.8. This quantity varies between 0 and 1. Default: 1.0

  • fraction_of_period: (Optional) Fraction of the sampling period that can be used for querying the Prometheus server. The sampling period (sampling_period_minutes) is the time between cron jobs for this alert. This setting limits the amount of time spent querying the server so that the code finishes before the next cron job. This quantity varies between 0 and 1. If output such as "INFO: Only cached 42 of 100 jobs. Will try again on next call." is seen repeatedly (excluding the starting period) then consider increasing fraction_of_period and/or sampling_period_minutes. If there are multiple entries for this alert then use a maximum value for this setting of less than 0.75 divided by the number of entries. Default: 0.5

  • warning_frac: (Optional) Fraction of sliding_warning_minutes that must pass before a job that was previously found to be using the GPUs will be re-examined for idle GPUs. This quantity varies between 0 and 1. The default value of 1.0 minimizes the load on the Prometheus server but it can allow jobs with idle GPUs to run for longer than necessary. To cancel jobs sooner, at the expense of more calls to Prometheus, use a smaller value such as 0.25 or 0.5. If warning_frac: 0.5 and sliding_warning_minutes: 240 then jobs that have been found to be using the GPU(s) at least once in the last 240 minutes will be checked again 120 minutes later. The product of warning_frac and sliding_warning_minutes should be much greater than sampling_period_minutes. Default: 1.0

  • nodelist: (Optional) Only apply this alert to jobs that ran on the specified nodes. This setting is only available for cancel_minutes. It will not work with sliding_cancel_minutes. See example.

  • excluded_users: (Optional) List of usernames to exclude from this alert.

  • do_not_cancel: (Optional) If True then scancel will not be called. This is useful for testing only. In this case, one should call the alert with --email --no-emails-to-users. Default: False

  • warnings_to_admin: (Optional) If True then warning emails (in addition to cancellation emails) will be sent to admin_emails. This is useful for testing. Default: False

  • admin_emails: (Optional) List of administrator email addresses that should receive the warning and cancellation emails that are sent to users.
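
To make the timing of the two approaches concrete, here is a worked example using the values from the configuration file above (sampling_period_minutes: 15, cancel_minutes: 120, sliding_warning_minutes: 240, sliding_cancel_minutes: 300):

  • Earliest sliding-window warning: max(cancel_minutes + sampling_period_minutes, sliding_warning_minutes) = max(120 + 15, 240) = 240 minutes.

  • Earliest sliding-window cancellation: the warning time plus (sliding_cancel_minutes − sliding_warning_minutes) = 240 + (300 − 240) = 300 minutes.

A job that is idle from the very start would instead be cancelled by the fixed window after roughly cancel_minutes (120 minutes), so the sliding window mainly catches jobs that use the GPUs at first and then go idle.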

Times are Not Exact

Jobs are not cancelled after exactly cancel_minutes or sliding_cancel_minutes since Slurm jobs can start at any time and the alert only runs every N minutes via cron or another scheduler. In the worst case, a job is cancelled up to roughly sampling_period_minutes after the configured threshold has been crossed. The same is true for warning emails.

In Jobstats, a GPU is said to have 0% utilization if all of the measurements made by the NVIDIA exporter over a given time window are zero. Measurements are typically made every 30 seconds or so; for the actual value at your institution, see SAMPLING_PERIOD in config.py for Jobstats.
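
The rule can be sketched in a few lines of Python (a minimal illustration of the definition above, not the actual Jobstats implementation; the sample values are hypothetical):

SAMPLING_PERIOD = 30   # seconds between exporter measurements (see config.py)
window_minutes = 120   # e.g., cancel_minutes

# One utilization reading per GPU per sampling period over the window:
num_samples = window_minutes * 60 // SAMPLING_PERIOD   # 240 readings

# Hypothetical readings: the GPU was idle for the entire window.
readings = [0.0] * num_samples

# A GPU has 0% utilization only if every measurement in the window is zero.
zero_utilization = all(r == 0 for r in readings)
print(zero_utilization)   # True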

Example Configurations

Jobs with GPUs that are idle for the first 2 hours (120 minutes) will be cancelled:

cancel-zero-gpu-jobs-1:
  cluster: della
  partitions:
    - gpu
    - llm
  sampling_period_minutes: 15  # minutes
  first_warning_minutes:   60  # minutes
  second_warning_minutes: 105  # minutes
  cancel_minutes:         120  # minutes
  email_file_first_warning:  "cancel_gpu_jobs_warning_1.txt"
  email_file_second_warning: "cancel_gpu_jobs_warning_2.txt"
  email_file_cancel:         "cancel_gpu_jobs_scancel_3.txt"
  jobid_cache_path: /path/to/writable/directory/
  admin_emails:
    - admin@institution.edu

The configuration above will send warning emails after 60 and 105 minutes.

The example below is the same as that above except only one warning email will be sent:

cancel-zero-gpu-jobs-1:
  cluster: della
  partitions:
    - gpu
    - llm
  sampling_period_minutes: 15  # minutes
  first_warning_minutes:   60  # minutes
  cancel_minutes:         120  # minutes
  email_file_first_warning: "cancel_gpu_jobs_warning_1.txt"
  email_file_cancel:        "cancel_gpu_jobs_scancel_3.txt"
  jobid_cache_path: /path/to/writable/directory/
  admin_emails:
    - admin@institution.edu

Cancel jobs that are found to have 0% GPU utilization over any time period of 5 hours (300 minutes):

cancel-zero-gpu-jobs-1:
  cluster: della
  partitions:
    - gpu
    - llm
  sampling_period_minutes:  15  # minutes
  sliding_warning_minutes: 240  # minutes
  sliding_cancel_minutes:  300  # minutes
  email_file_sliding_warning: "cancel_gpu_jobs_sliding_warning.txt"
  email_file_sliding_cancel:  "cancel_gpu_jobs_sliding_cancel.txt"
  jobid_cache_path: /path/to/writable/directory/
  admin_emails:
    - admin@institution.edu

For the entry above, the user would receive a warning email after 4 hours (240 minutes). If the GPUs remain idle, the job would be cancelled 1 hour later (300 − 240 minutes).

The example that follows uses both cancellation methods. Jobs with GPUs that are idle for the first 2 hours (120 minutes) will be cancelled, as will jobs whose GPUs are idle during any 5-hour (300-minute) window.

cancel-zero-gpu-jobs-1:
  cluster: della
  partitions:
    - gpu
    - llm
  sampling_period_minutes: 15  # minutes
  first_warning_minutes:   60  # minutes
  second_warning_minutes: 105  # minutes
  cancel_minutes:         120  # minutes
  email_file_first_warning:  "cancel_gpu_jobs_warning_1.txt"
  email_file_second_warning: "cancel_gpu_jobs_warning_2.txt"
  email_file_cancel:         "cancel_gpu_jobs_scancel_3.txt"
  sliding_warning_minutes: 240  # minutes
  sliding_cancel_minutes:  300  # minutes
  email_file_sliding_warning: "cancel_gpu_jobs_sliding_warning.txt"
  email_file_sliding_cancel:  "cancel_gpu_jobs_sliding_cancel.txt"
  jobid_cache_path: /path/to/writable/directory/
  admin_emails:
    - admin@institution.edu

Testing

For testing, be sure to use:

  do_not_cancel: True
  warnings_to_admin: True

Additionally, add the --no-emails-to-users flag:

$ job_defense_shield --cancel-zero-gpu-jobs --email --no-emails-to-users

Learn more about email testing.

First Warning Email (Fixed Window at Start of Job)

Below is an example email for the first warning (see email/cancel_gpu_jobs_warning_1.txt):

Hi Alan (aturing),

You have GPU job(s) that have been running for nearly 1 hour but appear to not
be using the GPU(s):

     JobID    Cluster Partition  GPUs-Allocated  GPUs-Unused GPU-Util  Hours
   60131148    della     gpu            4             4         0%       1  

Your jobs will be AUTOMATICALLY CANCELLED if they are found to not be using the
GPUs for 2 hours.

Please consider cancelling the job(s) listed above by using the "scancel"
command:

   $ scancel 60131148

Replying to this automated email will open a support ticket with Research
Computing.

Placeholders

The following placeholders can be used in the email file:

  • <GREETING>: The greeting generated by greeting-method.
  • <CLUSTER>: The cluster specified for the alert.
  • <PARTITIONS>: The partitions listed for the alert.
  • <SAMPLING>: The sampling period in minutes (sampling_period_minutes).
  • <MINUTES-1ST>: Number of minutes before the first warning is sent (first_warning_minutes).
  • <HOURS-1ST>: Number of hours before the first warning is sent.
  • <CANCEL-MIN>: Number of minutes a job must run for before being cancelled (cancel_minutes).
  • <CANCEL-HRS>: Number of hours a job must run for before being cancelled.
  • <TABLE>: Table of job data.
  • <JOBSTATS>: jobstats command for the first JobID ($ jobstats 12345678).
  • <SCANCEL>: scancel command for the first JobID ($ scancel 12345678).
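
For illustration, a minimal template in the spirit of email/cancel_gpu_jobs_warning_1.txt could be written with the placeholders above (a sketch; the file distributed with the software may differ):

<GREETING>

You have GPU job(s) that have been running for nearly <HOURS-1ST> hour(s) but
appear to not be using the GPU(s):

<TABLE>

Your jobs will be AUTOMATICALLY CANCELLED if they are found to not be using the
GPUs for <CANCEL-HRS> hours.

Please consider cancelling the job(s) listed above by using the "scancel"
command:

   <SCANCEL>

Replying to this automated email will open a support ticket with Research
Computing.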

Second Warning Email (Fixed Window at Start of Job)

Below is an example email for the second warning (see email/cancel_gpu_jobs_warning_2.txt):

Hi Alan (aturing),

This is a second warning. The jobs below will be cancelled in about 15 minutes
unless GPU activity is detected:

     JobID    Cluster Partition  GPUs-Allocated  GPUs-Unused GPU-Util  Hours
   60131148    della     gpu            4             4         0%      1.6  

Replying to this automated email will open a support ticket with Research
Computing.

Placeholders

The following placeholders can be used in the email file:

  • <GREETING>: The greeting generated by greeting-method.
  • <CLUSTER>: The cluster specified for the alert.
  • <PARTITIONS>: The partitions listed for the alert.
  • <SAMPLING>: The sampling period in minutes (sampling_period_minutes).
  • <MINUTES-1ST>: Number of minutes before the first warning is sent (first_warning_minutes).
  • <MINUTES-2ND>: Number of minutes before the second warning is sent (second_warning_minutes).
  • <CANCEL-MIN>: Number of minutes a job must run for before being cancelled (cancel_minutes).
  • <CANCEL-HRS>: Number of hours a job must run for before being cancelled.
  • <TABLE>: Table of job data.
  • <JOBSTATS>: jobstats command for the first JobID ($ jobstats 12345678).
  • <SCANCEL>: scancel command for the first JobID ($ scancel 12345678).

Cancellation Email (Fixed Window at Start of Job)

Below is an example email (see email/cancel_gpu_jobs_scancel_3.txt):

Hi Alan (aturing),

The jobs below have been cancelled because they ran for more than 2 hours at
0% GPU utilization:

     JobID   Cluster  Partition    State    GPUs-Allocated GPU-Util  Hours
   60131148   della      gpu     CANCELLED         4          0%      2.1

See our GPU Computing webpage for three common reasons for encountering zero GPU
utilization:

    https://your-institution.edu/knowledge-base/gpu-computing

Replying to this automated email will open a support ticket with Research
Computing.

Placeholders

The following placeholders can be used in the email file:

  • <GREETING>: The greeting generated by greeting-method.
  • <CLUSTER>: The cluster specified for the alert.
  • <PARTITIONS>: The partitions listed for the alert.
  • <SAMPLING>: The sampling period in minutes (sampling_period_minutes).
  • <CANCEL-MIN>: Number of minutes a job must run for before being cancelled (cancel_minutes).
  • <CANCEL-HRS>: Number of hours a job must run for before being cancelled (cancel_minutes/60).
  • <TABLE>: Table of job data.
  • <JOBSTATS>: jobstats command for the first JobID ($ jobstats 12345678).
  • <SCANCEL>: scancel command for the first JobID ($ scancel 12345678).

Warning Email (Sliding Window Over Last N Minutes)

Below is an example email for the warning (see email/cancel_gpu_jobs_sliding_warning.txt):

Hi Alan (aturing),

You have job(s) that have not used the GPU(s) for the last 3 hours:

     JobID   Cluster  Partition  State   GPUs GPUs-Unused GPU-Util  Hours
   60131148   della      gpu    RUNNING    4       4         0%       3

Your jobs will be AUTOMATICALLY CANCELLED if they are found to not be using the
GPUs for a period of 4 hours.

Please consider cancelling the job(s) listed above by using the "scancel"
command:

   $ scancel 60131148

Replying to this automated email will open a support ticket with Research
Computing.

Placeholders

The following placeholders can be used in the email file:

  • <GREETING>: The greeting generated by greeting-method.
  • <CLUSTER>: The cluster specified for the alert.
  • <PARTITIONS>: The partitions listed for the alert.
  • <SAMPLING>: The sampling period in minutes (sampling_period_minutes).
  • <WARNING-MIN>: Number of minutes before the warning email is sent (sliding_warning_minutes).
  • <WARNING-HRS>: Number of hours before the warning email is sent (sliding_warning_minutes/60).
  • <CANCEL-MIN>: Time period in minutes with 0% GPU utilization for a job to be cancelled (sliding_cancel_minutes).
  • <CANCEL-HRS>: Time period in hours with 0% GPU utilization for a job to be cancelled (sliding_cancel_minutes/60).
  • <TABLE>: Table of job data.
  • <JOBSTATS>: jobstats command for the first JobID ($ jobstats 12345678).
  • <SCANCEL>: scancel command for the first JobID ($ scancel 12345678).

Cancellation Email (Sliding Window Over Last N Minutes)

Below is an example email (see email/cancel_gpu_jobs_sliding_cancel.txt):

Hi Alan (aturing),

The jobs below have been cancelled because they did not use the GPU(s) for the
last 4 hours:

     JobID  Cluster  Partition   State    GPUs  GPUs-Unused  GPU-Util  Hours
   60131148  della      gpu    CANCELLED    4        4          0%       4

See our GPU Computing webpage for three common reasons for encountering zero GPU
utilization:

    https://your-institution.edu/knowledge-base/gpu-computing

Replying to this automated email will open a support ticket with Research
Computing.

Placeholders

The following placeholders can be used in the email file:

  • <GREETING>: The greeting generated by greeting-method.
  • <CLUSTER>: The cluster specified for the alert (i.e., cluster).
  • <CANCEL-MIN>: Time period in minutes with 0% GPU utilization for a job to be cancelled (sliding_cancel_minutes).
  • <CANCEL-HRS>: Time period in hours with 0% GPU utilization for a job to be cancelled (sliding_cancel_minutes/60).
  • <TABLE>: Table of job data.
  • <JOBSTATS>: jobstats command for the first JobID ($ jobstats 12345678).
  • <SCANCEL>: scancel command for the first JobID ($ scancel 12345678).

cron

Below is an example entry for crontab:

*/15 * * * * /path/to/job_defense_shield --cancel-zero-gpu-jobs --email -M della -r gpu > /path/to/log/zero_gpu_utilization.log 2>&1

Note that the alert runs every 15 minutes; this interval must match the value of sampling_period_minutes.

Report

There is no report for this alert. To find out which users have the most GPU-hours at 0% utilization, see this alert. If you are automatically cancelling GPU jobs then no users should be able to waste significant resources.

Other Projects

One can also automatically cancel GPU jobs using the HPC Dashboard by Arizona State University.