Jobs with 0% CPU Utilization
This alert identifies jobs with 0% CPU utilization.
The CPU utilization is calculated across all allocated CPU-cores on each node. A job will be included if the CPU utilization is 0% on any of the nodes. This alert is not capable of detecting individual CPU-cores that are idle unless the job only allocates one CPU-core per node.
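The node-level rule above can be sketched in Python. This is only an illustration under stated assumptions, not the tool's actual implementation: the function names and the per-node (used, total) CPU-seconds input are hypothetical.

```python
def node_utilization(used_seconds, total_seconds):
    """Return CPU utilization (percent) across all allocated cores on one node."""
    if total_seconds <= 0:
        return 0.0
    return 100.0 * used_seconds / total_seconds


def unused_nodes(per_node):
    """Return the sorted names of nodes whose utilization is exactly 0%.

    per_node maps a node name to (used CPU-seconds, available CPU-seconds),
    each summed over every CPU-core allocated to the job on that node.
    """
    return sorted(
        name
        for name, (used, total) in per_node.items()
        if node_utilization(used, total) == 0.0
    )


def is_flagged(per_node):
    """A job is included in the alert if any of its nodes is at 0%."""
    return len(unused_nodes(per_node)) > 0


# Hypothetical two-node job: node-b did no work at all.
job = {
    "node-a": (345600.0, 691200.0),
    "node-b": (0.0, 691200.0),
}
print(unused_nodes(job))
print(is_flagged(job))
```

Because the utilization is summed over all allocated cores on a node, a node with a single idle core among many busy ones still reports a non-zero value and is not flagged, which matches the limitation described above.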
Configuration File
Below is an example entry for config.yaml:
zero-cpu-utilization-1:
  cluster: stellar
  partitions:
    - cpu
  min_run_time: 61 # minutes
  email_file: "zero_cpu_utilization.txt"
  admin_emails:
    - admin@institution.edu
The parameters are explained below:
- cluster: Specify the cluster name as it appears in the Slurm database.
- partitions: Specify one or more Slurm partitions. Use "*" to include all partitions (i.e., partitions: ["*"]).
- email_file: The text file to be used for the email message to users.
- cpu_hours_threshold: (Optional) Only users with greater than or equal to this number of CPU-hours at 0% utilization will receive an email. Default: 0
- min_run_time: (Optional) Minimum run time of a job in units of minutes. If min_run_time: 61, then jobs that ran for an hour or less are ignored. Default: 0
- include_running_jobs: (Optional) If True, then jobs in a state of RUNNING will be included in the calculation. The Prometheus server must be queried for each running job, which can be an expensive operation. Default: False
- nodelist: (Optional) Only apply this alert to jobs that ran on the specified nodes. See example.
- excluded_qos: (Optional) List of QOSes to exclude from this alert.
- excluded_partitions: (Optional) List of partitions to exclude from this alert. This is useful when partitions: ["*"] is used.
- excluded_users: (Optional) List of usernames to exclude from receiving emails.
- admin_emails: (Optional) List of administrator email addresses that should receive copies of the emails that are sent to users.
- email_subject: (Optional) Subject of the email message to users.
- report_title: (Optional) Title of the report to system administrators.
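To illustrate how min_run_time and cpu_hours_threshold interact, here is a minimal Python sketch. It is not the tool's code: the function name and the job dictionary keys (user, elapsed_minutes, zero_util_cpu_hours) are hypothetical, and run times are assumed to be whole minutes.

```python
from collections import defaultdict


def users_to_email(jobs, min_run_time=0, cpu_hours_threshold=0):
    """Return {user: CPU-hours at 0% utilization} for users who would be emailed.

    jobs is a list of dicts with the hypothetical keys:
      user                - username
      elapsed_minutes     - job run time in whole minutes
      zero_util_cpu_hours - CPU-hours the job spent at 0% utilization
    """
    totals = defaultdict(float)
    for job in jobs:
        # Jobs shorter than min_run_time are ignored entirely.
        if job["elapsed_minutes"] < min_run_time:
            continue
        totals[job["user"]] += job["zero_util_cpu_hours"]
    # Only users at or above the threshold receive an email.
    return {u: h for u, h in totals.items() if h >= cpu_hours_threshold}


# Hypothetical job records for two users.
jobs = [
    {"user": "u1", "elapsed_minutes": 60, "zero_util_cpu_hours": 100.0},
    {"user": "u1", "elapsed_minutes": 120, "zero_util_cpu_hours": 50.0},
    {"user": "u2", "elapsed_minutes": 90, "zero_util_cpu_hours": 10.0},
]
print(users_to_email(jobs, min_run_time=61, cpu_hours_threshold=20))
```

With min_run_time set to 61, the 60-minute job is dropped before the per-user CPU-hours are summed, so the threshold is applied only to the jobs that remain.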
Report for System Administrators
Below is an example report:
$ job_defense_shield --zero-cpu-utilization
                     Jobs with Zero CPU Utilization                          
---------------------------------------------------------------------------
 JobID    User   Nodes  Nodes-Unused  CPU-Util-Unused  Cores  Hours  Emails
---------------------------------------------------------------------------
1931133  u12345    11         11             0%        1056    48     3 (1)   
1932935  u12345    11         11             0%        1056    48     0   
1932937  u48726     8          4             0%         768     2     0   
1933655  u52209     1          1             0%          96    24     0   
---------------------------------------------------------------------------
   Cluster: stellar
Partitions: cpu, physics
     Start: Wed Mar 12, 2025 at 02:50 PM
       End: Wed Mar 19, 2025 at 02:50 PM
Email Message to Users
Below is an example email (see email/zero_cpu_utilization.txt):
Hello Alan (u12345),
Below are your recent jobs that did not use all of the allocated nodes:
     JobID  Cluster  Nodes  Nodes-Unused CPU-Util-Unused  Cores Hours
    1931133 stellar   11         11             0%        1056   48 
    1932935 stellar   11         11             0%        1056   48 
The CPU utilization was found to be 0% on each of the unused nodes. Please
investigate this issue before running additional jobs.
Replying to this automated email will open a support ticket with Research
Computing.
Placeholders
The following placeholders can be used in the email file:
- <GREETING>: The greeting generated by greeting-method.
- <CLUSTER>: The cluster specified for the alert.
- <PARTITIONS>: A comma-separated list of partitions used by the user.
- <DAYS>: Number of days in the time window (default is 7).
- <NUM-JOBS>: Number of jobs with 0% CPU utilization.
- <TABLE>: Table of job data.
- <JOBSTATS>: The jobstats command for the first job of the user.
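The placeholder mechanism can be pictured as plain text substitution. The Python sketch below is only an assumption about how such a template might be filled in, not the tool's implementation; render_email, the template wording, and the sample values are all hypothetical.

```python
def render_email(template, values):
    """Replace each <NAME> placeholder in template with its value."""
    text = template
    for name, value in values.items():
        text = text.replace(f"<{name}>", str(value))
    return text


# Hypothetical email file contents using the placeholders listed above.
template = (
    "<GREETING>\n\n"
    "In the past <DAYS> days you had <NUM-JOBS> job(s) on <CLUSTER> "
    "(<PARTITIONS>) with 0% CPU utilization:\n\n"
    "<TABLE>\n\n"
    "For details, run:\n\n"
    "    <JOBSTATS>\n"
)

# Illustrative values; the real tool generates these from job data.
values = {
    "GREETING": "Hello Alan (u12345),",
    "CLUSTER": "stellar",
    "PARTITIONS": "cpu",
    "DAYS": 7,
    "NUM-JOBS": 2,
    "TABLE": "  JobID  Cluster  Nodes\n1931133  stellar     11",
    "JOBSTATS": "jobstats 1931133",
}

rendered = render_email(template, values)
print(rendered)
```

A placeholder that does not appear in the email file is simply skipped, so a template only needs to include the ones it uses.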
Usage
Generate a report for system administrators:
Send emails to offending users:
See which users have received emails and when:
cron
Below is an example crontab entry: