
Jobs with 0% CPU Utilization

This alert identifies jobs with 0% CPU utilization.

CPU utilization is calculated per node, across all of the CPU-cores allocated to the job on that node. A job is included if the CPU utilization is 0% on at least one of its nodes. Because the calculation is per node, this alert cannot detect individual idle CPU-cores unless the job allocates only one CPU-core per node.
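The per-node rule can be illustrated with a minimal Python sketch. This is not the tool's implementation: the function names and the input layout (a dictionary mapping each node name to a list of per-core utilization percentages) are assumptions made for illustration only.

```python
from typing import Dict, List


def nodes_with_zero_cpu(node_core_util: Dict[str, List[float]]) -> List[str]:
    """Return the nodes whose utilization over all allocated cores is 0%.

    node_core_util is a hypothetical structure: node name -> list of
    per-core CPU utilization percentages for the cores the job allocated.
    """
    unused = []
    for node, core_utils in node_core_util.items():
        if not core_utils:
            continue  # no allocated cores recorded for this node
        # Utilization is taken across all allocated cores on the node, so a
        # single busy core is enough for the node to count as used.
        node_util = sum(core_utils) / len(core_utils)
        if node_util == 0:
            unused.append(node)
    return unused


def job_is_flagged(node_core_util: Dict[str, List[float]]) -> bool:
    """A job is flagged if at least one of its nodes shows 0% utilization."""
    return len(nodes_with_zero_cpu(node_core_util)) > 0


# Hypothetical two-node job with four allocated cores per node.
job = {
    "node001": [97.0, 0.0, 0.0, 0.0],  # three idle cores, but the node is not at 0%
    "node002": [0.0, 0.0, 0.0, 0.0],   # the whole node is idle
}
print(nodes_with_zero_cpu(job))  # ['node002']
print(job_is_flagged(job))       # True
```

In this sketch the three idle cores on node001 go unnoticed, which mirrors the limitation described above: idle cores are only visible when they make the whole node idle.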

Configuration File

Below is an example entry for config.yaml:

zero-cpu-utilization-1:
  cluster: stellar
  partitions:
    - cpu
  min_run_time: 61 # minutes
  email_file: "zero_cpu_utilization.txt"
  admin_emails:
    - admin@institution.edu

The parameters are explained below:

  • cluster: Specify the cluster name as it appears in the Slurm database.

  • partitions: Specify one or more Slurm partitions.

  • email_file: The text file to be used for the email message to users.

  • cpu_hours_threshold: (Optional) Only users with greater than or equal to this number of CPU-hours at 0% utilization will receive an email. Default: 0

  • min_run_time: (Optional) Minimum run time of a job in minutes. Jobs that ran for less than this time are ignored. For example, with min_run_time: 61, jobs that ran for an hour or less are ignored. Default: 0

  • include_running_jobs: (Optional) If True, jobs in the RUNNING state are included in the calculation. The Prometheus server must be queried for each running job, which can be an expensive operation. Default: False

  • nodelist: (Optional) Only apply this alert to jobs that ran on the specified nodes. See example.

  • excluded_users: (Optional) List of usernames to exclude from receiving emails.

  • admin_emails: (Optional) List of administrator email addresses that should receive copies of the emails that are sent to users.

  • email_subject: (Optional) Subject of the email message to users.

  • report_title: (Optional) Title of the report to system administrators.
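To show how min_run_time, cpu_hours_threshold and excluded_users might interact, here is a small Python sketch. It is an illustration under stated assumptions, not the tool's code: the function name users_to_email, the job record fields (user, elapsed_minutes, unused_cpu_hours) and the sample numbers are all hypothetical.

```python
from collections import defaultdict


def users_to_email(jobs, min_run_time=0, cpu_hours_threshold=0, excluded_users=()):
    """Return the sorted usernames that would receive an email.

    Each job is a hypothetical dict with the keys "user",
    "elapsed_minutes" and "unused_cpu_hours" (CPU-hours at 0% utilization).
    """
    totals = defaultdict(float)
    for job in jobs:
        # Jobs shorter than min_run_time (in minutes) are ignored.
        if job["elapsed_minutes"] < min_run_time:
            continue
        totals[job["user"]] += job["unused_cpu_hours"]
    # The threshold is applied to each user's total, not to single jobs.
    return sorted(
        user
        for user, hours in totals.items()
        if hours >= cpu_hours_threshold and user not in excluded_users
    )


# Made-up job records for illustration.
jobs = [
    {"user": "u12345", "elapsed_minutes": 2880, "unused_cpu_hours": 500.0},
    {"user": "u12345", "elapsed_minutes": 2880, "unused_cpu_hours": 500.0},
    {"user": "u48726", "elapsed_minutes": 60, "unused_cpu_hours": 900.0},
    {"user": "u52209", "elapsed_minutes": 1440, "unused_cpu_hours": 100.0},
]

print(users_to_email(jobs))
# ['u12345', 'u48726', 'u52209']
print(users_to_email(jobs, min_run_time=61, cpu_hours_threshold=1000))
# ['u12345']
```

With min_run_time=61 the 60-minute job is dropped, and only u12345 reaches the 1000 CPU-hour threshold, because the two jobs are summed per user.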

Report for System Administrators

Below is an example report:

$ python job_defense_shield.py --zero-cpu-utilization

                     Jobs with Zero CPU Utilization                          
---------------------------------------------------------------------------
 JobID    User   Nodes  Nodes-Unused  CPU-Util-Unused  Cores  Hours  Emails
---------------------------------------------------------------------------
1931133  u12345    11         11             0%        1056    48     3 (1)   
1932935  u12345    11         11             0%        1056    48     0   
1932937  u48726     8          4             0%         768     2     0   
1933655  u52209     1          1             0%          96    24     0   
---------------------------------------------------------------------------
   Cluster: stellar
Partitions: cpu, physics
     Start: Wed Mar 12, 2025 at 02:50 PM
       End: Wed Mar 19, 2025 at 02:50 PM

Email Message to Users

Below is an example email (see email/zero_cpu_utilization.txt):

Hello Alan (u12345),

Below are your recent jobs that did not use all of the allocated nodes:

     JobID  Cluster  Nodes  Nodes-Unused CPU-Util-Unused  Cores Hours
    1931133 stellar   11         11             0%        1056   48 
    1932935 stellar   11         11             0%        1056   48 

The CPU utilization was found to be 0% on each of the unused nodes. Please
investigate this issue before running additional jobs.

Replying to this automated email will open a support ticket with Research
Computing.

Placeholders

The following placeholders can be used in the email file:

  • <GREETING>: The greeting generated by greeting-method.
  • <CLUSTER>: The cluster specified for the alert.
  • <PARTITIONS>: The partitions listed for the alert.
  • <DAYS>: Number of days in the time window (default is 7).
  • <NUM-JOBS>: Number of jobs with 0% CPU utilization.
  • <TABLE>: Table of job data.
  • <JOBSTATS>: The jobstats command for the first job of the user.
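The idea behind the placeholders is plain text substitution, which the following Python sketch illustrates. The function fill_placeholders, the template wording and the sample values are assumptions made for this example. The tool's own substitution logic may differ.

```python
def fill_placeholders(template: str, values: dict) -> str:
    """Replace each <NAME> placeholder in the template with its value.

    Placeholders that have no entry in values are left untouched.
    """
    for name, value in values.items():
        template = template.replace(f"<{name}>", str(value))
    return template


# Hypothetical email file contents using some of the placeholders above.
template = (
    "<GREETING>\n\n"
    "You have <NUM-JOBS> job(s) on <CLUSTER> (<PARTITIONS>) in the last "
    "<DAYS> days with 0% CPU utilization:\n\n"
    "<TABLE>\n"
)

# Sample values chosen for illustration.
values = {
    "GREETING": "Hello Alan (u12345),",
    "CLUSTER": "stellar",
    "PARTITIONS": "cpu",
    "DAYS": 7,
    "NUM-JOBS": 2,
    "TABLE": "  JobID  Cluster  Nodes\n1931133  stellar     11",
}

message = fill_placeholders(template, values)
print(message)
```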

Usage

Generate a report for system administrators:

$ python job_defense_shield.py --zero-cpu-utilization

Send emails to offending users:

$ python job_defense_shield.py --zero-cpu-utilization --email

See which users have received emails and when:

$ python job_defense_shield.py --zero-cpu-utilization --check

cron

Below is an example crontab entry:

0 9 * * 1-5 /path/to/python /path/to/job_defense_shield.py --zero-cpu-utilization --email -M della -r gpu,llm > /path/to/log/zero_cpu_utilization.log 2>&1