Jobs with 0% CPU Utilization
This alert identifies jobs with 0% CPU utilization.
The CPU utilization is calculated across all allocated CPU-cores on each node. A job will be included if the CPU utilization is 0% on any of the nodes. This alert is not capable of detecting individual CPU-cores that are idle unless the job only allocates one CPU-core per node.
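The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation. The function names `unused_nodes` and `is_flagged` and the input layout (a mapping from node name to the CPU utilization percentage computed across the job's allocated CPU-cores on that node) are assumptions made for this example.

```python
def unused_nodes(node_util):
    """Return the sorted names of nodes whose CPU utilization is 0%.

    node_util maps a node name to the utilization (in percent) computed
    across all CPU-cores allocated to the job on that node. This input
    format is hypothetical.
    """
    return sorted(node for node, util in node_util.items() if util == 0)


def is_flagged(node_util):
    """A job is flagged if at least one of its nodes shows 0% utilization."""
    return len(unused_nodes(node_util)) > 0


# Hypothetical two-node job: one node is busy, the other is idle.
job = {"node-a": 93.5, "node-b": 0.0}
print(unused_nodes(job))  # ['node-b']
print(is_flagged(job))    # True

# A single idle core among many busy ones lowers the node's utilization
# but does not bring it to 0%, so this job is not flagged.
print(is_flagged({"node-a": 98.9}))  # False
```

Because the check works on per-node values, an idle core is only visible when it is the sole core allocated on its node.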
Configuration File
Below is an example entry for config.yaml:
zero-cpu-utilization-1:
  cluster: stellar
  partitions:
    - cpu
  min_run_time: 61  # minutes
  email_file: "zero_cpu_utilization.txt"
  admin_emails:
    - admin@institution.edu
The parameters are explained below:
- cluster: Specify the cluster name as it appears in the Slurm database.
- partitions: Specify one or more Slurm partitions.
- email_file: The text file to be used for the email message to users.
- cpu_hours_threshold: (Optional) Only users with greater than or equal to this number of CPU-hours at 0% utilization will receive an email. Default: 0
- min_run_time: (Optional) Minimum run time of a job in units of minutes. If min_run_time: 61 then jobs that ran for an hour or less are ignored. Default: 0
- include_running_jobs: (Optional) If True then jobs in a state of RUNNING will be included in the calculation. The Prometheus server must be queried for each running job, which can be an expensive operation. Default: False
- nodelist: (Optional) Only apply this alert to jobs that ran on the specified nodes. See example.
- excluded_users: (Optional) List of usernames to exclude from receiving emails.
- admin_emails: (Optional) List of administrator email addresses that should receive copies of the emails that are sent to users.
- email_subject: (Optional) Subject of the email message to users.
- report_title: (Optional) Title of the report to system administrators.
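To make the interaction of min_run_time, cpu_hours_threshold and excluded_users more concrete, here is a rough Python sketch. It is not the tool's code. The function `users_to_email`, the job dictionary keys, and the usernames are hypothetical. Two further assumptions are labeled in the comments: how the run-time cutoff is compared, and how CPU-hours at 0% utilization are estimated.

```python
from collections import defaultdict


def users_to_email(jobs, min_run_time=0, cpu_hours_threshold=0, excluded_users=()):
    """Return the sorted usernames that would receive an email.

    Each job is a dict with hypothetical keys: user, elapsed_minutes,
    cores, nodes and nodes_unused.
    """
    unused_cpu_hours = defaultdict(float)
    for job in jobs:
        # Assumption: a job is kept when it ran for at least min_run_time minutes.
        if job["elapsed_minutes"] < min_run_time:
            continue
        # Assumption: cores are spread evenly over the nodes, so the idle
        # share of the CPU-hours is proportional to the fraction of unused nodes.
        hours = job["elapsed_minutes"] / 60
        idle_fraction = job["nodes_unused"] / job["nodes"]
        unused_cpu_hours[job["user"]] += job["cores"] * hours * idle_fraction
    return sorted(
        user
        for user, cpu_hours in unused_cpu_hours.items()
        if cpu_hours >= cpu_hours_threshold and user not in excluded_users
    )


# Hypothetical jobs that each had at least one node at 0% CPU utilization.
jobs = [
    {"user": "alice", "elapsed_minutes": 2880, "cores": 1056, "nodes": 11, "nodes_unused": 11},
    {"user": "bob", "elapsed_minutes": 120, "cores": 768, "nodes": 8, "nodes_unused": 4},
    {"user": "carol", "elapsed_minutes": 45, "cores": 96, "nodes": 1, "nodes_unused": 1},
]
print(users_to_email(jobs, min_run_time=61, cpu_hours_threshold=1000))  # ['alice']
```

In this sketch carol's job is dropped by the run-time cutoff, and bob's 768 idle CPU-hours fall below the threshold of 1000, so only alice is listed.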
Report for System Administrators
Below is an example report:
$ python job_defense_shield.py --zero-cpu-utilization
                      Jobs with Zero CPU Utilization
---------------------------------------------------------------------------
  JobID    User  Nodes  Nodes-Unused  CPU-Util-Unused  Cores  Hours  Emails
---------------------------------------------------------------------------
1931133  u12345     11            11               0%   1056     48  3 (1)
1932935  u12345     11            11               0%   1056     48  0
1932937  u48726      8             4               0%    768      2  0
1933655  u52209      1             1               0%     96     24  0
---------------------------------------------------------------------------
   Cluster: stellar
Partitions: cpu, physics
     Start: Wed Mar 12, 2025 at 02:50 PM
       End: Wed Mar 19, 2025 at 02:50 PM
Email Message to Users
Below is an example email (see email/zero_cpu_utilization.txt):
Hello Alan (u12345),

Below are your recent jobs that did not use all of the allocated nodes:

      JobID  Cluster  Nodes  Nodes-Unused  CPU-Util-Unused  Cores  Hours
    1931133  stellar     11            11               0%   1056     48
    1932935  stellar     11            11               0%   1056     48

The CPU utilization was found to be 0% on each of the unused nodes. Please
investigate this issue before running additional jobs.

Replying to this automated email will open a support ticket with Research
Computing.
Placeholders
The following placeholders can be used in the email file:
- <GREETING>: The greeting generated by greeting-method.
- <CLUSTER>: The cluster specified for the alert.
- <PARTITIONS>: The partitions listed for the alert.
- <DAYS>: Number of days in the time window (default is 7).
- <NUM-JOBS>: Number of jobs with 0% CPU utilization.
- <TABLE>: Table of job data.
- <JOBSTATS>: The jobstats command for the first job of the user.
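As an illustration of how such placeholders might be filled in, the following Python sketch performs a plain string substitution. The function `render_email`, the template wording, and the sample values are all hypothetical. The tool's own substitution logic may differ.

```python
def render_email(template, values):
    """Replace each placeholder tag in the template with its value."""
    message = template
    for tag, value in values.items():
        message = message.replace(tag, str(value))
    return message


# Hypothetical email file contents using some of the placeholders above.
template = (
    "<GREETING>\n\n"
    "In the past <DAYS> days you had <NUM-JOBS> job(s) on <CLUSTER> "
    "(<PARTITIONS>) with 0% CPU utilization:\n\n"
    "<TABLE>\n"
)

# Hypothetical values; in practice these would come from the alert itself.
values = {
    "<GREETING>": "Hello Alan (u12345),",
    "<DAYS>": 7,
    "<NUM-JOBS>": 2,
    "<CLUSTER>": "stellar",
    "<PARTITIONS>": "cpu",
    "<TABLE>": "1931133  stellar  11  11  0%  1056  48",
}
print(render_email(template, values))
```

A placeholder that does not appear in the email file is simply skipped, so a template only needs to include the tags it actually uses.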
Usage
Generate a report for system administrators:
Send emails to offending users:
See which users have received emails and when:
cron
Below is an example crontab entry: