Too Many Allocated CPU-Cores per GPU
This alert identifies jobs that are allocating too many CPU-cores per GPU. The goal is to prevent the situation where there are free GPUs on a node but not enough CPU-cores to accept new jobs.
Configuration File
Below is an example entry for config.yaml:
too-many-cores-per-gpu-1:
  cluster: della
  partitions:
    - gpu
  cores_per_node: 96
  gpus_per_node: 8
  cores_per_gpu_target: 12
  cores_per_gpu_limit: 18
  email_file: "too_many_cores_per_gpu.txt"
  admin_emails:
    - admin@institution.edu
The available settings are listed below:
- cluster: Specify the cluster name as it appears in the Slurm database.
- partitions: Specify one or more Slurm partitions. Use "*" to include all partitions (i.e., partitions: ["*"]).
- gpus_per_node: Number of GPUs per node.
- cores_per_gpu_target: This should be the number of CPU-cores divided by the number of GPUs per node. For instance, for nodes with 96 cores and 8 GPUs, one should use 96/8=12.
- cores_per_gpu_limit: Only consider jobs where the number of CPU-cores per GPU is greater than this value. One should set this parameter equal to cores_per_gpu_target or a value slightly larger.
- cores_per_node: Number of CPU-cores per node.
- email_file: The text file to be used for the email message to users.
- gpu_hours_threshold: (Optional) Minimum number of GPU-hours (summed over the jobs) for a user to be considered. This setting makes it possible to ignore users that are not consuming many resources. Default: 0
- cpu_eff_threshold: (Optional) Ignore jobs with a CPU efficiency (as a percentage) greater than this value. For instance, to ignore jobs with a CPU efficiency greater than 80%, use cpu_eff_threshold: 80. Default: 100
- min_run_time: (Optional) Minimum run time in minutes for a job to be considered. For example, if min_run_time: 30 is used then jobs that ran for less than 30 minutes are ignored. Default: 0
- include_running_jobs: (Optional) If True then jobs in a state of RUNNING will be considered. The Prometheus server must be queried for each running job, which can be an expensive operation. Default: False
- nodelist: (Optional) Only apply this alert to jobs that ran on the specified nodes. See example.
- excluded_qos: (Optional) List of QOSes to exclude from this alert.
- excluded_partitions: (Optional) List of partitions to exclude from this alert. This is useful when partitions: ["*"] is used.
- excluded_users: (Optional) List of users to exclude from receiving emails.
- admin_emails: (Optional) List of administrator email addresses that should receive copies of the emails that are sent to users.
- email_subject: (Optional) Subject of the email message to users.
- report_title: (Optional) Title of the report to system administrators.
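Taken together, the thresholds above define a simple selection rule. The sketch below illustrates how a job might be tested against them; it is a minimal illustration for clarity, not the tool's actual implementation, and all names are made up:

```python
def flags_job(cores, gpus, cpu_eff, run_minutes,
              cores_per_gpu_limit=18, cpu_eff_threshold=100, min_run_time=0):
    """Return True if a job allocates too many CPU-cores per GPU.

    Mirrors the settings above: a job is flagged when its cores-per-GPU
    ratio exceeds cores_per_gpu_limit, its CPU efficiency (as a percent)
    is not above cpu_eff_threshold, and it ran at least min_run_time
    minutes. (Illustrative only; the real tool also sums GPU-hours per
    user against gpu_hours_threshold.)
    """
    if gpus == 0 or run_minutes < min_run_time:
        return False
    if cpu_eff > cpu_eff_threshold:
        return False  # sufficiently efficient jobs are ignored
    return cores / gpus > cores_per_gpu_limit

# Job 62733079 from the report below: 64 cores, 2 GPUs, 15% CPU efficiency
print(flags_job(cores=64, gpus=2, cpu_eff=15, run_minutes=78))  # True (32 > 18)
```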
Report for System Administrators
Below is an example report:
$ job_defense_shield --too-many-cores-per-gpu
Too Many CPU-Cores Per GPU
-------------------------------------------------------------------------------
JobID User Hours CPU-Eff GPUs Cores-per-GPU Cores-per-GPU-Target Emails
-------------------------------------------------------------------------------
62675166 u79355 1.8 5% 1 48 12 2 (1)
62733079 u12345 1.3 15% 2 32 12 0
62735106 u12345 1.4 15% 2 32 12 0
62950436 u23992 1.2 7% 1 32 12 0
62952770 u23992 1.2 1% 1 32 12 0
-------------------------------------------------------------------------------
Cluster: della
Partitions: gpu
Start: Tue Mar 04, 2025 at 11:32 AM
End: Tue Mar 11, 2025 at 11:32 AM
Email Message to Users
Below is an example email (see email/too_many_cores_per_gpu.txt):
Hello Alan (u12345),
Your Della (PLI) jobs may be using more CPU-cores per GPU than necessary:
JobID Hours CPU-Eff Cores GPUs Cores-per-GPU Cores-per-GPU-Target
62733079 1.3 15% 64 2 32 12
62735106 1.4 15% 64 2 32 12
Each node on Della (PLI) has 96 CPU-cores and 8 GPUs. If possible please try
to use only up to 12 CPU-cores per GPU. This will prevent the situation
where there are free GPUs on a node but not enough CPU-cores to accept new
jobs. CPU-Eff is the CPU efficiency.
Replying to this automated email will open a support ticket with Research
Computing.
Placeholders
The following placeholders can be used in the email file:
- <GREETING>: The greeting generated by greeting-method.
- <CLUSTER>: The cluster specified for the alert (i.e., cluster).
- <PARTITIONS>: A comma-separated list of partitions used by the user.
- <TARGET>: The soft limit for the number of CPU-cores per GPU (i.e., cores_per_gpu_target).
- <CORES>: Cores per node (i.e., cores_per_node).
- <GPUS>: Number of GPUs per node (i.e., gpus_per_node).
- <DAYS>: Number of days in the time window (default is 7).
- <NUM-JOBS>: Total number of jobs with at least one idle GPU.
- <TABLE>: Table of job data.
- <JOBSTATS>: The jobstats command for the first job of the user.
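For reference, a template along these lines could reproduce the example email shown earlier. This is a sketch of how the placeholders fit together, not the file shipped with the tool:

    <GREETING>

    Your <CLUSTER> jobs may be using more CPU-cores per GPU than necessary:

    <TABLE>

    Each node on <CLUSTER> has <CORES> CPU-cores and <GPUS> GPUs. If possible
    please try to use only up to <TARGET> CPU-cores per GPU. This will prevent
    the situation where there are free GPUs on a node but not enough CPU-cores
    to accept new jobs.

    For more detail on a given job, run:

        <JOBSTATS>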
Usage
Generate a report of the jobs allocating too many CPU cores per GPU:
Send emails to offending users:
See which users have received emails and when:
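The report-generation command appears in the report section above; a plausible set of commands for all three steps is sketched below. The --email and --check flags are assumptions based on the tool's general conventions, so confirm them against your installation (e.g., with --help):

    # Generate the report of offending jobs
    $ job_defense_shield --too-many-cores-per-gpu

    # Send emails to offending users (assumes an --email flag)
    $ job_defense_shield --too-many-cores-per-gpu --email

    # Show which users have received emails and when (assumes a --check flag)
    $ job_defense_shield --too-many-cores-per-gpu --check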
cron
Below is an example entry for crontab:
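The entry below is illustrative: the install path, log path, schedule, and the --email flag are assumptions to adapt to your site. It runs the alert (and sends emails) every Monday at 9:00 AM:

    0 9 * * 1 /path/to/job_defense_shield --too-many-cores-per-gpu --email > /path/to/log 2>&1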