Too Many Allocated CPU-Cores per GPU
This alert identifies jobs that are allocating too many CPU-cores per GPU. The goal is to prevent the situation where there are free GPUs on a node but not enough CPU-cores to accept new jobs.
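To make the motivation concrete, the sketch below packs identical jobs onto a single node and reports what is left over. It is only an illustration: the function name `leftover_resources` and the assumption that every job on the node is identical are hypothetical and not part of Job Defense Shield. The node size (96 CPU-cores, 8 GPUs) is taken from the example used on this page.

```python
def leftover_resources(cores_per_node, gpus_per_node, cores_per_job, gpus_per_job):
    """Pack identical jobs onto one node; return (free_cores, free_gpus)."""
    # The number of jobs that fit is limited by whichever resource runs out first.
    num_jobs = min(cores_per_node // cores_per_job, gpus_per_node // gpus_per_job)
    free_cores = cores_per_node - num_jobs * cores_per_job
    free_gpus = gpus_per_node - num_jobs * gpus_per_job
    return free_cores, free_gpus

# Jobs requesting 48 CPU-cores per GPU: two jobs use every core, 6 GPUs sit idle.
greedy = leftover_resources(96, 8, cores_per_job=48, gpus_per_job=1)
# Jobs requesting 12 CPU-cores per GPU: eight jobs fill the node exactly.
balanced = leftover_resources(96, 8, cores_per_job=12, gpus_per_job=1)
print(greedy)    # (0, 6)
print(balanced)  # (0, 0)
```

In the first case the node still has six free GPUs, but no new GPU job can start because no CPU-cores remain.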
Configuration File
Below is an example entry for config.yaml:
too-many-cores-per-gpu-1:
  cluster: della
  partitions:
    - gpu
  cores_per_node: 96
  gpus_per_node: 8
  cores_per_gpu_target: 12
  cores_per_gpu_limit: 18
  email_file: "too_many_cores_per_gpu.txt"
  admin_emails:
    - admin@institution.edu
The available settings are listed below:
- cluster: Specify the cluster name as it appears in the Slurm database.
- partitions: Specify one or more Slurm partitions.
- gpus_per_node: Number of GPUs per node.
- cores_per_gpu_target: This should be the number of CPU-cores divided by the number of GPUs per node. For instance, for nodes with 96 cores and 8 GPUs, one should use 96/8=12.
- cores_per_gpu_limit: Include jobs where the number of CPU-cores per GPU is equal to or greater than this value. One may set this equal to cores_per_gpu_target or a value slightly larger.
- cores_per_node: Number of CPU-cores per node.
- email_file: The text file to be used for the email message to users.
- min_run_time: (Optional) Minimum run time in minutes for a job to be included in the calculation. For example, if min_run_time: 30 is used then jobs that ran for less than 30 minutes are ignored. Default: 0
- include_running_jobs: (Optional) If True then jobs in a state of RUNNING will be included in the calculation. The Prometheus server must be queried for each running job, which can be an expensive operation. Default: False
- nodelist: (Optional) Only apply this alert to jobs that ran on the specified nodes. See example.
- excluded_users: (Optional) List of users to exclude from receiving emails.
- admin_emails: (Optional) List of administrator email addresses that should receive copies of the emails that are sent to users.
- email_subject: (Optional) Subject of the email message to users.
- report_title: (Optional) Title of the report to system administrators.
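The way cores_per_gpu_limit and min_run_time interact can be sketched in a few lines of Python. This is a minimal illustration of the rules described in the list above, not the actual Job Defense Shield implementation: the `Job` class, its field names, and the `is_flagged` function are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Job:
    # Hypothetical job record; field names are assumptions for this sketch.
    jobid: str
    cores: int
    gpus: int
    elapsed_minutes: float

def is_flagged(job, cores_per_gpu_limit, min_run_time=0):
    """Return True if the job allocates too many CPU-cores per GPU."""
    if job.gpus == 0:
        return False  # the alert only concerns GPU jobs
    if job.elapsed_minutes < min_run_time:
        return False  # jobs shorter than min_run_time are ignored
    # Jobs at or above the limit are included.
    return job.cores / job.gpus >= cores_per_gpu_limit

# cores_per_gpu_target for a node with 96 CPU-cores and 8 GPUs.
target = 96 // 8
# A job with 64 CPU-cores and 2 GPUs that ran for 78 minutes (1.3 hours).
job = Job("62733079", cores=64, gpus=2, elapsed_minutes=78)
flagged = is_flagged(job, cores_per_gpu_limit=18, min_run_time=30)
print(target, job.cores / job.gpus, flagged)
```

With a limit of 18, this job is flagged because it uses 32 CPU-cores per GPU, while a job using exactly the target of 12 would not be.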
Report for System Administrators
Below is an example report:
$ python job_defense_shield.py --too-many-cores-per-gpu
Too Many CPU-Cores Per GPU
-------------------------------------------------------------------------------
JobID     User    Hours CPU-Eff GPUs Cores-per-GPU Cores-per-GPU-Target Emails
-------------------------------------------------------------------------------
62675166  u79355  1.8   5%      1    48            12                   2 (1)
62733079  u73812  1.3   15%     2    32            12                   0
62735106  u73812  1.4   15%     2    32            12                   0
62950436  u23992  1.2   7%      1    32            12                   0
62952770  u23992  1.2   1%      1    32            12                   0
-------------------------------------------------------------------------------
Cluster: della
Partitions: gpu
Start: Tue Mar 04, 2025 at 11:32 AM
End: Tue Mar 18, 2025 at 11:32 AM
Email Message to Users
Below is an example email (see email/too_many_cores_per_gpu.txt):
Hello Alan (u12345),
Your Della (PLI) jobs may be using more CPU-cores per GPU than necessary:
JobID     Hours  CPU-Eff  Cores  GPUs  Cores-per-GPU  Cores-per-GPU-Target
62733079  1.3    15%      64     2     32             12
62735106  1.4    15%      64     2     32             12
Each node on Della (PLI) has 96 CPU-cores and 8 GPUs. If possible please try
to use only up to 12 CPU-cores per GPU. This will prevent the situation
where there are free GPUs on a node but not enough CPU-cores to accept new
jobs. CPU-Eff is the CPU efficiency.
Replying to this automated email will open a support ticket with Research
Computing.
Placeholders
The following placeholders can be used in the email file:
- <GREETING>: The greeting generated by greeting-method.
- <CLUSTER>: The cluster specified for the alert (i.e., cluster).
- <PARTITIONS>: The partitions listed for the alert (i.e., partitions).
- <TARGET>: The soft limit for the number of CPU-cores per GPU (i.e., cores_per_gpu_target).
- <CORES>: Cores per node (i.e., cores_per_node).
- <GPUS>: Number of GPUs per node (i.e., gpus_per_node).
- <DAYS>: Number of days in the time window (default is 7).
- <NUM-JOBS>: Total number of jobs allocating too many CPU-cores per GPU.
- <TABLE>: Table of job data.
- <JOBSTATS>: The jobstats command for the first job of the user.
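As a rough illustration of how such placeholders might be filled in, the sketch below performs plain string substitution on a fragment of an email template. The `fill_placeholders` function and the sample values are hypothetical; Job Defense Shield may implement the substitution differently.

```python
def fill_placeholders(template, values):
    """Replace each <NAME> placeholder in the template with its value."""
    for name, value in values.items():
        template = template.replace(f"<{name}>", str(value))
    return template

# Fragment of an email file using placeholders from the list above.
template = (
    "Each node on <CLUSTER> has <CORES> CPU-cores and <GPUS> GPUs. "
    "If possible please try to use only up to <TARGET> CPU-cores per GPU."
)
# Sample values, as they might be derived from the alert configuration.
values = {"CLUSTER": "della", "CORES": 96, "GPUS": 8, "TARGET": 12}
message = fill_placeholders(template, values)
print(message)
```

Placeholders that do not appear in the template are simply skipped, so one dictionary of values can serve several email files.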
Usage
Generate a report of the jobs allocating too many CPU-cores per GPU:
$ python job_defense_shield.py --too-many-cores-per-gpu
Send emails to offending users:
See which users have received emails and when:
cron
Below is an example entry for crontab: