Too Many Allocated CPU-Cores per GPU
This alert identifies jobs that are allocating too many CPU-cores per GPU. The goal is to prevent the situation where there are free GPUs on a node but not enough CPU-cores to accept new jobs.
Configuration File
Below is an example entry for config.yaml:
too-many-cores-per-gpu-1:
  cluster: della
  partitions:
    - gpu
  cores_per_node: 96
  gpus_per_node: 8
  cores_per_gpu_target: 12
  cores_per_gpu_limit: 18
  email_file: "too_many_cores_per_gpu.txt"
  admin_emails:
    - admin@institution.edu
The available settings are listed below:
- cluster: Specify the cluster name as it appears in the Slurm database.
- partitions: Specify one or more Slurm partitions. Use "*" to include all partitions (i.e., partitions: ["*"]).
- cores_per_node: Number of CPU-cores per node.
- gpus_per_node: Number of GPUs per node.
- cores_per_gpu_target: The number of CPU-cores per node divided by the number of GPUs per node. For instance, for nodes with 96 cores and 8 GPUs, one should use 96/8 = 12.
- cores_per_gpu_limit: Include jobs where the number of CPU-cores per GPU is equal to or greater than this value. One may set this equal to cores_per_gpu_target or a slightly larger value.
- email_file: The text file to be used for the email message to users.
- gpu_hours_threshold: (Optional) Minimum number of GPU-hours (summed over the jobs) for a user to be considered. This setting makes it possible to ignore users that are not consuming many resources. Default: 0
- cpu_eff_threshold: (Optional) Ignore jobs with a CPU efficiency greater than this value. Default: 100
- min_run_time: (Optional) Minimum run time in minutes for a job to be included in the calculation. For example, if min_run_time: 30 is used then jobs that ran for less than 30 minutes are ignored. Default: 0
- include_running_jobs: (Optional) If True then jobs in a state of RUNNING will be included in the calculation. The Prometheus server must be queried for each running job, which can be an expensive operation. Default: False
- nodelist: (Optional) Only apply this alert to jobs that ran on the specified nodes. See example.
- excluded_qos: (Optional) List of QOSes to exclude from this alert.
- excluded_partitions: (Optional) List of partitions to exclude from this alert. This is useful when partitions: ["*"] is used.
- excluded_users: (Optional) List of users to exclude from receiving emails.
- admin_emails: (Optional) List of administrator email addresses that should receive copies of the emails sent to users.
- email_subject: (Optional) Subject of the email message to users.
- report_title: (Optional) Title of the report sent to system administrators.
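For illustration, several of the optional settings can be combined in a single entry. The sketch below is an assumption about how such an entry might look; the node names, QOS value, and threshold values are hypothetical and should be replaced with values for your site:

```yaml
too-many-cores-per-gpu-2:
  cluster: della
  partitions:
    - "*"                     # all partitions
  cores_per_node: 96
  gpus_per_node: 8
  cores_per_gpu_target: 12    # 96 cores / 8 GPUs
  cores_per_gpu_limit: 18
  min_run_time: 30            # ignore jobs that ran for less than 30 minutes
  cpu_eff_threshold: 50       # ignore jobs with CPU efficiency above 50%
  nodelist:                   # hypothetical node names
    - della-l01g1
    - della-l01g2
  excluded_qos:               # hypothetical QOS
    - test-qos
  excluded_partitions:        # useful since partitions is ["*"]
    - cpu
  email_file: "too_many_cores_per_gpu.txt"
  admin_emails:
    - admin@institution.edu
```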
Report for System Administrators
Below is an example report:
$ job_defense_shield --too-many-cores-per-gpu
                                Too Many CPU-Cores Per GPU                                   
-------------------------------------------------------------------------------
 JobID     User  Hours CPU-Eff  GPUs Cores-per-GPU  Cores-per-GPU-Target Emails
-------------------------------------------------------------------------------
62675166  u79355  1.8     5%      1         48                12          2 (1)
62733079  u73812  1.3    15%      2         32                12          0   
62735106  u73812  1.4    15%      2         32                12          0   
62950436  u23992  1.2     7%      1         32                12          0   
62952770  u23992  1.2     1%      1         32                12          0   
-------------------------------------------------------------------------------
   Cluster: della
Partitions: gpu
     Start: Tue Mar 04, 2025 at 11:32 AM
       End: Tue Mar 18, 2025 at 11:32 AM
Email Message to Users
Below is an example email (see email/too_many_cores_per_gpu.txt):
Hello Alan (u12345),
Your Della (PLI) jobs may be using more CPU-cores per GPU than necessary:
    JobID   Hours CPU-Eff  Cores  GPUs Cores-per-GPU  Cores-per-GPU-Target
   62733079  1.3    15%     64     2         32                12         
   62735106  1.4    15%     64     2         32                12         
Each node on Della (PLI) has 96 CPU-cores and 8 GPUs. If possible please try
to use only up to 12 CPU-cores per GPU. This will prevent the situation
where there are free GPUs on a node but not enough CPU-cores to accept new
jobs. CPU-Eff is the CPU efficiency.
Replying to this automated email will open a support ticket with Research
Computing.
Placeholders
The following placeholders can be used in the email file:
- <GREETING>: The greeting generated by greeting-method.
- <CLUSTER>: The cluster specified for the alert (i.e., cluster).
- <PARTITIONS>: A comma-separated list of the partitions used by the user.
- <TARGET>: The soft limit for the number of CPU-cores per GPU (i.e., cores_per_gpu_target).
- <CORES>: Cores per node (i.e., cores_per_node).
- <GPUS>: Number of GPUs per node (i.e., gpus_per_node).
- <DAYS>: Number of days in the time window (default is 7).
- <NUM-JOBS>: Total number of jobs with at least one idle GPU.
- <TABLE>: Table of job data.
- <JOBSTATS>: The jobstats command for the first job of the user.
Usage
Generate a report of the jobs allocating too many CPU-cores per GPU:
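This uses the same flag shown in the report example above:

```shell
$ job_defense_shield --too-many-cores-per-gpu
```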
Send emails to offending users:
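A sketch, assuming the emailing step follows the same command-line convention as the report (the --email flag is an assumption if your version differs):

```shell
$ job_defense_shield --too-many-cores-per-gpu --email
```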
See which users have received emails and when:
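Again assuming the tool's common flag conventions (the --check flag is an assumption if your version differs):

```shell
$ job_defense_shield --too-many-cores-per-gpu --check
```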
cron
Below is an example entry for crontab:
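A minimal sketch, assuming emails should go out each weekday morning; the schedule, installation path, and log location are placeholders to adapt to your site:

```shell
0 9 * * 1-5 /path/to/job_defense_shield --too-many-cores-per-gpu --email > /path/to/log/too_many_cores_per_gpu.log 2>&1
```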