Too Many Allocated CPU-Cores per GPU

This alert identifies jobs that allocate too many CPU-cores per GPU. The goal is to prevent the situation where a node has free GPUs but too few free CPU-cores to accept new jobs.

Configuration File

Below is an example entry for config.yaml:

too-many-cores-per-gpu-1:
  cluster: della
  partitions:
    - gpu
  cores_per_node: 96
  gpus_per_node: 8
  cores_per_gpu_target: 12
  cores_per_gpu_limit: 18
  email_file: "too_many_cores_per_gpu.txt"
  admin_emails:
    - admin@institution.edu

The available settings are listed below:

  • cluster: Specify the cluster name as it appears in the Slurm database.

  • partitions: Specify one or more Slurm partitions.

  • cores_per_node: Number of CPU-cores per node.

  • gpus_per_node: Number of GPUs per node.

  • cores_per_gpu_target: This should be the number of CPU-cores per node divided by the number of GPUs per node. For instance, for nodes with 96 cores and 8 GPUs, one should use 96/8=12.

  • cores_per_gpu_limit: Include jobs where the number of CPU-cores per GPU is equal to or greater than this value. One may set this equal to cores_per_gpu_target or to a slightly larger value.

  • email_file: The text file to be used for the email message to users.

  • min_run_time: (Optional) Minimum run time in minutes for a job to be included in the calculation. For example, if min_run_time: 30 is used then jobs that ran for less than 30 minutes are ignored. Default: 0

  • include_running_jobs: (Optional) If True then jobs in a state of RUNNING will be included in the calculation. The Prometheus server must be queried for each running job, which can be an expensive operation. Default: False

  • nodelist: (Optional) Only apply this alert to jobs that ran on the specified nodes. See example.

  • excluded_users: (Optional) List of users to exclude from receiving emails.

  • admin_emails: (Optional) List of administrator email addresses that should receive copies of the emails that are sent to users.

  • email_subject: (Optional) Subject of the email message to users.

  • report_title: (Optional) Title of the report to system administrators.
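The arithmetic relationships among these settings can be sanity-checked before deploying. Below is a minimal sketch (the validate function and dict-based entry are illustrative assumptions, not part of the tool) that checks a config.yaml entry for consistency:

```python
# Sanity-check a too-many-cores-per-gpu config entry (a sketch; the
# tool's own validation may differ).
def validate(entry):
    cores = entry["cores_per_node"]
    gpus = entry["gpus_per_node"]
    target = entry["cores_per_gpu_target"]
    limit = entry["cores_per_gpu_limit"]
    # cores per node should divide evenly among the GPUs
    assert cores % gpus == 0, "cores_per_node must divide evenly by gpus_per_node"
    # the target is defined as cores per node divided by GPUs per node
    assert target == cores // gpus, "cores_per_gpu_target should equal cores_per_node / gpus_per_node"
    # the limit is a threshold at or above the target
    assert limit >= target, "cores_per_gpu_limit should be >= cores_per_gpu_target"
    return True

entry = {"cores_per_node": 96, "gpus_per_node": 8,
         "cores_per_gpu_target": 12, "cores_per_gpu_limit": 18}
print(validate(entry))  # True
```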

Report for System Administrators

Below is an example report:

$ python job_defense_shield.py --too-many-cores-per-gpu

                                Too Many CPU-Cores Per GPU                                   
-------------------------------------------------------------------------------
 JobID     User  Hours CPU-Eff  GPUs Cores-per-GPU  Cores-per-GPU-Target Emails
-------------------------------------------------------------------------------
62675166  u79355  1.8     5%      1         48                12          2 (1)
62733079  u73812  1.3    15%      2         32                12          0   
62735106  u73812  1.4    15%      2         32                12          0   
62950436  u23992  1.2     7%      1         32                12          0   
62952770  u23992  1.2     1%      1         32                12          0   
-------------------------------------------------------------------------------
   Cluster: della
Partitions: gpu
     Start: Tue Mar 04, 2025 at 11:32 AM
       End: Tue Mar 18, 2025 at 11:32 AM
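The selection rule behind this report can be sketched in a few lines: a job is flagged when its allocated CPU-cores divided by its allocated GPUs meets or exceeds cores_per_gpu_limit, after discarding jobs shorter than min_run_time. The function and field names below are illustrative assumptions, not the tool's actual code:

```python
# Flag jobs whose CPU-cores per GPU meet or exceed the limit
# (a sketch of the selection rule, not the tool's implementation).
def flagged(jobs, limit=18, min_run_time=0):
    out = []
    for job in jobs:
        if job["minutes"] < min_run_time:
            continue  # ran for too little time to be considered
        if job["cores"] / job["gpus"] >= limit:
            out.append(job["jobid"])
    return out

jobs = [
    {"jobid": "62675166", "cores": 48, "gpus": 1, "minutes": 108},
    {"jobid": "62733079", "cores": 64, "gpus": 2, "minutes": 78},
    {"jobid": "62999999", "cores": 12, "gpus": 1, "minutes": 60},
]
print(flagged(jobs))  # ['62675166', '62733079']
```

With limit=18, the first job (48 cores per GPU) and the second (64/2=32 cores per GPU) are flagged; the third (12 cores per GPU) is at the target and is ignored.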

Email Message to Users

Below is an example email (see email/too_many_cores_per_gpu.txt):

Hello Alan (u12345),

Your Della (PLI) jobs may be using more CPU-cores per GPU than necessary:

    JobID   Hours CPU-Eff  Cores  GPUs Cores-per-GPU  Cores-per-GPU-Target
   62733079  1.3    15%     64     2         32                12         
   62735106  1.4    15%     64     2         32                12         

Each node on Della (PLI) has 96 CPU-cores and 8 GPUs. If possible please try
to use only up to 12 CPU-cores per GPU. This will prevent the situation
where there are free GPUs on a node but not enough CPU-cores to accept new
jobs. CPU-Eff is the CPU efficiency.

Replying to this automated email will open a support ticket with Research
Computing.

Placeholders

The following placeholders can be used in the email file:

  • <GREETING>: The greeting generated by greeting-method.
  • <CLUSTER>: The cluster specified for the alert (i.e., cluster).
  • <PARTITIONS>: The partitions listed for the alert (i.e., partitions).
  • <TARGET>: The soft limit for the number of CPU-cores per GPU (i.e., cores_per_gpu_target).
  • <CORES>: Cores per node (i.e., cores_per_node).
  • <GPUS>: Number of GPUs per node (i.e., gpus_per_node).
  • <DAYS>: Number of days in the time window (default is 7).
  • <NUM-JOBS>: Total number of the user's jobs included in the alert.
  • <TABLE>: Table of job data.
  • <JOBSTATS>: The jobstats command for the first job of the user.
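Placeholder substitution amounts to a simple text replacement over the email file. The sketch below illustrates the idea (the render function and its inputs are assumptions for illustration; the tool's mechanism may differ):

```python
# Substitute <PLACEHOLDER> tokens into an email template
# (a sketch; not the tool's actual substitution code).
def render(template, values):
    for key, val in values.items():
        template = template.replace(f"<{key}>", str(val))
    return template

template = "<GREETING>\n\nPlease use up to <TARGET> CPU-cores per GPU on <CLUSTER>."
values = {"GREETING": "Hello Alan (u12345),",
          "TARGET": 12,
          "CLUSTER": "Della (PLI)"}
print(render(template, values))
```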

Usage

Generate a report of the jobs allocating too many CPU-cores per GPU:

$ python job_defense_shield.py --too-many-cores-per-gpu

Send emails to offending users:

$ python job_defense_shield.py --too-many-cores-per-gpu --email

See which users have received emails and when:

$ python job_defense_shield.py --too-many-cores-per-gpu --check

cron

Below is an example entry for crontab:

0 9 * * * /path/to/python /path/to/job_defense_shield.py --too-many-cores-per-gpu --email > /path/to/log/too_many_cores_per_gpu.log 2>&1