Too Much Allocated CPU Memory per GPU
This alert identifies jobs that allocate too much CPU memory per GPU. The goal is to prevent the situation where there are free GPUs on a node but not enough CPU memory to accept new jobs.
Configuration File
Below is an example entry for config.yaml:
too-much-cpu-mem-per-gpu-1:
  cluster: della
  partitions:
    - gpu
    - llm
  cores_per_node: 96              # count
  gpus_per_node: 8                # count
  cpu_mem_per_node: 1000          # GB
  cpu_mem_per_gpu_target: 115     # GB
  cpu_mem_per_gpu_limit: 128      # GB
  mem_eff_thres: 0.8              # [0.0, 1.0]
  min_run_time: 30                # minutes
  email_file: "too_much_cpu_mem_per_gpu.txt"
  admin_emails:
    - admin@institution.edu
The available settings are listed below:
- cluster: Specify the cluster name as it appears in the Slurm database.
- partitions: Specify one or more Slurm partitions.
- cores_per_node: Number of CPU-cores per node.
- gpus_per_node: Number of GPUs per node.
- cpu_mem_per_node: Total CPU memory per node in units of GB.
- cpu_mem_per_gpu_target: This should be a value slightly less than the total CPU memory divided by the number of GPUs per node, in GB. For instance, nodes with 1000 GB of memory and 8 GPUs might use cpu_mem_per_gpu_target: 120. The idea is to save some memory for the operating system. This setting may be interpreted as a soft limit.
- cpu_mem_per_gpu_limit: Identify jobs with a CPU memory allocation per GPU greater than this value. It is reasonable to use the same value as cpu_mem_per_gpu_target for this setting. (A sketch of how this limit combines with the other thresholds appears after this list.)
- email_file: The text file to be used for the email message to users.
- min_run_time: (Optional) Minimum run time in minutes for a job to be included in the calculation. For example, if min_run_time: 30 is used then jobs that ran for less than 30 minutes are ignored. Default: 0
- mem_eff_thres: (Optional) Ignore jobs where the ratio of used to allocated CPU memory is greater than or equal to this value. Default: 1.0
- include_running_jobs: (Optional) If True then jobs in a state of RUNNING will be included in the calculation. The Prometheus server must be queried for each running job, which can be an expensive operation. Default: False
- nodelist: (Optional) Only apply this alert to jobs that ran on the specified nodes. See example.
- excluded_users: (Optional) List of usernames to exclude from receiving emails.
- admin_emails: (Optional) List of administrator email addresses that should receive copies of the emails that are sent to users.
- email_subject: (Optional) Subject of the email message to users.
- report_title: (Optional) Title of the report to system administrators.
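To make the interplay of cpu_mem_per_gpu_limit, mem_eff_thres, and min_run_time concrete, here is a minimal sketch in Python of when a job would be flagged. It is an illustration only, not the tool's actual code; the function name and job fields are made up, and the default values are taken from the configuration example above.

def is_flagged(alloc_cpu_mem_gb, gpus, mem_eff, run_time_minutes,
               limit_gb=128, mem_eff_thres=0.8, min_run_time=30):
    """Return True if a job allocates too much CPU memory per GPU."""
    if run_time_minutes < min_run_time:
        return False                    # too short to be included in the calculation
    if mem_eff >= mem_eff_thres:
        return False                    # the allocated memory is largely being used
    return alloc_cpu_mem_gb / gpus > limit_gb    # limit in GB per GPU

# Example: 512 GB allocated to 1 GPU, 6% memory efficiency, 48-hour run time
print(is_flagged(512, 1, 0.06, 48 * 60))         # True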
Report for System Administrators
An example report is shown below:
$ python job_defense_shield.py --too-much-cpu-mem-per-gpu
                             Too Much CPU Memory Per GPU
-------------------------------------------------------------------------------------
  JobID    User  Hours  Mem-Eff  GPUs  CPU-Mem-per-GPU  CPU-Mem-per-GPU-Limit  Emails
-------------------------------------------------------------------------------------
6279176  u29427     23       8%     1           500 GB                 240 GB   1 (4)
6279179  u29427     31       2%     1           500 GB                 240 GB   1 (4)
6270434  u15404     48       6%     1           512 GB                 240 GB   3 (2)
6293177  u15404    1.4       1%     1           512 GB                 240 GB   3 (2)
6291411  u81448    3.8       0%     1           512 GB                 240 GB       0
6283444  u35452    1.8       0%     1           500 GB                 240 GB   1 (5)
-------------------------------------------------------------------------------------
Cluster: della
Partitions: gpu
Start: Sun Mar 09, 2025 at 10:54 PM
End: Sun Mar 16, 2025 at 10:54 PM
Email Message to Users
Below is an example email (see email/too_much_cpu_mem_per_gpu.txt):
Hello Alan (u12345),
Your Della (PLI) jobs appear to be allocating more CPU memory than necessary:
   JobID  Hours  Mem-Eff  CPU-Mem  GPUs  CPU-Mem-per-GPU  CPU-Mem-per-GPU-Limit
62733079    1.3      17%   512 GB     2           256 GB                 115 GB
62735106    1.4      12%   512 GB     2           256 GB                 115 GB
Each node on Della (PLI) has 1000 GB of CPU memory and 8 GPUs. If possible,
please only allocate up to the soft limit of 115 GB of CPU memory per GPU. This
will prevent the situation where there are free GPUs on a node but not enough
CPU memory to accept new jobs.
"Mem-Eff" is the memory efficiency or the ratio of used to allocated CPU memory.
A good target value for this quantity is 80% and above. Please use an accurate
value for the --mem, --mem-per-cpu or --mem-per-gpu Slurm directive. For job
62733079, one could have used:
#SBATCH --mem-per-gpu=50G
Replying to this automated email will open a support ticket with Research
Computing.
Placeholders
The following placeholders can be used in the email file:
- <GREETING>: The greeting generated by greeting-method.
- <CLUSTER>: The cluster specified for the alert.
- <PARTITIONS>: The partitions listed for the alert.
- <CORES>: Cores per node (i.e., cores_per_node).
- <MEMORY>: CPU memory per node (i.e., cpu_mem_per_node).
- <MEM-PER-GPU>: Suggested CPU memory per GPU in units of GB for the first job of the user. This is calculated as the maximum of 8 GB and 1.2 times the CPU memory usage per GPU of the first job of the user (a sketch of this calculation appears after this list).
- <GPUS>: GPUs per node (i.e., gpus_per_node).
- <TARGET>: The soft limit for CPU memory per GPU in GB (i.e., cpu_mem_per_gpu_target).
- <DAYS>: Number of days in the time window (default is 7).
- <TABLE>: Table of job data.
- <JOBSTATS>: The jobstats command for the first job of the user.
- <JOBID>: The JobID of the first job of the user.
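As an illustration of the <MEM-PER-GPU> calculation, the sketch below (not the tool's actual code; the function name is made up) approximates the value suggested in the example email above for job 62733079, which allocated 512 GB for 2 GPUs at 17% memory efficiency. The tool may round the result differently (the email shows 50G).

def suggested_mem_per_gpu(alloc_cpu_mem_gb, mem_eff, gpus):
    """Suggested CPU memory per GPU in GB: max(8 GB, 1.2 x used memory per GPU)."""
    used_per_gpu = alloc_cpu_mem_gb * mem_eff / gpus
    return max(8, 1.2 * used_per_gpu)

print(suggested_mem_per_gpu(512, 0.17, 2))   # about 52 GB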
Usage
Generate a report of the jobs allocating too much CPU memory per GPU:
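$ python job_defense_shield.py --too-much-cpu-mem-per-gpu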
Email users about allocating too much CPU memory per GPU:
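Assuming the tool's usual --email flag (confirm against your installation):

$ python job_defense_shield.py --too-much-cpu-mem-per-gpu --email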
See which users have received emails and when:
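Assuming the tool's usual --check flag (confirm against your installation):

$ python job_defense_shield.py --too-much-cpu-mem-per-gpu --check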
cron
Below is an example entry for crontab:
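The entry below is only a sketch; the schedule, Python interpreter, script path, and log file are placeholders to adapt to your installation:

0 9 * * * /path/to/python /path/to/job_defense_shield.py --too-much-cpu-mem-per-gpu --email > /path/to/log.too-much-cpu-mem-per-gpu 2>&1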