Too Much Allocated CPU Memory per GPU
This alert identifies jobs that allocate too much CPU memory per GPU. The goal is to prevent the situation where there are free GPUs on a node but not enough CPU memory to accept new jobs.
Configuration File
Below is an example entry for config.yaml:
too-much-cpu-mem-per-gpu-1:
  cluster: della
  partitions:
    - gpu
    - llm
  cores_per_node: 96              # count
  gpus_per_node: 8                # count
  cpu_mem_per_node: 1000          # GB
  cpu_mem_per_gpu_target: 115     # GB
  cpu_mem_per_gpu_limit: 128      # GB
  mem_eff_thres: 0.8              # [0.0, 1.0]
  min_run_time: 30                # minutes
  email_file: "too_much_cpu_mem_per_gpu.txt"
  admin_emails:
    - admin@institution.edu
The available settings are listed below:
- cluster: Specify the cluster name as it appears in the Slurm database.
- partitions: Specify one or more Slurm partitions.
- cores_per_node: Number of CPU-cores per node.
- gpus_per_node: Number of GPUs per node.
- cpu_mem_per_node: Total CPU memory per node in units of GB.
- cpu_mem_per_gpu_target: This should be a value slightly less than the total CPU memory divided by the number of GPUs per node, in GB. For instance, nodes with 1000 GB of memory and 8 GPUs might use cpu_mem_per_gpu_target: 120. The idea is to save some memory for the operating system. This setting may be interpreted as a soft limit.
- cpu_mem_per_gpu_limit: Identify jobs with a CPU memory allocation per GPU greater than this value. It is reasonable to use the same value as cpu_mem_per_gpu_target for this setting. (A sketch of how this limit combines with the other thresholds appears after this list.)
- email_file: The text file to be used for the email message to users.
- min_run_time: (Optional) Minimum run time in minutes for a job to be included in the calculation. For example, if min_run_time: 30 is used then jobs that ran for less than 30 minutes are ignored. Default: 0
- mem_eff_thres: (Optional) Ignore jobs where the ratio of used to allocated CPU memory is greater than or equal to this value. Default: 1.0
- include_running_jobs: (Optional) If True then jobs in a state of RUNNING will be included in the calculation. The Prometheus server must be queried for each running job, which can be an expensive operation. Default: False
- nodelist: (Optional) Only apply this alert to jobs that ran on the specified nodes. See example.
- excluded_users: (Optional) List of usernames to exclude from receiving emails.
- admin_emails: (Optional) List of administrator email addresses that should receive copies of the emails that are sent to users.
- email_subject: (Optional) Subject of the email message to users.
- report_title: (Optional) Title of the report to system administrators.
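To make the interplay of cpu_mem_per_gpu_limit, mem_eff_thres, and min_run_time concrete, here is a minimal sketch in Python of when a job would be flagged. It is an illustration only, not the tool's actual code; the function name and job fields are made up, and the default values are taken from the configuration example above.

def is_flagged(alloc_cpu_mem_gb, gpus, mem_eff, run_time_minutes,
               limit_gb=128, mem_eff_thres=0.8, min_run_time=30):
    """Return True if a job allocates too much CPU memory per GPU."""
    if run_time_minutes < min_run_time:
        return False                    # too short to be included in the calculation
    if mem_eff >= mem_eff_thres:
        return False                    # the allocated memory is largely being used
    return alloc_cpu_mem_gb / gpus > limit_gb    # limit in GB per GPU

# Example: 512 GB allocated to 1 GPU, 6% memory efficiency, 48-hour run time
print(is_flagged(512, 1, 0.06, 48 * 60))         # True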
Report for System Administrators
An example report is shown below:
$ python job_defense_shield.py --too-much-cpu-mem-per-gpu
                             Too Much CPU Memory Per GPU
-------------------------------------------------------------------------------------
  JobID    User  Hours  Mem-Eff  GPUs  CPU-Mem-per-GPU  CPU-Mem-per-GPU-Limit  Emails
-------------------------------------------------------------------------------------
6279176  u29427     23       8%     1           500 GB                 240 GB   1 (4)
6279179  u29427     31       2%     1           500 GB                 240 GB   1 (4)
6270434  u15404     48       6%     1           512 GB                 240 GB   3 (2)
6293177  u15404    1.4       1%     1           512 GB                 240 GB   3 (2)
6291411  u81448    3.8       0%     1           512 GB                 240 GB       0
6283444  u35452    1.8       0%     1           500 GB                 240 GB   1 (5)
-------------------------------------------------------------------------------------
Cluster: della
Partitions: gpu
Start: Sun Mar 09, 2025 at 10:54 PM
End: Sun Mar 16, 2025 at 10:54 PM
Email Message to Users
Below is an example email (see email/too_much_cpu_mem_per_gpu.txt):
Hello Alan (u12345),
Your Della (PLI) jobs appear to be allocating more CPU memory than necessary:
   JobID  Hours  Mem-Eff  CPU-Mem  GPUs  CPU-Mem-per-GPU  CPU-Mem-per-GPU-Limit
62733079    1.3      17%   512 GB     2           256 GB                 115 GB
62735106    1.4      12%   512 GB     2           256 GB                 115 GB
Each node on Della (PLI) has 1000 GB of CPU memory and 8 GPUs. If possible,
please only allocate up to the soft limit of 115 GB of CPU memory per GPU. This
will prevent the situation where there are free GPUs on a node but not enough
CPU memory to accept new jobs.
"Mem-Eff" is the memory efficiency or the ratio of used to allocated CPU memory.
A good target value for this quantity is 80% and above. Please use an accurate
value for the --mem, --mem-per-cpu or --mem-per-gpu Slurm directive. For job
62733079, one could have used:
#SBATCH --mem-per-gpu=50G
Replying to this automated email will open a support ticket with Research
Computing.
Placeholders
The following placeholders can be used in the email file:
- <GREETING>: The greeting generated by greeting-method.
- <CLUSTER>: The cluster specified for the alert.
- <PARTITIONS>: The partitions listed for the alert.
- <CORES>: Cores per node (i.e., cores_per_node).
- <MEMORY>: CPU memory per node (i.e., cpu_mem_per_node).
- <MEM-PER-GPU>: Suggested CPU memory per GPU in units of GB for the first job of the user. This is calculated as the maximum of 8 GB and 1.2 times the CPU memory usage per GPU of the first job of the user (a sketch of this calculation appears after this list).
- <GPUS>: GPUs per node (i.e., gpus_per_node).
- <TARGET>: The soft limit for CPU memory per GPU in GB (i.e., cpu_mem_per_gpu_target).
- <DAYS>: Number of days in the time window (default is 7).
- <TABLE>: Table of job data.
- <JOBSTATS>: The jobstats command for the first job of the user.
- <JOBID>: The JobID of the first job of the user.
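As an illustration of the <MEM-PER-GPU> calculation, the sketch below (not the tool's actual code; the function name is made up) approximates the value suggested in the example email above for job 62733079, which allocated 512 GB for 2 GPUs at 17% memory efficiency. The tool may round the result differently (the email shows 50G).

def suggested_mem_per_gpu(alloc_cpu_mem_gb, mem_eff, gpus):
    """Suggested CPU memory per GPU in GB: max(8 GB, 1.2 x used memory per GPU)."""
    used_per_gpu = alloc_cpu_mem_gb * mem_eff / gpus
    return max(8, 1.2 * used_per_gpu)

print(suggested_mem_per_gpu(512, 0.17, 2))   # about 52 GB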
Usage
Generate a report of the jobs allocating too much CPU memory per GPU:
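$ python job_defense_shield.py --too-much-cpu-mem-per-gpu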
Email users about allocating too much CPU memory per GPU:
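Assuming the tool's usual --email flag (confirm against your installation):

$ python job_defense_shield.py --too-much-cpu-mem-per-gpu --email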
See which users have received emails and when:
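Assuming the tool's usual --check flag (confirm against your installation):

$ python job_defense_shield.py --too-much-cpu-mem-per-gpu --check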
cron
Below is an example entry for crontab:
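The entry below is only a sketch; the schedule, Python interpreter, script path, and log file are placeholders to adapt to your installation:

0 9 * * * /path/to/python /path/to/job_defense_shield.py --too-much-cpu-mem-per-gpu --email > /path/to/log.too-much-cpu-mem-per-gpu 2>&1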