Job Defense Shield
High-performance computing clusters often serve a large number of users who possess a wide range of knowledge and skills. This leads to individuals misusing the resources due to mistakes, misunderstandings, expediency, and related issues. To combat jobs that waste or misuse the resources, a battery of alerts can be configured. While some alerts can be implemented in Prometheus, the most flexible and powerful solution is external software.
Job Defense Shield is a Python code for sending automated email alerts to users and for creating reports for system administrators. As discussed above, summary statistics for each completed job are stored in a compressed format in the AdminComment field in the Slurm database. The software described here works by calling the Slurm sacct command while requesting several fields including AdminComment. The sacct output is stored in a pandas dataframe for processing.
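A minimal sketch of this approach is shown below: call sacct, request a handful of fields, and load the output into a pandas dataframe. The field list, date range, and parsing details here are illustrative placeholders; the fields actually requested by Job Defense Shield may differ.

import subprocess
from io import StringIO
import pandas as pd

fields = "jobid,user,cluster,partition,state,alloctres,elapsedraw,admincomment"
cmd = ["sacct",
       "-a",                # all users
       "-X",                # job allocations only (no job steps)
       "-P", "-n",          # pipe-delimited output with no header line
       "-S", "2024-05-01",  # start date (placeholder)
       "-E", "now",         # end date
       "-o", fields]
raw = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# -P produces "|"-delimited rows, so the output parses directly into a dataframe
df = pd.read_csv(StringIO(raw), sep="|", header=None, names=fields.split(","))
print(df.head())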
Automated email alerts to users are available for these cases:
- CPU or GPU jobs with 0% utilization (see email below)
- Heavy users with low mean CPU or GPU efficiency
- Jobs that allocate excess CPU memory (see email below)
- Serial jobs that allocate multiple CPU-cores
- Users that routinely run with excessive time limits
- Jobs that could have used a smaller number of nodes
- Jobs that could have used less powerful GPUs
- Jobs that ran on specialized nodes but did not need to
All of the alerts in the list above can also be formulated as reports for system administrators. The most popular reports are:
- A list of users (and their jobs) with the most GPU-hours at 0% utilization
- A list of the heaviest users with low CPU/GPU utilization
- A list of users that are over-allocating the most CPU memory
- A list of users that are over-allocating the most time
The Python code is written using object-oriented programming techniques, which makes it easy to create new alerts and reports, as sketched below.
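As a rough illustration of this structure, a new alert can be written as a small class that knows how to find the offending jobs in the dataframe. The class, method, and column names below are hypothetical, not the actual Job Defense Shield API.

import pandas as pd

class Alert:
    """Base class: hold the jobs dataframe and email the offending users."""
    def __init__(self, df: pd.DataFrame, days: int = 7):
        self.df = df
        self.days = days

    def find_offenders(self) -> pd.DataFrame:
        raise NotImplementedError

    def send_emails(self) -> None:
        for user, jobs in self.find_offenders().groupby("user"):
            pass  # format a table of the user's jobs and send one email per user

class ExcessCPUMemory(Alert):
    """Jobs that use only a small fraction of their allocated CPU memory."""
    def find_offenders(self) -> pd.DataFrame:
        pct = 100 * self.df["memory_used_gb"] / self.df["memory_alloc_gb"]
        return self.df[pct < 10]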
Example Emails
Below is an example email for the automatic cancellation of a GPU job with 0% utilization:
Hi Alan,
The jobs below have been cancelled because they ran for nearly 2 hours at 0% GPU
utilization:
JobID Cluster Partition State GPUs-Allocated GPU-Util Hours
60131148 della llm CANCELLED 4 0% 2.0
60131741 della llm CANCELLED 4 0% 1.9
See our GPU Computing webpage for three common reasons for encountering zero GPU
utilization:
https://<your-institution>.edu/knowledge-base/gpu-computing
Replying to this automated email will open a support ticket with Research
Computing.
Below is an example email to a user that is requesting too much CPU memory:
Hi Alan,
Below are your jobs that ran on BioCluster in the past 7 days:
JobID Memory-Used Memory-Allocated Percent-Used Cores Hours
5761066 2 GB 100 GB 2% 1 48
5761091 4 GB 100 GB 4% 1 48
5761092 3 GB 100 GB 3% 1 48
It appears that you are requesting too much CPU memory for your jobs since
you are only using on average 3% of the allocated memory. For help on
allocating CPU memory with Slurm, please see:
https://<your-institution>.edu/knowledge-base/memory
Replying to this automated email will open a support ticket with Research
Computing.
Usage
The software has a check mode that shows on which days a given user received an alert of a given type. Users that appear to be ignoring the email alerts can be contacted directly. Emails to users are most effective when sent sparingly. For this reason, there is a command-line parameter to specify the amount of time that must pass before the user can receive another email of the same nature.
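As a hypothetical illustration (the flag names here are assumptions; the actual options are documented in the GitHub repository), checking when a given user last received a zero-GPU-utilization alert might look like:

$ python job_defense_shield.py --check --zero-gpu-utilization --user=aturing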
The example below shows how the script is called to notify users in the top N by usage with low CPU or GPU efficiencies over the last week:
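(The flag names below are illustrative assumptions; consult the repository documentation for the exact interface.)

$ python job_defense_shield.py --low-xpu-efficiency --days=7 --num-top-users=15 --email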
The default thresholds are 60% and 15% for CPU and GPU utilization, respectively, and N=15.
Installation
The installation requirements for Job Defense Shield are Python 3.6+ and version 1.2+ of the Python pandas package. The jobstats command is also required if one wants to examine actively running jobs, such as when looking for jobs with zero GPU utilization. The Python code, example alerts and emails, and instructions are available in the GitHub repository.