Installation
We assume that the Jobstats platform is available and working.
Cancelling Jobs at 0% GPU Utilization
To automatically cancel actively running jobs, the software must be run as a user with sufficient privileges to call scancel. This may inform your decision of where to install the software. All of the other alerts can be run as a regular user.
The installation requirements for Job Defense Shield are pandas and pyyaml. The requests module is needed only if one wants to address the underutilization of actively running jobs. In this case, the Prometheus server must be queried.
Python 3.6+ and pandas 1.2+ are the minimum required versions. We recommend using newer versions such as Python 3.10+ and pandas 2.1+.
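The version requirements above can be checked from Python itself. The following is a minimal sketch and not part of Job Defense Shield: the helper names parse_version and check are hypothetical, and only the version numbers are taken from the text above.

```python
import sys

# Minimum and recommended versions as stated in the text above.
MIN_PYTHON = (3, 6)
MIN_PANDAS = (1, 2)
RECOMMENDED_PYTHON = (3, 10)
RECOMMENDED_PANDAS = (2, 1)


def parse_version(text):
    """Return the leading (major, minor) integers of a version string."""
    parts = []
    for piece in text.split(".")[:2]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 2:
        parts.append(0)
    return tuple(parts)


def check(name, found, minimum, recommended):
    """Classify a (major, minor) version as too old, ok, or recommended."""
    if found < minimum:
        status = "too old"
    elif found < recommended:
        status = "ok"
    else:
        status = "recommended"
    print("{}: {}.{} ({})".format(name, found[0], found[1], status))
    return status


check("python", tuple(sys.version_info[:2]), MIN_PYTHON, RECOMMENDED_PYTHON)
try:
    import pandas
    check("pandas", parse_version(pandas.__version__), MIN_PANDAS, RECOMMENDED_PANDAS)
except ImportError:
    print("pandas: not installed")
```

Running this with the same python that will run job_defense_shield.py shows whether that interpreter meets the minimums.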
Here are some ways the requirements can be installed:
Next, pull down the repository:
Testing the Installation
The simplest test is to run the help menu by passing the --help option to job_defense_shield.py.
If the command above fails, see Troubleshooting the Installation.
Next, try running a simple informational alert. To do this, make a trivial configuration file called config.yaml in the same directory as job_defense_shield.py:
%YAML 1.1
---
#####################
## GLOBAL SETTINGS ##
#####################
violation-logs-path: /path/to/writable/directory/
email-domain-name: "@institution.edu"
sender: support@institution.edu
reply-to: support@institution.edu
report-emails:
- admin@institution.edu
Tip
If the path that you use for violation-logs-path does not exist then the software will try to make it. You can use anything for now. In the next section, you will choose a path to use for production.
Be sure to replace email-domain-name, sender, reply-to and report-emails with your values.
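Before running the software, one may want to confirm that no placeholder values were left in the file. The sketch below is not part of Job Defense Shield. The names check_config and load_config are hypothetical, and treating all five settings as mandatory is an assumption based only on the example above.

```python
# Settings shown in the example config.yaml above. Treating every one of
# them as mandatory is an assumption made for this sketch.
EXPECTED_KEYS = [
    "violation-logs-path",
    "email-domain-name",
    "sender",
    "reply-to",
    "report-emails",
]

# Domain used by the placeholder values in the example.
PLACEHOLDER = "institution.edu"


def check_config(config):
    """Return a list of human-readable problems found in a config dict."""
    problems = []
    for key in EXPECTED_KEYS:
        if key not in config or config[key] in (None, "", []):
            problems.append("missing setting: " + key)
    # Flag values that still contain the placeholder from the example.
    for key, value in config.items():
        values = value if isinstance(value, list) else [value]
        if any(PLACEHOLDER in str(v) for v in values):
            problems.append("placeholder value in: " + key)
    return problems


def load_config(path):
    """Read a YAML file into a dict (requires the pyyaml package)."""
    import yaml  # the pyyaml package is imported under the name "yaml"
    with open(path) as handle:
        return yaml.safe_load(handle)


# The example file from above, written as a dict, still has placeholders.
example = {
    "violation-logs-path": "/path/to/writable/directory/",
    "email-domain-name": "@institution.edu",
    "sender": "support@institution.edu",
    "reply-to": "support@institution.edu",
    "report-emails": ["admin@institution.edu"],
}
for problem in check_config(example):
    print(problem)
```

To check a real file, one could call check_config(load_config("config.yaml")) from the directory that contains it.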
To test the software, run job_defense_shield.py with the --usage-overview option, which does not send any emails.
This command shows an overview of the number of CPU-hours and GPU-hours across all clusters and partitions in the Slurm database over the past 7 days. Here is an example:
$ python job_defense_shield.py --usage-overview
             Usage Overview
-----------------------------------------
cluster   users     cpu-hours   gpu-hours
-----------------------------------------
della       464 1285938 (16%) 91714 (65%)
stellar     149 6745324 (82%)   1926 (1%)
traverse      1   189987 (2%) 47497 (34%)
-----------------------------------------
             Usage Overview by Partition
------------------------------------------------------
cluster   partition  users      cpu-hours    gpu-hours
------------------------------------------------------
della     cpu          311   874114 (68%)       0 (0%)
della     pli-c         28    115406 (9%)  25838 (28%)
della     gpu-shared    98     83617 (7%)  30083 (33%)
della     datasci       31     80954 (6%)       0 (0%)
della     gpu           51     47475 (4%)  16503 (18%)
della     pli           20     35814 (3%)    6249 (7%)
della     cryoem        17     20897 (2%)    4110 (4%)
della     physics        5     12968 (1%)       0 (0%)
della     mig           41      7169 (1%)    7169 (8%)
della     pli-lc         5      3107 (0%)    1081 (1%)
della     gputest       99      1948 (0%)     647 (1%)
della     all            1      1280 (0%)       0 (0%)
della     donia          4      1003 (0%)       0 (0%)
della     gpu-ee         2       173 (0%)      23 (0%)
della     grace          1        11 (0%)      11 (0%)
della     malik          1         2 (0%)       0 (0%)
stellar   cimes         21  2941001 (44%)       0 (0%)
stellar   pu            56  2426873 (36%)       0 (0%)
stellar   pppl          33  1340776 (20%)       0 (0%)
stellar   serial        41     13187 (0%)       0 (0%)
stellar   all           48     12377 (0%)       0 (0%)
stellar   gpu           20     11044 (0%)  1926 (100%)
stellar   bigmem         1        66 (0%)       0 (0%)
traverse  all            1  189987 (100%) 47497 (100%)
------------------------------------------------------
Start: Fri Mar 07, 2025 at 11:27 AM
End: Fri Mar 14, 2025 at 11:27 AM
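The percentages in the first table appear to be each cluster's share of the column total, and in the second table each partition's share within its cluster. This reading is inferred from the numbers and is not stated by the output. As a short sketch, the cluster-level percentages can be reproduced from the totals shown above (the function name shares is hypothetical):

```python
# Cluster totals copied from the example output above.
cpu_hours = {"della": 1285938, "stellar": 6745324, "traverse": 189987}
gpu_hours = {"della": 91714, "stellar": 1926, "traverse": 47497}


def shares(hours):
    """Return each cluster's share of the total as a rounded percentage."""
    total = sum(hours.values())
    return {name: round(100 * value / total) for name, value in hours.items()}


print(shares(cpu_hours))
print(shares(gpu_hours))
```

The rounded values agree with the percentages printed in the first table.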
You can go further back in time by using the --days option:
Info
Using a large value for the --days option can cause the Slurm database to fail to produce the data. The default is 7 days.
One can restrict the data to specific clusters or partitions using the -M and -r options from sacct:
The -M and -r options (or --clusters and --partition) can be used to reduce the load on the database server when an alert only applies to a particular cluster or particular partitions. These options are passed through to sacct. See man sacct for more information.
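When the command is launched from a script, the options described above can be assembled programmatically. The following is a minimal sketch, not part of Job Defense Shield. The function build_command is hypothetical, and passing each value as a separate argument (for example --days 14) is an assumption about how the options are parsed. The cluster and partition names are taken from the example output.

```python
import shlex


def build_command(days=7, clusters=None, partition=None):
    """Assemble the argument list for a usage-overview run."""
    cmd = ["python", "job_defense_shield.py", "--usage-overview"]
    if days != 7:  # 7 days is the documented default
        cmd += ["--days", str(days)]
    if clusters:  # comma-separated cluster names, as accepted by sacct -M
        cmd += ["-M", clusters]
    if partition:  # comma-separated partition names, as accepted by sacct -r
        cmd += ["-r", partition]
    return cmd


# Print the command line instead of running it.
cmd = build_command(days=14, clusters="della", partition="gpu,pli")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be handed to subprocess.run once the printed command line looks right.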
Email Test
If your email address is listed under report-emails in config.yaml, the --report flag can be used to send the output to administrators by email:
This feature is useful when combined with cron. That is, one can receive a daily report showing all of the instances of underutilization across all of the systems (see reports).
Troubleshooting the Installation
Make sure you are using the right python. All three commands below should run successfully:
$ python -c "import sys; print(sys.version)"
$ python -c "import pandas; print(pandas.__version__)"
$ python -c "import yaml; print(yaml.__version__)"
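The same checks can be collected into one short script that also prints which interpreter is in use. This is a sketch only, and the function name find_missing is hypothetical. Note that the pyyaml package is imported under the name yaml.

```python
import importlib.util
import sys

# Package name -> import name. The two differ for pyyaml, which is
# imported as "yaml". The requests module is needed only for alerts
# that query the Prometheus server.
MODULES = {"pandas": "pandas", "pyyaml": "yaml", "requests": "requests"}


def find_missing(modules=MODULES):
    """Return the package names whose import name cannot be located."""
    return [pkg for pkg, name in modules.items()
            if importlib.util.find_spec(name) is None]


print("interpreter:", sys.executable)
print("version:", sys.version.split()[0])
print("missing packages:", find_missing() or "none")
```

If the interpreter path printed is not the one where the packages were installed, that mismatch is the likely cause of the failure.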
If the configuration file is not found then try specifying the full path:
Creating a Configuration File for Production
See the next section to learn how to write a proper configuration file.