Installation
We assume that the Jobstats platform is available and working.
Cancelling Jobs at 0% GPU Utilization
To automatically cancel actively running jobs, the software must be run as a user with sufficient privileges to call scancel. This may inform your decision of where to install the software and which user should run it. All of the other alerts and reports can be run as a regular user.
Python 3.7 is the minimum supported version. The required dependencies for Job Defense Shield are pandas, pyarrow, pyyaml, and requests. The requests module is needed for addressing the underutilization of actively running jobs, since in this case the Prometheus server must be queried.
Job Defense Shield can be installed from PyPI:
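$ pip install job-defense-shield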
To install the software into an isolated environment:
$ python -m venv jds-env
$ source jds-env/bin/activate
(jds-env) $ pip install --upgrade pip
(jds-env) $ pip install job-defense-shield
At a later time, to upgrade to the latest version:
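(jds-env) $ pip install --upgrade job-defense-shield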
It is strongly recommended to use the latest version.
Testing the Installation
The simplest test is to run the help menu:
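$ job_defense_shield --help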
If the command above fails then see Troubleshooting the Installation.
Next, try running a simple informational alert. To do this, make a trivial configuration file called config.yaml in the current working directory:
%YAML 1.1
---
#####################
## GLOBAL SETTINGS ##
#####################
violation-logs-path: /path/to/writable/directory/
email-files-path: /path/to/readable/directory/
email-domain-name: "@institution.edu"
sender: support@institution.edu
reply-to: support@institution.edu
report-emails:
  - admin@institution.edu
Tip
If the path that you specified for violation-logs-path does not exist then the software will try to create it. You can use any valid path for now. In the next section, you will choose the production paths for violation-logs-path and email-files-path.
Be sure to replace email-domain-name, sender, reply-to, and report-emails with your values.
To test the software, run this command (which does not send any emails):
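$ job_defense_shield --usage-overview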
The command above will show an overview of the number of CPU-hours and GPU-hours across all clusters and partitions in the Slurm database over the past 7 days. Here is an example:
$ job_defense_shield --usage-overview
Job Defense Shield (1.x.y)
github.com/PrincetonUniversity/job_defense_shield
INFO: Fri May 9, 2025 at 5:00 PM
INFO: Python 3.11.11 (main, Dec 11 2024, 16:28:39) [GCC 11.2.0]
INFO: Pandas 2.2.3
INFO: Configuration file is /home/aturing/software/jds/config.yaml
INFO: Calling sacct ... done (15 seconds).
INFO: Cleaning sacct data
268639 jobs in the raw dataframe
258427 jobs in the cleaned dataframe
              Usage Overview
-------------------------------------------
Cluster   Users      CPU-Hours    GPU-Hours
-------------------------------------------
della       464  1285938 (16%)  91714 (65%)
stellar     149  6745324 (82%)    1926 (1%)
traverse      1    189987 (2%)  47497 (34%)
-------------------------------------------
              Usage Overview by Partition
--------------------------------------------------------
Cluster   Partition   Users      CPU-Hours     GPU-Hours
--------------------------------------------------------
della     cpu           311   874114 (68%)        0 (0%)
della     pli-c          28    115406 (9%)   25838 (28%)
della     gpu-shared     98     83617 (7%)   30083 (33%)
della     datasci        31     80954 (6%)        0 (0%)
della     gpu            51     47475 (4%)   16503 (18%)
della     pli            20     35814 (3%)     6249 (7%)
della     cryoem         17     20897 (2%)     4110 (4%)
della     physics         5     12968 (1%)        0 (0%)
della     mig            41      7169 (1%)     7169 (8%)
della     pli-lc          5      3107 (0%)     1081 (1%)
della     gputest        99      1948 (0%)      647 (1%)
della     all             1      1280 (0%)        0 (0%)
della     monia           4      1003 (0%)        0 (0%)
della     gpu-ee          2       173 (0%)       23 (0%)
della     grace           1        11 (0%)       11 (0%)
della     salik           1         2 (0%)        0 (0%)
stellar   cimes          21  2941001 (44%)        0 (0%)
stellar   pu             56  2426873 (36%)        0 (0%)
stellar   pppl           33  1340776 (20%)        0 (0%)
stellar   serial         41     13187 (0%)        0 (0%)
stellar   all            48     12377 (0%)        0 (0%)
stellar   gpu            20     11044 (0%)   1926 (100%)
stellar   bigmem          1        66 (0%)        0 (0%)
traverse  all             1  189987 (100%)  47497 (100%)
--------------------------------------------------------
Start: Fri May 02, 2025 at 05:00 PM
End: Fri May 09, 2025 at 05:00 PM
You can go further back in time by using the --days option. For example, to consider the past 14 days:
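$ job_defense_shield --usage-overview --days=14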
Info
Using a large value for the --days option can cause the Slurm database to fail to produce the data. The default is 7 days.
One can include data from only specific clusters or partitions by using the -M and -r options from sacct. For example:
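$ job_defense_shield --usage-overview -M della -r cpu,gpu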
The -M and -r options (or --clusters and --partition) can be used to reduce the load on the database server when an alert only applies to a particular cluster or particular partitions. These options are passed through to sacct. See man sacct for more information.
Email Test
By having your email address in report-emails in config.yaml, the --report flag can be used to send the output to administrators by email:
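$ job_defense_shield --usage-overview --report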
This feature is useful when combined with cron. That is, one can receive a daily report showing all of the instances of underutilization across all of the systems (see reports).
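For example, a crontab entry along these lines (a sketch; the installation path is illustrative, and the software must be able to find your configuration file) would email the overview at 9 AM each day:
0 9 * * * /path/to/jds-env/bin/job_defense_shield --usage-overview --report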
Troubleshooting the Installation
Make sure you are using the python in the isolated environment. All three commands below should run successfully:
$ python --version
$ python -c "import pandas; print(pandas.__version__)"
$ python -c "import pyyaml; print(pyyaml.__version__)"
If the configuration file is not found, try specifying the full path. A sketch, assuming the software accepts a --config-file option (run the help menu to confirm the exact flag):
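$ job_defense_shield --usage-overview --config-file=/path/to/config.yaml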
Creating a Configuration File for Production
See the next section to learn how to write a proper configuration file.