Job Defense Shield
High-performance computing clusters often serve a large number of users who possess a wide range of knowledge and skills. This leads to individuals misusing the resources due to mistakes, misunderstandings, expediency, and related issues. To combat jobs that waste or misuse the resources, a battery of alerts can be configured. While some alerts can be implemented in Prometheus, the most flexible and powerful solution is external software.
Job Defense Shield is a Python code for sending automated email alerts to users and for creating reports for system administrators. As discussed above, summary statistics for each completed job are stored in a compressed format in the AdminComment field in the Slurm database. The software described here works by calling the Slurm sacct command while requesting several fields including AdminComment. The sacct output is stored in a pandas dataframe for processing.
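A minimal sketch of this approach is shown below: call sacct, request a handful of fields, and load the output into a pandas dataframe. The field list, date range, and parsing details here are illustrative placeholders; the fields actually requested by Job Defense Shield may differ.

import subprocess
from io import StringIO
import pandas as pd

fields = "jobid,user,cluster,partition,state,alloctres,elapsedraw,admincomment"
cmd = ["sacct",
       "-a",                # all users
       "-X",                # job allocations only (no job steps)
       "-P", "-n",          # pipe-delimited output with no header line
       "-S", "2024-05-01",  # start date (placeholder)
       "-E", "now",         # end date
       "-o", fields]
raw = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# -P produces "|"-delimited rows, so the output parses directly into a dataframe
df = pd.read_csv(StringIO(raw), sep="|", header=None, names=fields.split(","))
print(df.head())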
Automated email alerts to users are available for these cases:
- CPU or GPU jobs with 0% utilization (see email below)
- Heavy users with low mean CPU or GPU efficiency
- Jobs that allocate excess CPU memory (see email below)
- Serial jobs that allocate multiple CPU-cores
- Users that routinely run with excessive time limits
- Jobs that could have used a smaller number of nodes
- Jobs that could have used less powerful GPUs
- Jobs that ran on specialized nodes but did not need to
All of the alerts in the list above can also be formulated as reports for system administrators. The most popular reports are:
- A list of users (and their jobs) with the most GPU-hours at 0% utilization
- A list of the heaviest users with low CPU/GPU utilization
- A list of users that are over-allocating the most CPU memory
- A list of users that are over-allocating the most time
The Python code is written using object-oriented programming techniques, which makes it easy to create new alerts and reports, as sketched below.
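As a rough illustration of this structure, a new alert can be written as a small class that knows how to find the offending jobs in the dataframe. The class, method, and column names below are hypothetical, not the actual Job Defense Shield API.

import pandas as pd

class Alert:
    """Base class: hold the jobs dataframe and email the offending users."""
    def __init__(self, df: pd.DataFrame, days: int = 7):
        self.df = df
        self.days = days

    def find_offenders(self) -> pd.DataFrame:
        raise NotImplementedError

    def send_emails(self) -> None:
        for user, jobs in self.find_offenders().groupby("user"):
            pass  # format a table of the user's jobs and send one email per user

class ExcessCPUMemory(Alert):
    """Jobs that use only a small fraction of their allocated CPU memory."""
    def find_offenders(self) -> pd.DataFrame:
        pct = 100 * self.df["memory_used_gb"] / self.df["memory_alloc_gb"]
        return self.df[pct < 10]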
Example Emails
Below is an example email for the automatic cancellation of a GPU job with 0% utilization:
Hi Alan,
The jobs below have been cancelled because they ran for nearly 2 hours at 0% GPU
utilization:
JobID Cluster Partition State GPUs-Allocated GPU-Util Hours
60131148 della llm CANCELLED 4 0% 2.0
60131741 della llm CANCELLED 4 0% 1.9
See our GPU Computing webpage for three common reasons for encountering zero GPU
utilization:
https://<your-institution>.edu/knowledge-base/gpu-computing
Replying to this automated email will open a support ticket with Research
Computing.
Below is an example email to a user that is requesting too much CPU memory:
Hi Alan,
Below are your jobs that ran on BioCluster in the past 7 days:
JobID Memory-Used Memory-Allocated Percent-Used Cores Hours
5761066 2 GB 100 GB 2% 1 48
5761091 4 GB 100 GB 4% 1 48
5761092 3 GB 100 GB 3% 1 48
It appears that you are requesting too much CPU memory for your jobs since
you are only using on average 3% of the allocated memory. For help on
allocating CPU memory with Slurm, please see:
https://<your-institution>.edu/knowledge-base/memory
Replying to this automated email will open a support ticket with Research
Computing.
Usage
The software has a check mode that shows on which days a given user received an alert of a given type. Users that appear to be ignoring the email alerts can be contacted directly. Emails to users are most effective when sent sparingly. For this reason, there is a command-line parameter to specify the amount of time that must pass before the user can receive another email of the same nature.
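As a hypothetical illustration (the flag names here are assumptions; the actual options are documented in the GitHub repository), checking when a given user last received a zero-GPU-utilization alert might look like:

$ python job_defense_shield.py --check --zero-gpu-utilization --user=aturing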
The example below shows how the script is called to notify users in the top N by usage with low CPU or GPU efficiencies over the last week:
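(The flag names below are illustrative assumptions; consult the repository documentation for the exact interface.)

$ python job_defense_shield.py --low-xpu-efficiency --days=7 --num-top-users=15 --email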
The default thresholds are 60% and 15% for CPU and GPU utilization, respectively, and N=15.
Installation
The installation requirements for Job Defense Shield are Python 3.6+ and version 1.2+ of the Python pandas package. The jobstats command is also required if one wants to examine actively running jobs, such as when looking for jobs with zero GPU utilization. The Python code, example alerts and emails, and instructions are available in the GitHub repository.