Generating Job Summaries
When all four exporters are used, the Prometheus database stores 17 metrics for each job every N seconds, where N is the configured sampling interval. It is recommended to retain this detailed data for several months or longer. The detailed data can be visualized using the Grafana dashboard. After some amount of time, it makes sense to purge the detailed data while keeping only a summary (i.e., CPU/GPU utilization and memory usage per node). The summary data can also be used in place of the detailed data when generating efficiency reports.
A summary of individual job statistics is generated at job completion and stored in the AdminComment field of the Slurm database (MariaDB). This is done by a slurmctld epilog script, configured via the EpilogSlurmctld parameter in slurm.conf, for example (the install path shown below is illustrative):
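EpilogSlurmctld=/usr/local/sbin/slurmctldepilog.sh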
The script is available in the Jobstats GitHub repository and needs to be installed on the slurmctld server along with Jobstats. For storage efficiency and convenience, the JSON job summary is gzipped and base64-encoded before being stored in the AdminComment field.
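The encoding can be reproduced in a few lines of Python (a minimal sketch; the JS1: prefix matches the one stripped by sed in the decoding pipeline shown later):

import base64
import gzip
import json

summary = {"gpus": 1, "nodes": {}, "total_time": 50944}  # example job summary

# gzip-compress the JSON text, then base64-encode it so that it can be stored
# as plain text in AdminComment; the JS1: prefix identifies the format
packed = "JS1:" + base64.b64encode(gzip.compress(json.dumps(summary).encode())).decode()

# reverse the process to recover the summary
unpacked = json.loads(gzip.decompress(base64.b64decode(packed[len("JS1:"):])))
assert unpacked == summary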
The impact on database size depends on job sizes. At an institution with 100,000 CPU-cores, the AdminComment entries for small jobs tend to average under 50 characters with a maximum under 1500 characters, while for large jobs the maximum length is around 5000 characters.
For processing old jobs where the slurmctld epilog script did not run, or for jobs where it failed, there is a per-cluster Jobstats ingest service. It combines a Python script, jobs_with_no_data.py, which returns a list of recent jobs with an empty AdminComment, and a bash script, ingest_jobstats, which uses that list to process those jobs and set AdminComment. Since jobs_with_no_data.py needs database access, it is easiest to run this on the slurmdbd host, either as a cron job or as a systemd timer and service (see the sketch below). These scripts (ingest_jobstats and jobs_with_no_data.py) and the systemd timer and service units are in the slurm directory of the Jobstats GitHub repository.
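Below is a minimal sketch of the systemd units for the ingest step, assuming ingest_jobstats is installed in /usr/local/sbin (the actual units in the repository may differ):

# /etc/systemd/system/ingest_jobstats.service
[Unit]
Description=Set AdminComment for completed jobs missing a Jobstats summary

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ingest_jobstats

# /etc/systemd/system/ingest_jobstats.timer
[Unit]
Description=Run ingest_jobstats hourly

[Timer]
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=timers.target

The timer is enabled with systemctl enable --now ingest_jobstats.timer.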
Below is an example job summary for a GPU job:
$ jobstats 12345678 -j
{
  "gpus": 1,
  "nodes": {
    "della-k1g1": {
      "cpus": 12,
      "gpu_total_memory": {
        "1": 85899345920
      },
      "gpu_used_memory": {
        "1": 83314868224
      },
      "gpu_utilization": {
        "1": 98.6
      },
      "total_memory": 137438953472,
      "total_time": 57620.8,
      "used_memory": 84683702272
    }
  },
  "total_time": 50944
}
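The efficiency metrics follow directly from these fields. Here is a minimal sketch, assuming the node-level total_time is the CPU time (in CPU-seconds) used on that node and the top-level total_time is the elapsed run time of the job:

node = {"cpus": 12, "gpu_utilization": {"1": 98.6},
        "total_memory": 137438953472, "total_time": 57620.8,
        "used_memory": 84683702272}          # values from the summary above
elapsed = 50944                              # top-level total_time (seconds)

cpu_eff = 100 * node["total_time"] / (node["cpus"] * elapsed)   # CPU time / (cores * elapsed)
mem_eff = 100 * node["used_memory"] / node["total_memory"]      # CPU memory efficiency
gpu_util = sum(node["gpu_utilization"].values()) / len(node["gpu_utilization"])

print(f"CPU: {cpu_eff:.1f}%  CPU memory: {mem_eff:.1f}%  GPU: {gpu_util:.1f}%")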
The following pipeline shows how the summary statistics are recovered from an AdminComment entry in the database:
$ sacct -j 823722 -n -o admincomment%250 | sed 's/JS1://' | tr -d ' ' \
| base64 -d \
| gzip -d \
| jq
{
  "nodes": {
    "della-l01g2": {
      "total_memory": 33554432000,
      "used_memory": 27482066944,
      "total_time": 8591.7,
      "cpus": 1,
      "gpu_total_memory": {
        "1": 10200547328
      },
      "gpu_used_memory": {
        "1": 124780544
      }
    }
  },
  "total_time": 432301,
  "gpus": 1
}
Only completed jobs have entries in the database.
Using pandas to Analyze the Data
The Python code below can be used to work with the summary statistics in AdminComment:
$ wget https://raw.githubusercontent.com/PrincetonUniversity/job_defense_shield/refs/heads/main/src/job_defense_shield/efficiency.py
$ wget https://raw.githubusercontent.com/jdh4/saccta/refs/heads/main/gpu_usage.py
The script gpu_usage.py illustrates how to get the GPU utilization per job. The other functions in efficiency.py can be used to get the other metrics (e.g., cpu_memory_usage). See also the source code for Job Defense Shield where efficiency.py is used.
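For a self-contained illustration of the same idea, the AdminComment entries returned by sacct can be decoded and loaded into a pandas DataFrame (a minimal sketch; the start and end dates are placeholders and the field names follow the examples above):

import base64
import gzip
import json
import subprocess

import pandas as pd

# Pull jobid and AdminComment for completed jobs in a time window.
cmd = "sacct -a -X -P -n -S 2025-10-01 -E 2025-10-08 -o jobid,admincomment"
rows = subprocess.run(cmd.split(), capture_output=True, text=True).stdout.splitlines()

records = []
for row in rows:
    jobid, admincomment = row.split("|", 1)
    if not admincomment.startswith("JS1:"):
        continue  # no summary was stored for this job
    summary = json.loads(gzip.decompress(base64.b64decode(admincomment[4:])))
    for node, stats in summary["nodes"].items():
        records.append({"jobid": jobid, "node": node, **stats})

df = pd.DataFrame(records)
df["cpu_mem_eff"] = 100 * df["used_memory"] / df["total_memory"]   # percent
print(df[["jobid", "node", "cpus", "cpu_mem_eff"]].head())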
Analyzing the Prometheus Data
The summary statistics capture only a small fraction of the total data associated with each job. To work with the other metrics, one must query the Prometheus server directly.
Here is an example of getting the mean power usage per GPU for a job that used 4 GPUs:
$ export SLURM_TIME_FORMAT=%s
$ sacct -j 1191148 -X -o start,end
              Start                 End
------------------- -------------------
         1759675685          1759679604
The run time of the job is 1759679604 - 1759675685 = 3919 seconds.
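The same start and end times can also be obtained programmatically, for example (a minimal sketch that wraps the sacct command above):

import os
import subprocess

env = {**os.environ, "SLURM_TIME_FORMAT": "%s"}     # report times as Unix timestamps
out = subprocess.run(["sacct", "-j", "1191148", "-X", "-n", "-o", "start,end"],
                     capture_output=True, text=True, env=env).stdout
start, end = map(int, out.split())
elapsed = end - start                               # 3919 seconds for this job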
The Python script below can be used to obtain the mean power per GPU:
import json
import requests

# Mean power draw per GPU: average nvidia_gpu_power_usage_milliwatts over the
# job's 3919-second run time, evaluated at the job's end time. The "and" with
# nvidia_gpu_jobId == 1191148 restricts the series to the GPUs used by the job.
params = {'query': 'avg_over_time((nvidia_gpu_power_usage_milliwatts{cluster="della"} and nvidia_gpu_jobId == 1191148)[3919s:])',
          'time': 1759679604}
response = requests.get('http://vigilant2:8480/api/v1/query', params)
data = response.json()
print(json.dumps(data, indent=2))
The output is:
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "cluster": "della",
          "instance": "della-l02g8:9445",
          "job": "Della GPU Nodes",
          "jobid": "1191148",
          "minor_number": "1",
          "name": "NVIDIA A100 80GB PCIe",
          "ordinal": "1",
          "service": "compute",
          "userid": "331233",
          "uuid": "GPU-e8ba89df-0d52-4693-7098-1b38647e1462"
        },
        "value": [
          1759679595,
          "136074.7816091954"
        ]
      },
      {
        "metric": {
          "cluster": "della",
          "instance": "della-l02g8:9445",
          "job": "Della GPU Nodes",
          "jobid": "1191148",
          "minor_number": "3",
          "name": "NVIDIA A100 80GB PCIe",
          "ordinal": "3",
          "service": "compute",
          "userid": "331233",
          "uuid": "GPU-cd73f312-e09f-6884-cf10-00982b08d58a"
        },
        "value": [
          1759679595,
          "126442.79310344828"
        ]
      },
      {
        "metric": {
          "cluster": "della",
          "instance": "della-l02g8:9445",
          "job": "Della GPU Nodes",
          "jobid": "1191148",
          "minor_number": "0",
          "name": "NVIDIA A100 80GB PCIe",
          "ordinal": "0",
          "service": "compute",
          "userid": "331233",
          "uuid": "GPU-207d45af-87b6-798b-b87c-ad7c9e7f6c35"
        },
        "value": [
          1759679595,
          "130576.59770114943"
        ]
      },
      {
        "metric": {
          "cluster": "della",
          "instance": "della-l02g8:9445",
          "job": "Della GPU Nodes",
          "jobid": "1191148",
          "minor_number": "2",
          "name": "NVIDIA A100 80GB PCIe",
          "ordinal": "2",
          "service": "compute",
          "userid": "331233",
          "uuid": "GPU-aa72d60c-a2bd-d330-9ad2-61d0b2750c0b"
        },
        "value": [
          1759679595,
          "127405.17624521072"
        ]
      }
    ]
  }
}
There are four entries like "value": [1759679595, "127405.17624521072"], one per GPU, which give the mean power in milliwatts (e.g., 127405 mW ≈ 127 W).
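Continuing from the script above, the per-GPU means can be extracted and converted to watts:

# extract the mean power per GPU (keyed by ordinal) and convert mW to W
watts = {r["metric"]["ordinal"]: float(r["value"][1]) / 1000
         for r in data["data"]["result"]}
print(watts)                             # e.g. {'1': 136.1, '3': 126.4, '0': 130.6, '2': 127.4}
print(sum(watts.values()) / len(watts))  # mean power per GPU in watts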