Grafana

The four exporters produce a wealth of data in the Prometheus database. Grafana is used to visualize this data. To set up Grafana, follow the directions at grafana.com.

The Grafana dashboard JSON file, which uses all of the exporters, is included in the grafana subdirectory of the Jobstats GitHub repository. The dashboard expects one parameter, jobid. Because the time range of a job may not be easy to find, we also use an OnDemand helper that generates the correct time range for a given jobid (see the next section).
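
To illustrate how a jobid and a time range can be combined into a dashboard link, here is a minimal Python sketch. It relies on Grafana's standard URL conventions: template variables are passed as var-<name> and the time range as from/to in epoch milliseconds. The function name, the host grafana.example.org, and the dashboard path d/abc123/jobstats are hypothetical placeholders; this is not the OnDemand helper itself.

```python
from urllib.parse import urlencode


def grafana_job_url(base_url: str, dashboard_path: str, jobid: int,
                    start_epoch_s: int, end_epoch_s: int) -> str:
    """Build a Grafana dashboard link for one job (illustrative helper)."""
    if end_epoch_s < start_epoch_s:
        raise ValueError("end time precedes start time")
    query = urlencode({
        "var-jobid": jobid,            # sets the dashboard's jobid variable
        "from": start_epoch_s * 1000,  # Grafana expects epoch milliseconds
        "to": end_epoch_s * 1000,
    })
    return f"{base_url.rstrip('/')}/{dashboard_path.strip('/')}?{query}"


# Hypothetical host, dashboard path, jobid, and job start/end times.
url = grafana_job_url("https://grafana.example.org", "d/abc123/jobstats",
                      1234567, 1700000000, 1700003600)
print(url)
```

In practice the start and end times would come from the scheduler's accounting records for the job rather than being typed in by hand.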

The following job-level metrics are available in both Grafana and the jobstats command:

CPU Utilization
CPU Memory Utilization
GPU Utilization
GPU Memory Utilization

The following additional job-level metrics are exposed only in Grafana:

GPU Power Usage
GPU Temperature

Finally, the following additional node-level metrics are exposed only in Grafana:

CPU Percentage Utilization
Total Memory Utilization
Mean Frequency Over All CPUs
NFS Statistics
Local Disc R/W
GPFS Bandwidth Statistics
Local Disc IOPS
GPFS Operations per Second Statistics
InfiniBand Throughput
InfiniBand Packet Rate
InfiniBand Errors

The complete Grafana interface for the Jobstats platform is composed of plots of the time history of the seventeen quantities above. This graphical interface is used for detailed investigations such as troubleshooting failed jobs, identifying jobs with CPU memory leaks, intermittent GPU usage, or load imbalance, and understanding the anomalous behavior of system hardware.

Note

Eleven of the seventeen metrics above are node-level. This means that if multiple jobs are running on the same node, the data cannot be attributed to any one of them. To use these metrics to troubleshoot a job, the job should allocate the entire node.
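
The reasoning in this note can be sketched in a few lines of Python: node-level metrics can be attributed to a job only on nodes where it is the sole job. The function name and the jobs_by_node mapping are hypothetical; a real check would take this mapping from the scheduler.

```python
def exclusive_nodes(jobid, jobs_by_node):
    """Return the nodes on which jobid is the only running job."""
    return sorted(node for node, jobs in jobs_by_node.items()
                  if set(jobs) == {jobid})


# Hypothetical placement of two jobs across three nodes.
jobs_by_node = {
    "node01": [101],
    "node02": [101, 202],  # shared node: node-level data mixes both jobs
    "node03": [202],
}
print(exclusive_nodes(101, jobs_by_node))
```

For job 101 in this example, only the node-level plots for node01 reflect that job alone.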

The following image illustrates what the dashboard looks like in use: