Setup Overview

Below is an outline of the steps required to set up the Jobstats platform on a Slurm cluster:

1. Switch from Linux process accounting to cgroup-based job accounting
2. Set up the exporters: cgroup, node, and GPU (on the nodes) and, optionally, GPFS (centrally)
3. Set up the prolog.d and epilog.d scripts on the GPU nodes
4. Set up the Prometheus server and configure it to scrape the data from the compute nodes
5. Set up the slurmctldepilog.sh script for long-term job summary retention
6. Configure the Grafana interface and Open OnDemand
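
As a rough illustration of the first step, the fragment below shows the kind of slurm.conf settings that typically select cgroup-based accounting and process tracking in Slurm. It is a minimal sketch, not the configuration prescribed by Jobstats: the three parameter names are standard Slurm options, but the exact values appropriate for a given site (for example, whether TaskPlugin also lists task/affinity) are an assumption here and should be checked against the Slurm documentation and the next section.

```
# slurm.conf (illustrative fragment; values are assumptions to verify per site)

# Gather job accounting data from cgroups instead of Linux process accounting
JobAcctGatherType=jobacct_gather/cgroup

# Track job processes using cgroups
ProctrackType=proctrack/cgroup

# Confine tasks to their allocated resources using cgroups
TaskPlugin=task/cgroup
```

Changes to these parameters generally require restarting the Slurm daemons, so they are best applied during a maintenance window.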

A single standard server has proven sufficient for a data center with 100,000 CPU-cores and 1,000 GPUs.

Proceed to the next section on cgroup-based job accounting.