Setup Overview
Below is an outline of the setup of the Jobstats platform for a Slurm cluster:
- Switch from Linux process accounting to cgroup-based job accounting
- Set up the exporters: cgroup, node, GPU (on the nodes) and, optionally, GPFS (centrally)
- Set up the prolog.d and epilog.d scripts on the GPU nodes
- Set up the Prometheus server and configure it to scrape the data from the compute nodes
- Set up the slurmctldepilog.sh script for long-term job summary retention
- Lastly, configure the Grafana interface and Open OnDemand
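As a rough illustration of the Prometheus step in the outline above, a scrape configuration might look like the sketch below. This is a minimal example under stated assumptions, not the configuration used by any particular site: the job name, the hostname `node001`, and the scrape interval are hypothetical, and the exporter ports are assumed defaults (9100 is the standard node exporter port; the cgroup and GPU exporter ports depend on how those exporters are started).

```yaml
# prometheus.yml (fragment): a hypothetical sketch, adjust names and ports to your site
scrape_configs:
  - job_name: "compute-nodes"        # hypothetical job name
    scrape_interval: 30s             # assumed interval; tune to cluster size
    static_configs:
      - targets:
          - "node001:9100"           # node exporter (standard default port)
          - "node001:9306"           # cgroup exporter (assumed port)
          - "node001:9445"           # GPU exporter (assumed port, GPU nodes only)
```

On a real cluster the target list would normally be generated for all compute nodes rather than written by hand, for example through Prometheus file-based service discovery.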
A single standard server has proven sufficient for a data center with 150,000 CPU cores and 1,000 GPUs.
Proceed to the next section on cgroup-based job accounting.