Setup Overview
Below is an outline of the steps needed to set up the Jobstats platform on a Slurm cluster:
- Switch to cgroup-based job accounting from Linux process accounting
- Set up the exporters: cgroup, node, GPU (on the nodes) and, optionally, GPFS (centrally)
- Set up the prolog.d and epilog.d scripts on the GPU nodes
- Set up the Prometheus server and configure it to scrape the data from the compute nodes
- Set up the slurmctldepilog.sh script for long-term job summary retention
- Lastly, configure the Grafana interface and Open OnDemand
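As a rough illustration of the first step, switching to cgroup-based accounting usually comes down to a few plugin settings in slurm.conf. The fragment below is a minimal sketch that assumes the standard Slurm cgroup plugins; the exact values for a given site are an assumption and should be checked against the Slurm version in use and the next section.

```
# slurm.conf (sketch; confirm against your Slurm version)
JobAcctGatherType=jobacct_gather/cgroup   # gather job accounting data from cgroups
ProctrackType=proctrack/cgroup            # track job processes with cgroups
TaskPlugin=task/affinity,task/cgroup      # confine tasks to their allocated resources
```

Changes to these parameters generally require restarting the Slurm daemons before they take effect.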
A single standard server has proven to be sufficient for a data center with 100,000 CPU-cores and 1000 GPUs.
Proceed to the next section on cgroup-based job accounting.