
GPU Job Statistics

GPU metrics (currently NVIDIA only) are collected by the Jobstats GPU exporter, which is based on the exporter by Rohit Agarwal. The main local changes were to add handling of Multi-Instance GPU (MIG) devices and two additional gauge metrics: nvidia_gpu_jobId and nvidia_gpu_jobUid. The table below lists all of the collected GPU fields.

| Name | Description | Type |
|------|-------------|------|
| nvidia_gpu_duty_cycle | GPU utilization | gauge |
| nvidia_gpu_memory_total_bytes | Total memory of the GPU device in bytes | gauge |
| nvidia_gpu_memory_used_bytes | Memory used by the GPU device in bytes | gauge |
| nvidia_gpu_num_devices | Number of GPU devices | gauge |
| nvidia_gpu_power_usage_milliwatts | Power usage of the GPU device in milliwatts | gauge |
| nvidia_gpu_temperature_celsius | Temperature of the GPU device in Celsius | gauge |
| nvidia_gpu_jobId | JobId of the job currently using this GPU, as reported by Slurm | gauge |
| nvidia_gpu_jobUid | UID of the user running jobs on this GPU | gauge |

Note

The approach described here is not appropriate for clusters that allow for GPU sharing (e.g., sharding).
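
To confirm that a node's exporter is publishing these metrics, you can scrape its endpoint directly. The command below is a minimal check; port 9445 is an assumption based on the upstream exporter's default and may differ in your deployment:

$ curl -s http://localhost:9445/metrics | grep -E '^nvidia_gpu_(duty_cycle|jobId|jobUid)'

The nvidia_gpu_jobId and nvidia_gpu_jobUid samples should correspond to the values written to /run/gpustat for each GPU, as described in the next section.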

GPU Job Ownership Helper

In order to correctly track which GPU is assigned to which job, we use Slurm prolog and epilog scripts to create files in /run/gpustat. These files are named either after the GPU ordinal number (0, 1, ...) or, in the case of Multi-Instance GPU (MIG) instances, after the MIG UUID. Each file contains the space-separated jobid and the UID of the user, for example:

$ cat /run/gpustat/MIG-265a219d-a49f-578a-825d-222c72699c16
45916256 262563

These two scripts can be found in the slurm directory of the Jobstats GitHub repository. For example, slurm/epilog.d/gpustats_helper.sh could be installed as /etc/slurm/epilog.d/gpustats_helper.sh and slurm/prolog.d/gpustats_helper.sh as /etc/slurm/prolog.d/gpustats_helper.sh with these slurm.conf statements:

Prolog=/etc/slurm/prolog.d/*.sh
Epilog=/etc/slurm/epilog.d/*.sh

For efficiency and simplicity, the jobid and UID are collected from files named either after the GPU UUID (e.g. /run/gpustat/GPU-03aa20a1-2e97-d125-629d-fd4e5734553f) or after the GPU ordinal number: /run/gpustat/0 (for GPU 0), /run/gpustat/1 (for GPU 1), and so on. For example:

$ cat /run/gpustat/0
247609 223456

In the above, the first number is the jobid and the second is the UID of the owning user. These files are created by the Slurm prolog.d and epilog.d scripts found in the Jobstats GitHub repository.
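
To make the mechanism concrete, here is a minimal sketch of what such a prolog helper does. This is a simplification, not the actual repository script, which also handles MIG instances and GPU UUIDs (see the full example at the end of this page); the matching epilog helper is assumed to remove the job's files again:

#!/bin/bash
# Simplified prolog sketch: write "<jobid> <uid>" to one file per GPU assigned to the job.
# The real gpustats_helper.sh in the Jobstats repository also handles MIG and UUIDs.
DEST=/run/gpustat
[ -d "$DEST" ] || mkdir -m 755 "$DEST"
for i in ${CUDA_VISIBLE_DEVICES//,/ }; do
  echo "$SLURM_JOB_ID $SLURM_JOB_UID" > "$DEST/$i"
done
# The corresponding epilog helper would remove these files when the job ends (cleanup assumed).
exit 0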

GPU Ownership Caveats

Depending on your Slurm GRES configuration, you may need to change either the prolog/epilog scripts or gres.conf in order to correctly match GPUs to the jobs running on them. The problem stems from the difference between a GPU's ordinal number and its minor number. Depending on gres.conf, the CUDA_VISIBLE_DEVICES seen by the prolog/epilog scripts may contain the (recommended) ordinal numbers, the minor numbers, or something else entirely.

The ordinal number corresponds to the order in which GPUs appear in the nvidia-smi output, where they are sorted by PCI bus ID (the lowest is #0, and so on). This is also the number used in CUDA_VISIBLE_DEVICES or as an argument to the nvidia-smi -i option.

A GPU's minor number corresponds to its /dev/nvidiaX device number. One way to see the minor numbers is to run the following:

# grep 'Device Minor' /proc/driver/nvidia/gpus/*/information
/proc/driver/nvidia/gpus/0000:07:00.0/information:Device Minor:      2
/proc/driver/nvidia/gpus/0000:0b:00.0/information:Device Minor:      3
/proc/driver/nvidia/gpus/0000:48:00.0/information:Device Minor:      0
/proc/driver/nvidia/gpus/0000:4c:00.0/information:Device Minor:      1
/proc/driver/nvidia/gpus/0000:88:00.0/information:Device Minor:      6
/proc/driver/nvidia/gpus/0000:8b:00.0/information:Device Minor:      7
/proc/driver/nvidia/gpus/0000:c8:00.0/information:Device Minor:      4
/proc/driver/nvidia/gpus/0000:cb:00.0/information:Device Minor:      5

The above output is from a GPU node where the ordinal numbers do not match the minor numbers: GPU #0 is /dev/nvidia2, not /dev/nvidia0.
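
To inspect the mapping on a given node, you can list each GPU's ordinal number (index), PCI bus ID, and UUID with nvidia-smi and compare the PCI bus IDs against the Device Minor lines above:

$ nvidia-smi --query-gpu=index,pci.bus_id,uuid --format=csv,noheader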

How can this lead to incorrect statistics? Assume a job requests a single GPU and Slurm allocates it GPU #0, on a node where the ordinal and minor numbers differ as illustrated above.

If your GPU nodes are configured with a gres.conf containing AutoDetect=nvml, Slurm will, through its use of the NVIDIA libraries, hand out the GPU with PCI ID 0000:07:00.0, ordinal number 0, and device /dev/nvidia2. In the /etc/slurm prolog/epilog scripts CUDA_VISIBLE_DEVICES is set to 0, and nvidia-smi -i 0 fetches the correct UUID, so the /run/gpustat/0 and /run/gpustat/GPU-UUID files are written correctly. The user's job is given access (via the devices cgroup) to /dev/nvidia2, so everything matches and the collected statistics reflect the actual utilization.
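
For reference, the autodetected setup can be as simple as a gres.conf containing only the line below (assuming your Slurm build has NVML support):

AutoDetect=nvml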

On the other hand, when gres.conf is configured as follows:

Name=gpu Type=a100 Count=8 File=/dev/nvidia[0-7]

Slurm will assume that GPU #0 is /dev/nvidia0 (GPU #1 is /dev/nvidia1, and so on). In the prolog/epilog scripts Slurm will still set CUDA_VISIBLE_DEVICES to 0, but to nvidia-smi -i 0 that index refers to the GPU backed by /dev/nvidia2, which does not match the device the job is actually given access to (/dev/nvidia0). Hence the data collection is incorrect.

One obvious fix would be to stick with AutoDetect=nvml. Another would be to not use UUIDs for data collection at all, that is, to remove

  if [ "${i:0:3}" != "MIG" ]; then
    UUID="`/usr/bin/nvidia-smi --query-gpu=uuid --format=noheader,csv -i $i`"
    echo $SLURM_JOB_ID $SLURM_JOB_UID > "$DEST/$UUID"
  fi

which will make the helper rely only on the numbers in CUDA_VISIBLE_DEVICES. Note that this requires a recent version of our nvidia_gpu_exporter, which defaults to using minor numbers in /run/gpustat.

You could also write gres.conf by hand so that the device order matches the ordinal order:

Name=gpu Type=a100 Count=8 File=/dev/nvidia2,/dev/nvidia3,/dev/nvidia0,/dev/nvidia1,/dev/nvidia6,/dev/nvidia7,/dev/nvidia4,/dev/nvidia5
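
Rather than working out that ordering by hand, a short sketch like the following can generate the device list from /proc; it assumes the PCI bus IDs sort lexicographically into ordinal order, as they do in the example above:

#!/bin/bash
# Sketch: print the /dev/nvidiaN devices in PCI bus-ID (i.e. ordinal) order,
# suitable for pasting into a hand-written File= line in gres.conf.
list=""
for info in /proc/driver/nvidia/gpus/*/information; do   # the glob expands in PCI bus-ID order
  minor=$(awk -F: '/Device Minor/ {gsub(/[[:space:]]/,"",$2); print $2}' "$info")
  list+="/dev/nvidia${minor},"
done
echo "File=${list%,}"

On the node shown above this prints File=/dev/nvidia2,/dev/nvidia3,/dev/nvidia0,/dev/nvidia1,/dev/nvidia6,/dev/nvidia7,/dev/nvidia4,/dev/nvidia5.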

Finally, assuming the File=/dev/nvidia[0-7] configuration (i.e., minor numbers), you can determine the ordinal number in the script itself, e.g.:

#!/bin/bash
# Record the jobid and UID of the owning user for every GPU (or MIG instance)
# assigned to this job, keyed by both the GPU ordinal number and its UUID.
DEST=/run/gpustat
[ -e "$DEST" ] || mkdir -m 755 "$DEST"
for i in ${GPU_DEVICE_ORDINAL//,/ } ${CUDA_VISIBLE_DEVICES//,/ }; do
  if [ "${i:0:3}" != "MIG" ]; then
    # Here $i is a minor number: map it to the PCI ID, then ask nvidia-smi
    # for the corresponding UUID and ordinal number (index).
    PCI_ID=$(grep -E "Device Minor:[[:space:]]+${i}$" /proc/driver/nvidia/gpus/*/information | cut -d/ -f6)
    UUID=$(/usr/bin/nvidia-smi --query-gpu=uuid --format=noheader,csv -i $PCI_ID)
    INDEX=$(/usr/bin/nvidia-smi --query-gpu=index --format=noheader,csv -i $PCI_ID)
    echo $SLURM_JOB_ID $SLURM_JOB_UID > "$DEST/$INDEX"
    echo $SLURM_JOB_ID $SLURM_JOB_UID > "$DEST/$UUID"
  else
    # MIG instances are already identified by their MIG-UUID.
    echo $SLURM_JOB_ID $SLURM_JOB_UID > "$DEST/$i"
  fi
done
exit 0