GPU Job Statistics
GPU metrics (currently only NVIDIA) are collected by the Jobstats GPU exporter which was based on the exporter by Rohit Agarwal [1]. The main local changes were to add the handling of Multi-Instance GPUs (MIG) and two additional gauge metrics: nvidia_gpu_jobId
and nvidia_gpu_jobUid
. The table below lists all of the collected GPU fields.
Name | Description | Type |
---|---|---|
nvidia_gpu_duty_cycle |
GPU utilization | gauge |
nvidia_gpu_memory_total_bytes |
Total memory of the GPU device in bytes | gauge |
nvidia_gpu_memory_used_bytes |
Memory used by the GPU device in bytes | gauge |
nvidia_gpu_num_devices |
Number of GPU devices gauge | gauge |
nvidia_gpu_power_usage_milliwatts |
Power usage of the GPU device in milliwatts | gauge |
nvidia_gpu_temperature_celsius |
Temperature of the GPU device in Celsius | gauge |
nvidia_gpu_jobId |
JobId number of a job currently using this GPU as reported by Slurm | gauge |
nvidia_gpu_jobUid |
UID number of user running jobs on this GPU | gauge |
Note
Note that the approach described here is not appropriate for clusters that allow for GPU sharing (e.g., sharding).
GPU Job Ownership Helper
In order to correctly track which GPU is assigned to which jobid
, we use Slurm prolog and epilog scripts to create files in /run/gpustat
. These files are named either after GPU ordinal number (0, 1, ...) or, in the case of Multi-Instance GPUs (MIG), MIG-UUID. These files contain the space-separated jobid
and UID
number of the user, for example:
These two scripts can be found in the slurm
directory of the Jobstats GitHub repository. For example, slurm/epilog.d/gpustats_helper.sh
could be installed as /etc/slurm/epilog.d/gpustats_helper.sh
and slurm/prolog.d/gpustats_helper.sh
as /etc/slurm/prolog.d/gpustats_helper.sh
with these slurm.conf
statements:
For efficiency and simplicity, JobId
and jobUid
are collected from files in /run/gpustat/0
(for GPU 0), /run/gpustat/1
(for GPU 1), and so on. For example:
In the above, the first number is the jobid
and the second is the UID
number of the owning user. These are created with Slurm prolog.d
and epilog.d
scripts that can be found in the Jobstats GitHub repository.