Prometheus
Prometheus is a monitoring system and time series database. For setup, follow the directions at prometheus.io. The four Prometheus exporters required by the Jobstats platform were discussed in the previous sections.
Basic Prometheus Configuration
What follows is an example of production configuration used for the Tiger cluster that has both CPU and GPU nodes:
---
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
monitor: master
- job_name: Tiger Nodes
scrape_interval: 30s
scrape_timeout: 30s
file_sd_configs:
- files:
- "/etc/prometheus/local_files_sd_config.d/tigernodes.json"
metric_relabel_configs:
- source_labels:
- __name__
regex: "^go_.*"
action: drop
- job_name: TigerGPU Nodes
scrape_interval: 30s
scrape_timeout: 30s
file_sd_configs:
- files:
- "/etc/prometheus/local_files_sd_config.d/tigergpus.json"
metric_relabel_configs:
- source_labels:
- __name__
regex: "^go_.*"
action: drop
The tigernode.json
file looks like:
[
{
"labels": {
"cluster": "tiger"
},
"targets": [
"tiger-h19c1n10:9100",
"tiger-h19c1n10:9306",
...
]
}
]
Both node_exporter
(port 9100) and cgroup_exporter
(port 9306) are listed for all of the nodes in tigernode.json
. The file tigergpus.json
looks very similar except that it collects data from nvidia_gpu_prometheus_exporter
on port 9445. Note the additional cluster
label.