https://prometheus.io/

Architecture

Prometheus has three main roles:

Collecting and storing metrics data
Shaping data with queries
Alerting

Trying it out

Downloading and running the binaries

Prometheus

$ wget https://github.com/prometheus/prometheus/releases/download/v2.6.0/prometheus-2.6.0.linux-amd64.tar.gz
$ tar xvzf prometheus-2.6.0.linux-amd64.tar.gz
$ cd prometheus-2.6.0.linux-amd64/

$ ./prometheus --help
usage: prometheus [<flags>]

The Prometheus monitoring server

Flags:
  -h, --help                     Show context-sensitive help (also try --help-long and --help-man).
      --version                  Show application version.
      --config.file="prometheus.yml"  
                                 Prometheus configuration file path.
      ...

$ ./prometheus --config.file=prometheus.yml
...
level=info ts=2018-12-23T12:18:52.893777989Z caller=web.go:429 component=web msg="Start listening for connections" address=0.0.0.0:9090

The Web UI is available at http://hostname:9090.

[Screenshots: Prometheus Web UI]

Node Exporter

$ wget https://github.com/prometheus/node_exporter/releases/download/v0.17.0/node_exporter-0.17.0.linux-amd64.tar.gz
$ tar xvzf node_exporter-0.17.0.linux-amd64.tar.gz
$ cd node_exporter-0.17.0.linux-amd64/
$ ./node_exporter
INFO[0000] Starting node_exporter (version=0.17.0, branch=HEAD, revision=f6f6194a436b9a63d0439abc585c76b19a206b21)  source="node_exporter.go:82"
INFO[0000] Build context (go=go1.11.2, user=root@322511e06ced, date=20181130-15:51:33)  source="node_exporter.go:83"
INFO[0000] Enabled collectors:                           source="node_exporter.go:90"
INFO[0000]  - arp                                        source="node_exporter.go:97"
INFO[0000]  - bcache                                     source="node_exporter.go:97"
INFO[0000]  - bonding                                    source="node_exporter.go:97"
INFO[0000]  - conntrack                                  source="node_exporter.go:97"
INFO[0000]  - cpu                                        source="node_exporter.go:97"
INFO[0000]  - diskstats                                  source="node_exporter.go:97"
INFO[0000]  - edac                                       source="node_exporter.go:97"
INFO[0000]  - entropy                                    source="node_exporter.go:97"
INFO[0000]  - filefd                                     source="node_exporter.go:97"
INFO[0000]  - filesystem                                 source="node_exporter.go:97"
INFO[0000]  - hwmon                                      source="node_exporter.go:97"
INFO[0000]  - infiniband                                 source="node_exporter.go:97"
INFO[0000]  - ipvs                                       source="node_exporter.go:97"
INFO[0000]  - loadavg                                    source="node_exporter.go:97"
INFO[0000]  - mdadm                                      source="node_exporter.go:97"
INFO[0000]  - meminfo                                    source="node_exporter.go:97"
INFO[0000]  - netclass                                   source="node_exporter.go:97"
INFO[0000]  - netdev                                     source="node_exporter.go:97"
INFO[0000]  - netstat                                    source="node_exporter.go:97"
INFO[0000]  - nfs                                        source="node_exporter.go:97"
INFO[0000]  - nfsd                                       source="node_exporter.go:97"
INFO[0000]  - sockstat                                   source="node_exporter.go:97"
INFO[0000]  - stat                                       source="node_exporter.go:97"
INFO[0000]  - textfile                                   source="node_exporter.go:97"
INFO[0000]  - time                                       source="node_exporter.go:97"
INFO[0000]  - timex                                      source="node_exporter.go:97"
INFO[0000]  - uname                                      source="node_exporter.go:97"
INFO[0000]  - vmstat                                     source="node_exporter.go:97"
INFO[0000]  - xfs                                        source="node_exporter.go:97"
INFO[0000]  - zfs                                        source="node_exporter.go:97"
INFO[0000] Listening on :9100                            source="node_exporter.go:111"

The list of metrics is available at http://hostname:9100/metrics.

# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 3.8055e-05
go_gc_duration_seconds{quantile="0.25"} 6.3551e-05
go_gc_duration_seconds{quantile="0.5"} 7.7478e-05
...
# HELP node_cpu_guest_seconds_total Seconds the cpus spent in guests (VMs) for each mode.
# TYPE node_cpu_guest_seconds_total counter
node_cpu_guest_seconds_total{cpu="0",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="0",mode="user"} 0
node_cpu_guest_seconds_total{cpu="1",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="1",mode="user"} 0
...
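
As a rough illustration of how this text format is structured, the following is a minimal parsing sketch. The names `parse_metrics`, `SAMPLE_RE` and `LABEL_RE` are made up for this note, and the parser is deliberately simplified: it assumes one sample per line with no trailing timestamp, and it leaves escape sequences in label values untouched.

```python
import re

# metric_name{label="value",...} value   (the label part is optional)
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse exposition-format text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser does not understand
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


example = """\
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 3.8055e-05
node_cpu_guest_seconds_total{cpu="0",mode="nice"} 0
"""

for name, labels, value in parse_metrics(example):
    print(name, labels, value)
```

Each sample line is a metric name, an optional set of labels in braces, and a numeric value; the `# HELP` and `# TYPE` lines carry the description and the metric type.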

Monitoring the exporter from Prometheus

Configure prometheus.yml as follows.

scrape_configs:
  ...
  - job_name: 'node'
    # Path from which Prometheus fetches the metrics data. Defaults to /metrics
    # metrics_path: /path/to/metrics
    static_configs:
    - targets: ['node001.hkawabata.jp:9100']

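Run a query in the Prometheus Web UI to see the result as a graph.

e.g. average receive rate over the last minute [bytes/s]: rate(node_network_receive_bytes_total[1m])

[Screenshot: graph of the query result in the Prometheus Web UI]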

Visualizing with Grafana

Add the data source:

[Screenshot: adding the Prometheus data source in Grafana]

Create a graph:

[Screenshot: creating a graph in Grafana]

In the "Legend Format" field, Prometheus labels can be embedded, as in {{instance}} ({{device}}).
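
To make the substitution concrete, here is a small sketch that mimics the idea in Python. It is not Grafana's implementation: `render_legend` is a hypothetical helper, the `device` value `eth0` is made up, and replacing unknown labels with an empty string is an assumption of this sketch.

```python
import re


def render_legend(template, labels):
    """Replace each {{name}} placeholder with the matching label value."""
    def substitute(match):
        # Unknown labels become an empty string in this sketch (an assumption).
        return labels.get(match.group(1), "")
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


# Label set of one time series (the device name is a made-up example).
labels = {"instance": "node001.hkawabata.jp:9100", "device": "eth0"}
print(render_legend("{{instance}} ({{device}})", labels))
```

Each series returned by the query carries its own label set, so one template yields a distinct legend entry per series.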

PromQL

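Prometheus has its own query language for shaping data.

Data types

Data type	Description	Example
Instant Vector	A set of time series, each with a single sample at the evaluation time	`node_filesystem_files`
Range Vector	A set of time series, each with the list of samples going back a specified duration from the evaluation time	`node_memory_MemFree_bytes[1m]`
Scalar	A single numeric value
String	A single string value (unused as of v2.6)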

Illustration of an instant vector (one series, evaluated at each minute):

{
  "14:56": 0.98,
  "14:57": 0.98,
  "14:58": 0.99,
  "14:59": 0.50,
  "15:00": 0.52
}

Illustration of a range vector (the same series, evaluated at each minute with a 2-minute range):

{
  "14:58": {
    "14:56": 0.98,
    "14:57": 0.98,
    "14:58": 0.99
  },
  "14:59": {
    "14:57": 0.98,
    "14:58": 0.99,
    "14:59": 0.50
  },
  "15:00": {
    "14:58": 0.99,
    "14:59": 0.50,
    "15:00": 0.52
  }
}
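
The two illustrations can be reproduced with a short sketch. The helper names (`minutes`, `instant_value`, `range_values`) are hypothetical, and the window is treated as inclusive on both ends only so that the output matches the illustration; the real selector semantics (lookback, boundary handling) are more involved.

```python
def minutes(hhmm):
    """Convert "HH:MM" to minutes since midnight."""
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)


def instant_value(samples, t):
    """Most recent sample at or before t, or None if there is none."""
    past = [ts for ts in samples if ts <= t]
    return samples[max(past)] if past else None


def range_values(samples, t, window):
    """Samples with t - window <= timestamp <= t, keyed by timestamp."""
    return {ts: samples[ts] for ts in sorted(samples) if t - window <= ts <= t}


# One time series, keyed by sample time (values taken from the illustration).
raw = {"14:56": 0.98, "14:57": 0.98, "14:58": 0.99, "14:59": 0.50, "15:00": 0.52}
samples = {minutes(k): v for k, v in raw.items()}

for label in ("14:58", "14:59", "15:00"):
    t = minutes(label)
    print(label, instant_value(samples, t), range_values(samples, t, window=2))
```

For each evaluation time, the instant selector yields one value while the range selector yields the list of samples inside the window, which is what functions such as `rate` consume.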

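Operators and functions

Operator / function	Description	Example query	Notes
`+`,`-`,`*`,`/`,`%`	Arithmetic and modulo	`( node_memory_MemTotal_bytes - node_memory_MemFree_bytes ) / node_memory_MemTotal_bytes * 100` (memory usage [%])
`rate`	Per-second increase, derived from the first and last samples in the specified range	`rate(node_cpu_seconds_total{mode!="idle"}[1m]) * 100` (CPU usage [%])	`mode!="idle"` excludes the idle state
`delta`	Difference between the last and first values in the range (last - first)	`delta(node_memory_MemFree_bytes[1h])` (change in free memory)
`changes`	Number of times the value changed within the range
`max`,`min`,`sum`,`avg`	Aggregation across time series (maximum, minimum, sum, average)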
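
The arithmetic behind `rate`, `delta` and `changes` can be sketched as below. These are simplified stand-ins written for this note (the `simple_*` names are hypothetical), not Prometheus's implementation: in particular, the real `rate` also extrapolates towards the window boundaries, which is omitted here. Timestamps are assumed to be in seconds and strictly increasing.

```python
def simple_rate(points):
    """Per-second increase of a counter between the first and last point.

    points: list of (timestamp_seconds, value), sorted by time.
    A value lower than its predecessor is treated as a counter reset.
    """
    if len(points) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(points, points[1:]):
        # After a reset the counter restarts from 0, so the whole value counts.
        increase += cur - prev if cur >= prev else cur
    elapsed = points[-1][0] - points[0][0]
    return increase / elapsed


def simple_delta(points):
    """Last value minus first value (meant for gauges)."""
    if len(points) < 2:
        return None
    return points[-1][1] - points[0][1]


def simple_changes(points):
    """Number of times the value differs from the previous sample."""
    return sum(1 for (_, a), (_, b) in zip(points, points[1:]) if a != b)


# A counter scraped every 15 s that resets between t=30 and t=45.
counter = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 10.0), (60, 40.0)]
print(simple_rate(counter))

# A gauge that stays flat once and otherwise moves.
gauge = [(0, 5.0), (15, 5.0), (30, 7.0), (45, 6.0)]
print(simple_delta(gauge), simple_changes(gauge))
```

The reset handling is why `rate` suits counters while `delta` suits gauges: applying `delta` to the counter above would report a decrease caused only by the restart.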

Exporter

node_exporter

jmx_exporter

jmx_prometheus_httpserver

This runs a dedicated exporter process.

$ git clone git@github.com:prometheus/jmx_exporter.git
$ cd jmx_exporter
$ mvn package
...

The required files are:

The build artifact jmx_prometheus_httpserver/target/jmx_prometheus_httpserver-${version}-jar-with-dependencies.jar
The configuration file example_configs/httpserver_sample_config.yml

The hostPort setting in httpserver_sample_config.yml defaults to localhost:5555 (the JMX port of the jmx_prometheus_httpserver process itself).

Change it to point at the process to be monitored.

Example:

---
hostPort: localhost:1099
username: 
password: 

rules:
- pattern: ".*"

Starting jmx_prometheus_httpserver:

$ version=$(sed -n -e 's#.*<version>\(.*-SNAPSHOT\)</version>#\1#p' pom.xml)
$ jar_file=jmx_prometheus_httpserver/target/jmx_prometheus_httpserver-${version}-jar-with-dependencies.jar
$ exporter_jmx_port=5555     # JMX port of the jmx_prometheus_httpserver process itself
$ exporter_port=5556         # port that listens for requests from Prometheus
$ config_file=example_configs/httpserver_sample_config.yml

$ java -Dcom.sun.management.jmxremote.ssl=false \
    -Dcom.sun.management.jmxremote.authenticate=false \
    -Dcom.sun.management.jmxremote.port=${exporter_jmx_port} \
    -jar ${jar_file} \
    ${exporter_port} \
    ${config_file}

Below is the view in jconsole with the monitored process (Jetty, with JMX enabled on port 1099) and jmx_prometheus_httpserver running on the same server.

[Screenshot: jconsole]

Append the following to prometheus.yml and restart Prometheus.

scrape_configs:
  ...
  - job_name: 'jmx_jetty'
    static_configs:
    - targets: ['localhost:5556']

Visualizing with Grafana:

[Screenshot: Grafana graph of the JMX metrics]

jmx_prometheus_javaagent

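Writing a custom exporter with the Java client (miscellaneous notes, work in progress)

Computing quantiles

For quantiles such as the 99th percentile, a Summary computes them inside the monitored application, which costs more there but gives accurate values, whereas a Histogram leaves the calculation to the Prometheus server, which is cheap for the application but sacrifices accuracy.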
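
The accuracy trade-off can be seen in a small sketch. `bucket_quantile` below follows the general idea of estimating a quantile from cumulative bucket counts by linear interpolation inside the matching bucket; it is a simplified stand-in, not the server's code. `exact_quantile` (nearest rank over the raw values) stands in for a client-side calculation. All function names, observations and bucket bounds are made up for illustration.

```python
import math


def to_buckets(observations, bounds):
    """Cumulative counts per upper bound, as in a histogram's le buckets."""
    return [(b, sum(1 for x in observations if x <= b)) for b in bounds]


def bucket_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    buckets must be sorted by upper bound and end with +Inf.
    The lowest bucket is assumed to start at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # cannot interpolate into the +Inf bucket
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count


def exact_quantile(q, observations):
    """Nearest-rank quantile over the raw observations."""
    ordered = sorted(observations)
    return ordered[max(0, math.ceil(q * len(ordered)) - 1)]


latencies = [0.05, 0.12, 0.3, 0.31, 0.35, 0.4, 0.45, 0.8, 0.9, 2.5]
buckets = to_buckets(latencies, [0.1, 0.5, 1.0, float("inf")])

for q in (0.5, 0.9, 0.99):
    print(q, exact_quantile(q, latencies), bucket_quantile(q, buckets))
```

The bucket-based estimate can only be as precise as the bucket layout allows: values inside a bucket are assumed to be evenly spread, and anything that falls into the +Inf bucket is capped at the highest finite bound.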

The following explanation is useful when using Summary.

maxAgeSeconds(long): Set the duration of the time window is, i.e. how long observations are kept before they are discarded. Default is 10 minutes.

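This retention period can be set through the Summary builder. Once it has elapsed with no new observations, the quantiles are reported as NaN, as shown below.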

# HELP requests_latency_seconds_summary Request latency in seconds (Summary).
# TYPE requests_latency_seconds_summary summary
requests_latency_seconds_summary{quantile="0.1",} NaN
requests_latency_seconds_summary{quantile="0.3",} NaN
requests_latency_seconds_summary{quantile="0.5",} NaN
requests_latency_seconds_summary{quantile="0.7",} NaN
requests_latency_seconds_summary{quantile="0.9",} NaN
requests_latency_seconds_summary_count 125.0
requests_latency_seconds_summary_sum 61.848624543999996

ageBuckets(int): Set the number of buckets used to implement the sliding time window. If your time window is 10 minutes, and you have ageBuckets=5, buckets will be switched every 2 minutes. The value is a trade-off between resources (memory and cpu for maintaining the bucket) and how smooth the time window is moved. Default value is 5.
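
A sliding window of this kind can be sketched as a ring of buckets. This is a simplified model written from the description above, not the client library's code: the class name `SlidingWindow` is hypothetical, raw values are stored instead of a streaming quantile estimator, and the injectable `clock` argument exists only to make the behaviour easy to try out.

```python
import math
import time


class SlidingWindow:
    """Keeps observations for at most max_age_seconds using age_buckets buckets."""

    def __init__(self, max_age_seconds=600, age_buckets=5, clock=time.monotonic):
        self.rotation_interval = max_age_seconds / age_buckets
        self.buckets = [[] for _ in range(age_buckets)]
        self.current = 0  # index of the oldest bucket, the one that is read
        self.clock = clock
        self.last_rotation = clock()

    def _rotate(self):
        now = self.clock()
        while now - self.last_rotation >= self.rotation_interval:
            self.buckets[self.current] = []  # drop the oldest observations
            self.current = (self.current + 1) % len(self.buckets)
            self.last_rotation += self.rotation_interval

    def observe(self, value):
        self._rotate()
        for bucket in self.buckets:  # every bucket receives every observation
            bucket.append(value)

    def quantile(self, q):
        self._rotate()
        values = sorted(self.buckets[self.current])
        if not values:
            return float("nan")  # nothing observed within the window
        return values[max(0, math.ceil(q * len(values)) - 1)]


fake_now = [0.0]
window = SlidingWindow(max_age_seconds=600, age_buckets=5, clock=lambda: fake_now[0])
for latency in (1.0, 2.0, 3.0):
    window.observe(latency)

print(window.quantile(0.5))   # observations are still inside the window
fake_now[0] = 601.0
print(window.quantile(0.5))   # the window has passed with no new observations
```

With the defaults, a bucket is switched every 2 minutes, so the values read cover somewhere between the last 8 and 10 minutes. More buckets make the window move more smoothly at the cost of memory and CPU, and a window with no recent observations yields NaN while the count and sum keep their totals.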

Other notes

(Note) References consulted:

https://github.com/prometheus/client_java
https://povilasv.me/prometheus-tracking-request-duration/
https://prometheus.io/docs/practices/histograms/
http://sylnsr.blogspot.com/2015/12/using-prometheus-with-java-in-jersey.html

Sample code:

https://github.com/hkawabata/WebApp/blob/master/jersey-practice/src/main/java/jp/hkawabata/webapp/sample/jersey/prometheus/MonitoredResource.java