2021-04-27

DevOps

Prometheus Study Notes


References

WeRead (微信读书), "Monitoring with Prometheus" (《prometheus 监控实战》): https://weread.qq.com/web/reader/4ca32c50718f639f4ca492bk68d3221025468d30a95982e

Important: the Chinese Prometheus documentation by 青牛踏雪: https://www.prometheus.wang/

Prometheus

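1.1 What monitoring is

Technical monitoring: aimed at developers

Business monitoring: aimed at applications and users

1.3 Monitoring mechanisms

Form: probing and introspection
Method: pull and push
Content: metrics and logs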

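1.4 Metrics

Prometheus is metric-centric.

What a metric is: an observation point plus a granularity, which yields a time series
Types: gauge, counter, histogram
Aggregation: a single metric looks at one individual; several metrics combine the data
Summaries: mean, median, percentile, variance

Observation point = "my" "height"
Time series = how my height changes along the time axis

A metric is a "measurable characteristic", for example height or weight.

The label set identifies the subject being measured. (Measuring the person named xx whose ID number is xx pins down exactly one subject.)

xx's height as it changes over the years is a time series.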

Time series:

^
│ . . . . . . . . . . . . . . . . . . . node_cpu{cpu="cpu0",mode="idle"}
│ . . . . . . . . . . . . . . . . . . . node_cpu{cpu="cpu0",mode="system"}
│ . . . . . . . . . . . . . . . . . . node_load1{}
│ . . . . . . . . . . . . . . . . . .
v
<------------------ time ---------------->

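Each point in a time series is called a sample. A sample has three parts:

Metric: the metric name plus the label set that describes the sample's characteristics;
Timestamp: a timestamp with millisecond precision;
Value: a float64 value representing the sample.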

<--------------- metric ---------------------><-timestamp -><-value->
http_request_total{status="200", method="GET"}@1434417560938 => 94355
http_request_total{status="200", method="GET"}@1434417561287 => 94334
http_request_total{status="404", method="GET"}@1434417560938 => 38473
http_request_total{status="404", method="GET"}@1434417561287 => 38544
http_request_total{status="200", method="POST"}@1434417560938 => 4748
http_request_total{status="200", method="POST"}@1434417561287 => 4785
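To make the three parts concrete, here is a minimal Python sketch that splits one line of the notation above into metric name, label set, timestamp, and value. The `name{labels}@timestamp => value` layout is only the illustrative notation used in these notes, not a format Prometheus itself emits, and `Sample`, `parse_sample`, and `series_key` are names made up for this example.

```python
import re
from typing import NamedTuple


class Sample(NamedTuple):
    name: str          # metric name
    labels: dict       # label set
    timestamp_ms: int  # millisecond timestamp
    value: float       # float64 sample value


# Matches the illustrative notation: name{labels}@timestamp => value
_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}'
    r'@(?P<ts>\d+)\s*=>\s*(?P<value>\S+)$'
)
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line: str) -> Sample:
    """Split one notation line into its metric, timestamp, and value parts."""
    m = _LINE.match(line.strip())
    if m is None:
        raise ValueError(f"not in name{{labels}}@timestamp => value form: {line!r}")
    labels = dict(_LABEL.findall(m.group("labels")))
    return Sample(m.group("name"), labels, int(m.group("ts")), float(m.group("value")))


def series_key(sample: Sample):
    """Metric name plus sorted labels: samples with the same key belong to one series."""
    return (sample.name, tuple(sorted(sample.labels.items())))
```

Two lines that differ only in timestamp and value produce the same `series_key`, which is what makes them points of the same time series.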

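1.5 Methodology

Host-focused: the USE method measures utilization, saturation, and errors for every resource

Application-focused: Google's four golden signals are latency, traffic, errors, and saturation

1.6 Alerting, notification, and visualization

Alerts and notifications are read by people, so they should carry the key information.

Visualization makes the information easier to organize.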

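2. Prometheus

2.2 Architecture

Metric collection

An endpoint is a source of information that can be scraped
A target is the information needed to perform a scrape
A group of targets is a job
Broadly speaking, any program that can provide monitoring samples to Prometheus can be called an exporter, and one running instance of an exporter is called a target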

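Service discovery

A list supplied by the user
Configuration management tools, through the Prometheus configuration file
Automatic discovery through Consul

Aggregation and alerting

Querying data

PromQL

Autonomy

Redundancy and high availability

Visualization

Grafana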

2.3 Data model

Metric name

Labels

Time series values

Notation

total_web_site{site="testApp",instance="webserver",job="web"}

2.6 Reference links

·Prometheus website: https://prometheus.io/

·Prometheus documentation: https://prometheus.io/docs/

·Prometheus GitHub home page: https://github.com/prometheus/

·Prometheus source on GitHub: https://github.com/prometheus/prometheus

·Prometheus reference video, "Prometheus and time series design at scale": https://www.youtube.com/watch?v=gNmWzkGViAY

·Grafana website: https://grafana.com/

3. Hands-on

3.1 Installation

Local macOS: brew install prometheus

Docker: docker pull prom/prometheus

docker run -p 9090:9090 --name prometheus -d -v $PWD/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus

prometheus.yml:

global:
  scrape_interval:     15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

docker run -p 9090:9090 -d --name prom -v $PWD/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus

3.2 Getting metrics

localhost:9090/metrics

localhost:9090/graph (PromQL)

sum(rate(promhttp_metric_handler_requests_total[5m])) by (job)

Grouped by the job label, this returns the per-second average rate of increase of HTTP requests over a 5-minute window.
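The arithmetic behind that query can be sketched in Python. This is a deliberately simplified model, not Prometheus's implementation: the real `rate()` also handles counter resets and extrapolates to the window edges, which this sketch ignores. The function names and the sample data are made up for illustration.

```python
from collections import defaultdict


def simple_rate(samples):
    """Per-second increase between the first and last sample of a window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Simplification: no counter-reset handling and no extrapolation.
    """
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def sum_rate_by_job(windows):
    """windows maps (job, instance) to that series' samples inside the window."""
    totals = defaultdict(float)
    for (job, _instance), samples in windows.items():
        r = simple_rate(samples)
        if r is not None:
            totals[job] += r  # 'sum ... by (job)' collapses the instance label
    return dict(totals)


# Hypothetical data: two instances of one job over a 5-minute (300 s) window.
windows = {
    ("prometheus", "a:9090"): [(0, 100.0), (300, 160.0)],  # +60 over 300 s
    ("prometheus", "b:9090"): [(0, 10.0), (300, 40.0)],    # +30 over 300 s
}
result = sum_rate_by_job(windows)
```

Each series contributes its own per-second rate, and the sum keeps only the `job` label, so `result` has one entry per job.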

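4. Monitoring hosts and containers

4.1 node exporter

Build a node_exporter image that downloads, unpacks, and runs node_exporter on its default port 9100, and exposes that port.

(The image also installs wget and ping.)

Dockerfile: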

FROM ubuntu:18.04
MAINTAINER James Turnbull "james@example.com"
RUN apt-get -qq update && apt-get -qq install wget
RUN apt-get -qq install iputils-ping
WORKDIR /opt
RUN wget https://github.com/prometheus/node_exporter/releases/download/v0.16.0/node_exporter-0.16.0.linux-amd64.tar.gz
RUN tar -xzf node_exporter-*
RUN cp node_exporter-*/node_exporter /usr/local/bin/

EXPOSE 9100

CMD ["node_exporter"]

docker build -t my/node_exporter .   (builds the image from the Dockerfile)

Start two node_exporter containers

docker run -p 9100:9100 --name node_exporter1 -d my/node_exporter

Edit prometheus.yml to add the two node targets

docker stop prom

docker rm prom

Prometheus scraping node_exporter

global:
  scrape_interval:     15s # interval between scrapes
  evaluation_interval: 15s # interval between rule evaluations

alerting:
  alertmanagers: 
  - static_configs:
    - targets:
      - alertmanager:9093

rule_files:
  - "rules/node_rules.yml"

scrape_configs:
- job_name: 'prometheus' 
  static_configs:
    - targets: ['localhost:9090']

- job_name: 'node' # two containers started from the node_exporter image built above
  static_configs:   # the IPs were read from inside the containers
    - targets: ['172.17.0.3:9100', '172.17.0.4:9100']

- job_name: 'docker' # scrapes cAdvisor, which runs as a container
  static_configs:
    - targets: ['172.17.0.5:8080']
  metric_relabel_configs:
  - source_labels: [id]
    regex: '/docker/([a-z0-9]+)'
    replacement: '$1'
    target_label: container_id
  - source_labels: [__name__]
    separator: ','
    regex: '(container_tasks_state|container_memory_failures_total)'
    action: drop

docker run -p 9090:9090 -d --name prom --link node_exporter1:node1 --link node_exporter2:node2 -v $PWD/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus

4.2 cAdvisor

cAdvisor runs as a Docker container. It collects information about Docker itself and about all Docker containers and exposes it to Prometheus.

Running cAdvisor

docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:rw \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --volume=/dev/disk/:/dev/disk:ro \
  --publish=8080:8080 \
  --detach=true \
  --name=cadvisor \
  google/cadvisor:latest

localhost:8080/containers

localhost:8080/docker: view all running Docker images and containers

localhost:8080/metrics: get all metrics

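4.3 Scraping data

The scrape lifecycle

In every scrape cycle (defined by scrape_interval), Prometheus checks the jobs to run, and each job produces a list of targets. (The process by which a job produces its target list is called service discovery.)

Service discovery returns a list of targets, each with a set of metadata.

Service discovery also generates some default labels from the target's configuration, for example __metrics_path__. Labels that start with a double underscore are not shown in the web UI. Some of them can be overridden, for example by setting the metrics_path key in the scrape configuration.

Lifecycle:
service discovery -> configuration -> relabeling (relabel_configs) -> scrape -> relabeling (metric_relabel_configs)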

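4.4 Labels

Labels provide the dimensions of a time series and define the target. Together with the metric name, labels identify a time series.

4.4.1 Label categories

1. Topological labels 2. Schematic labels

4.4.2 Relabeling

Purpose of relabeling: drop, add, or modify labels

Two phases:

1. Relabel the targets that come from service discovery -> apply metadata to labels

2. After the scrape and before the metrics are stored: decide what the metrics look like and which ones to drop

Use relabel_configs before the scrape and metric_relabel_configs after it.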

4.4.3 Example

# prometheus.yml: metric_relabel_configs example
- job_name: 'docker' # scrapes cAdvisor, which runs as a container
  static_configs:
    - targets: ['172.17.0.5:8080']
  metric_relabel_configs:

  # This rule reads the id label and captures the part after /docker/.
  # Before: id="/docker/abcd..."
  # After: a new label container_id="abcd..." is added
  - source_labels: [id]
    regex: '/docker/([a-z0-9]+)'
    replacement: '$1' # replace is the default action, so unlike drop it needs no explicit action
    target_label: container_id

  # source_labels names the labels to operate on.
  # __name__ is the reserved label that holds the metric name.
  # regex selects the metrics whose __name__ matches, and drop then discards them.
  # separator is the string that joins the values when source_labels lists several labels.
  - source_labels: [__name__]
    separator: ','
    regex: '(container_tasks_state|container_memory_failures_total)'
    action: drop # drops the matching metrics

  - regex: 'kernelVersion'
    action: labeldrop # removes every label whose name matches 'kernelVersion'
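To see what the three rules do to one scraped series, here is a small Python model of the `replace`, `drop`, and `labeldrop` actions. It is a sketch written for these notes, not Prometheus code: the function names are invented, only the behaviour described above is modelled, and the replacement uses Python's `\1` where Prometheus writes `$1`. The one detail taken from Prometheus itself is that relabel regexes are anchored at both ends and that the default separator is `;`.

```python
import re


def relabel_replace(labels, source_labels, regex, target_label,
                    replacement=r"\1", separator=";"):
    """Model of the default 'replace' action: set target_label if the regex matches."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    m = re.fullmatch(regex, value)  # anchored match, like Prometheus relabeling
    if m is None:
        return labels               # no match: the label set is unchanged
    out = dict(labels)
    out[target_label] = m.expand(replacement)
    return out


def should_drop(labels, source_labels, regex, separator=";"):
    """Model of 'drop': True when the joined source label values match the regex."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    return re.fullmatch(regex, value) is not None


def label_drop(labels, regex):
    """Model of 'labeldrop': remove every label whose name matches the regex."""
    return {k: v for k, v in labels.items() if re.fullmatch(regex, k) is None}


# Hypothetical series as it might arrive from cAdvisor.
series = {
    "__name__": "container_cpu_usage_seconds_total",
    "id": "/docker/abcd1234",
    "kernelVersion": "4.15",
}
series = relabel_replace(series, ["id"], r"/docker/([a-z0-9]+)", "container_id")
series = label_drop(series, r"kernelVersion")
```

After these two steps the series carries `container_id` next to the original `id`, and the `kernelVersion` label is gone. A series named `container_tasks_state` would make `should_drop` return True and never be stored.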

node_exporter and cAdvisor metrics

cpu mode="idle": the CPU's idle state.

4.6 Persisting queries

prometheus.yml can specify where the recording rule files live.

The complete prometheus.yml

global:
  scrape_interval:     15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager:9093

rule_files:
  - "rules/node_rules.yml"

scrape_configs:
- job_name: 'prometheus'
  static_configs:
    - targets: ['localhost:9090']

- job_name: 'node'
  static_configs:
    - targets: ['138.197.26.39:9100', '138.197.30.147:9100', '138.197.30.163:9100']

- job_name: 'docker'
  static_configs:
    - targets: ['138.197.26.39:8080', '138.197.30.147:8080', '138.197.30.163:8080']
  metric_relabel_configs:
  - source_labels: [id]
    regex: '/docker/([a-z0-9]+)'
    replacement: '$1'
    target_label: container_id
  - source_labels: [__name__]
    separator: ','
    regex: '(container_tasks_state|container_memory_failures_total)'
    action: drop

node_rules.yml:

groups:
- name: node_rules
  rules:
  - record: instance:node_cpu:avg_rate5m
    expr: 100 - avg (irate(node_cpu_seconds_total{job="node",mode="idle"}[5m])) by (instance) * 100
  - record: instance:node_cpus:count
    expr: count by (instance)(node_cpu_seconds_total{mode="idle"})
  - record: instance:node_cpu_saturation_load1
    expr: node_load1 > on (instance) 2 * count by (instance)(node_cpu_seconds_total{mode="idle"})
  - record: instance:node_memory_usage:percentage
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Cached_bytes + node_memory_Buffers_bytes)) / node_memory_MemTotal_bytes * 100
  - record: instance:node_memory_swap_io_bytes:sum_rate
    expr: 1024 * sum by (instance) (
                 (rate(node_vmstat_pgpgin[1m])
                 + rate(node_vmstat_pgpgout[1m]))
          )
  - record: instance:root:node_filesystem_usage:percentage
    expr: (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100

This produces a new time series, instance:node_cpu:avg_rate5m, which corresponds to the earlier CPU usage expression.
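The arithmetic inside the CPU and memory rules is easy to check by hand. The Python sketch below repeats the two formulas on plain numbers; the function names and inputs are invented for illustration, and the idle rates stand in for what `irate(node_cpu_seconds_total{mode="idle"}[5m])` would return per CPU.

```python
def cpu_usage_percent(idle_rates):
    """100 - avg(idle rate) * 100 for one instance.

    idle_rates: per-CPU fraction of each second spent idle (0.0 to 1.0).
    """
    return 100 - (sum(idle_rates) / len(idle_rates)) * 100


def memory_usage_percent(total, free, cached, buffers):
    """(total - (free + cached + buffers)) / total * 100, all values in bytes."""
    return (total - (free + cached + buffers)) / total * 100


# Hypothetical host with two CPUs, one 75% idle and one 25% idle.
cpu = cpu_usage_percent([0.75, 0.25])

# Hypothetical host with 8000 bytes of memory, to keep the numbers readable.
mem = memory_usage_percent(total=8000, free=1000, cached=2000, buffers=1000)
```

Recording the result under a name such as instance:node_cpu:avg_rate5m means dashboards and alerts read the precomputed series without re-evaluating the expression on every query.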

4.7 Grafana visualization

brew install grafana
brew services start grafana

·Getting started with Grafana: http://docs.grafana.org/guides/getting_started/

·Grafana tutorials and screencasts: http://docs.grafana.org/tutorials/screencasts/

·The Grafana section of the Prometheus documentation: https://prometheus.io/docs/visualization/grafana/

·Prebuilt Grafana dashboards: https://grafana.com/dashboards

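5. Service discovery

Motivation: listing targets by hand in the static_configs block of prometheus.yml is not flexible enough.

5.1 File-based service discovery

With file-based service discovery, Prometheus uses the targets specified in files. These files are usually generated by another system, such as a configuration management system like Puppet, Ansible, or Chef, or by querying another source such as a CMDB. Running a script or a query periodically can (re)generate the files. Prometheus reloads the targets from these files on the specified schedule.

Edit prometheus.yml so that the jobs point at the service discovery files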

- job_name: node
  file_sd_configs:
    - files:
      - targets/nodes/*.json
      refresh_interval: 5m

- job_name: docker
  file_sd_configs:
    - files:
      - targets/docker/*.json
      refresh_interval: 5m
  metric_relabel_configs:
  - source_labels: [id]
    regex: '/docker/([a-z0-9]+)'
    replacement: '$1'
    target_label: container_id
  - source_labels: [__name__]
    separator: ','
    regex: '(container_tasks_state|container_memory_failures_total)'
    action: drop

targets/nodes/node.json (the targets directory sits under the Prometheus directory)

[{
	"targets": ["138.197.26.39:9100", "138.197.30.147:9100", "138.197.30.163:9100"]
}]

/targets/docker/docker.json

[{
	"targets": ["138.197.26.39:8080", "138.197.30.147:8080", "138.197.30.163:8080"]
}]
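Since these files are meant to be generated by another system, here is a minimal Python sketch of such a generator. It assumes the target list is already known (a real script might query a CMDB), and `write_targets` is a name invented for this example. Writing to a temporary file and renaming it is a design choice so that Prometheus never reads a half-written file. The temporary file ends in `.tmp`, so the `*.json` glob does not pick it up.

```python
import json
import os
import tempfile


def write_targets(path, targets, labels=None):
    """Write one target group in the file_sd JSON layout, replacing the file atomically."""
    group = {"targets": list(targets)}
    if labels:
        group["labels"] = dict(labels)  # optional labels attached to every target
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump([group], f, indent=2)
    os.replace(tmp, path)  # atomic rename within the same directory
```

A cron job could call, for example, `write_targets("targets/nodes/node.json", ["138.197.26.39:9100"])`, and Prometheus would pick up the change at the next refresh.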

5.2 API-based service discovery

For example, Consul

5.3 DNS-based service discovery

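Exporter

The Prometheus server does not monitor specific targets directly. Its main job is to collect and store data and to serve queries over it.

Exporter is a fairly open concept: it can be a standalone program that runs apart from the monitored target, or it can be built into the target itself. All it has to do is provide monitoring samples to Prometheus in the standard format.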
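As a rough illustration of how little an exporter needs, here is a sketch that serves one metric in the text exposition format using only the Python standard library. The metric name, its value, the port, and the helper names are all made up for this example. In practice the official client libraries are the usual choice, and this sketch skips label handling and value escaping.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(metrics):
    """Render {name: (type, help, value)} as Prometheus text exposition lines."""
    lines = []
    for name, (mtype, help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {mtype}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric, for illustration only.
METRICS = {"demo_jobs_processed_total": ("counter", "Jobs processed since start.", 42)}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering GET /metrics on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` and adding `localhost:8000` as a target under static_configs would be enough for Prometheus to scrape it.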

PromQL

Metric types

1. counter: a counter only goes up, never down
Example: the cumulative time the CPU has spent in idle mode

# HELP node_cpu Seconds the cpus spent in each mode.
# TYPE node_cpu counter
node_cpu{cpu="cpu0",mode="idle"} 362812.7890625

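2. gauge: a measured value that can go up and down

node_memory_MemFree (the amount of memory currently free on the host)

For gauge metrics, the built-in PromQL function delta() returns how the samples changed over a time range. For example, to compute the difference in CPU temperature over two hours:

delta(cpu_temp_celsius{host="zeus"}[2h])

3. summary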

# HELP prometheus_tsdb_wal_fsync_duration_seconds Duration of WAL fsync.
# TYPE prometheus_tsdb_wal_fsync_duration_seconds summary
prometheus_tsdb_wal_fsync_duration_seconds{quantile="0.5"} 0.012352463
prometheus_tsdb_wal_fsync_duration_seconds{quantile="0.9"} 0.014458005
prometheus_tsdb_wal_fsync_duration_seconds{quantile="0.99"} 0.017316173
prometheus_tsdb_wal_fsync_duration_seconds_sum 2.888716127000002
prometheus_tsdb_wal_fsync_duration_seconds_count 216

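4. histogram
Like a summary, a histogram records the total number of observations (_count) and their sum (_sum). In addition, it directly reports how many observations fall into each bucket, where the le label gives the bucket's upper bound.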

# HELP prometheus_tsdb_compaction_chunk_range Final time range of chunks on their first compaction
# TYPE prometheus_tsdb_compaction_chunk_range histogram
prometheus_tsdb_compaction_chunk_range_bucket{le="409600"} 0
prometheus_tsdb_compaction_chunk_range_bucket{le="1.6384e+06"} 260
prometheus_tsdb_compaction_chunk_range_bucket{le="6.5536e+06"} 780
prometheus_tsdb_compaction_chunk_range_bucket{le="2.62144e+07"} 780
prometheus_tsdb_compaction_chunk_range_bucket{le="+Inf"} 780
prometheus_tsdb_compaction_chunk_range_sum 1.1540798e+09
prometheus_tsdb_compaction_chunk_range_count 780
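Because the bucket counts are cumulative, a quantile can be estimated from them by linear interpolation inside the bucket that contains the target rank. The Python sketch below shows that idea on the bucket values from the sample above. It is a simplified model in the spirit of PromQL's `histogram_quantile()`, not its implementation, and the function name is invented.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with the last entry being the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0  # assume the first bucket starts at 0
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(upper):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0] if i > 0 else math.nan
            # Linear interpolation inside the bucket that contains the rank.
            return lower_bound + (upper - lower_bound) * (rank - lower_count) / (count - lower_count)
        lower_bound, lower_count = upper, count
    return math.nan


# Bucket values copied from the sample output above.
chunk_range_buckets = [
    (409600, 0),
    (1.6384e06, 260),
    (6.5536e06, 780),
    (2.62144e07, 780),
    (math.inf, 780),
]
median = estimate_quantile(0.5, chunk_range_buckets)
```

For the median the target rank is 390 of 780, which lands in the bucket between 1.6384e+06 and 6.5536e+06. The estimate is therefore a point inside that bucket, and its precision is limited by how wide the bucket is.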

Reference:

The Grafana section of the Chinese Prometheus documentation by 青牛踏雪: https://www.prometheus.wang/grafana/

Grafana

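Grafana basic concepts

Data source:

In Grafana, anything that can supply it with data, such as Prometheus, is called a data source.

Dashboard:

A dashboard is a visual display area made up of panels. Grafana provides many panel types through plugins, for example the Graph panel and the Heatmap panel.

Panel:

A panel is the most basic visualization unit in a dashboard. Each panel can use a different data source. Grafana builds the visualization from queries against the data source. A dashboard also has rows, which organize and manage a group of related panels.

Templating variables:

Template variables make the visualization change dynamically.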

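Grafana data visualization (three panel types)

Grafana visualizes data through panels. As the most basic visualization component, a panel comes in three forms: Graph, Heatmap, and SingleStat.

This part mainly covers how to create panels in Grafana, change panel styles, and so on.

1. Graph panel

A Graph panel shows how monitoring samples change over time (a time series) as a line chart or a bar chart.

It suits Gauge and Counter data from Prometheus.

It makes comparing several series easy.

See 3.2.3 and 3.2.3.1: https://www.prometheus.wang/grafana/grafana-panels.html

2. Heatmap panel

Heat map

3. SingleStat

Mainly shows a single number for one metric

See 3.2.3.3: https://www.prometheus.wang/grafana/use_singlestat_panel.html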