A Complete Prometheus Monitoring Setup for Small and Medium-Sized Enterprises


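1   Plenty of blog posts already explain what Prometheus does and how it works, so I will not repeat that here. This post covers only deployment and configuration: it records how I built a Prometheus alerting platform for our company's product team and the pitfalls I hit along the way. Prometheus, Grafana, Alertmanager and Pushgateway all run on a single server. Every other monitored node runs node_exporter, which can also be deployed on the Prometheus server itself.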

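  1.1 Deploy Prometheus and manage it with systemctl

       Installed version: prometheus-2.6.1

               Baidu Cloud download: https://pan.baidu.com/s/1w16lQZKw8PCHqlRuSK2i7A

               Extraction code: lw1q

     Extract the package to /usr/local/prometheus. Deploying with an Ansible playbook is recommended.

     The systemd unit file used to manage the service is shown below. It lives at /usr/lib/systemd/system/prometheus.service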

[Unit]
Description=https://prometheus.io

[Service]
Restart=on-failure
ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml

[Install]
WantedBy=multi-user.target

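   1.2  The cleaned-up Prometheus configuration file. To monitor new machines, add a job_name entry with the addresses of the nodes; each of those nodes must have node_exporter installed.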

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 172.16.5.3:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
   - "rules/first_rules.yml"
   - "rules/second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090','172.16.5.3:9100']

  - job_name: 'pushgateway'
    scrape_interval: 5s
    static_configs:
    - targets: ['172.16.5.3:9091']
      labels:
        instance: pushgateway
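
The `pushgateway` job above only scrapes the gateway; something still has to push metrics into it. A minimal sketch of how a batch job might do that over the Pushgateway HTTP API follows. The address comes from the scrape config above, while the job name `nightly_backup`, the grouping label and the metric name are hypothetical and used only for illustration.

```python
from urllib import request

PUSHGATEWAY = "http://172.16.5.3:9091"  # address taken from the scrape config above


def build_push_url(job, grouping=None):
    """Return the Pushgateway URL for a job and optional grouping labels."""
    url = f"{PUSHGATEWAY}/metrics/job/{job}"
    for name, value in (grouping or {}).items():
        # Assumes plain label values; values containing "/" need special encoding.
        url += f"/{name}/{value}"
    return url


def build_payload(metric, value, help_text=""):
    """Render one gauge sample in the Prometheus text exposition format."""
    lines = []
    if help_text:
        lines.append(f"# HELP {metric} {help_text}")
    lines.append(f"# TYPE {metric} gauge")
    lines.append(f"{metric} {value}")
    return "\n".join(lines) + "\n"  # the request body must end with a newline


def push(job, metric, value, grouping=None):
    """PUT the sample, replacing earlier metrics pushed under the same grouping key."""
    req = request.Request(
        build_push_url(job, grouping),
        data=build_payload(metric, value).encode("utf-8"),
        method="PUT",
    )
    with request.urlopen(req, timeout=5) as resp:
        return resp.status
```

A call such as `push("nightly_backup", "backup_last_success_timestamp_seconds", 1700000000, {"instance": "db01"})` would send one sample. Because the scrape job above does not set `honor_labels`, Prometheus stores the pushed `job` and `instance` labels as `exported_job` and `exported_instance`.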

  1.3 The basic server monitoring rules are as follows

# cat second_rules.yml
groups:
- name: Instance liveness alert rules
  rules:
  - alert: Instance liveness alert
    expr: up{job="prometheus"} == 0 or up{job="Linux-host"} == 0
    for: 1m
    labels:
      user: prometheus
      severity: emergency
      team: HTY
    annotations:
      summary: "Instance {{ $labels.instance }} is down"
      description: "Instance {{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
      value: "{{ $value }}"
- name: Memory alert rules
  rules:
  - alert: "Memory usage alert"
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 30
    for: 1m
    labels:
      team: C3
      user: prometheus
      severity: warning
    annotations:
      summary: "Server: {{ $labels.instance }} memory alert"
      description: "{{ $labels.instance }} memory utilization is above 30%! (current value: {{ $value }}%)"
      value: "{{ $value }}"
- name: Memory alert rules 2
  rules:
  - alert: "Memory usage alert 2"
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 50
    for: 1m
    labels:
      team: C3
      user: prometheus
      severity: critical
    annotations:
      summary: "Server: {{ $labels.instance }} memory alert"
      description: "{{ $labels.instance }} memory utilization is above 50%! (current value: {{ $value }}%)"
      value: "{{ $value }}"
- name: CPU alert rules
  rules:
  - alert: CPU usage alert
    expr: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[1m]))) * 100 > 70
    for: 1m
    labels:
      user: prometheus
      severity: warning
    annotations:
      summary: "Server: {{ $labels.instance }} CPU alert"
      description: "Server: CPU usage above 70%! (current value: {{ $value }}%)"
      value: "{{ $value }}"
- name: Disk alert rules
  rules:
  - alert: Disk usage alert
    expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100 > 80
    for: 1m
    labels:
      user: prometheus
      severity: warning
    annotations:
      summary: "Server: {{ $labels.instance }} disk alert"
      description: "Server: {{ $labels.instance }}, disk device: usage above 80%! (mount point: {{ $labels.mountpoint }}, current value: {{ $value }}%)"
      value: "{{ $value }}"
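
To make the two memory rules concrete, here is a small worked example of the same arithmetic in Python. It is only a sketch of the formula and thresholds in the rules above; the function names and the sample figures (a 16 GiB host) are assumptions made up for illustration.

```python
GIB = 1024 ** 3


def memory_used_percent(total, free, buffers, cached):
    """Same arithmetic as the rule: (total - (free + buffers + cached)) / total * 100."""
    return (total - (free + buffers + cached)) / total * 100


def firing_severities(percent):
    """Return the severities whose threshold the value exceeds (30 and 50 in the rules)."""
    thresholds = {"warning": 30, "critical": 50}
    return [name for name, limit in thresholds.items() if percent > limit]


# 16 GiB total with 2 GiB free, 1 GiB buffers, 5 GiB cached: 8 GiB counted as used.
moderate = memory_used_percent(16 * GIB, 2 * GIB, 1 * GIB, 5 * GIB)
# 16 GiB total with 1 GiB free, 1 GiB buffers, 4 GiB cached: 10 GiB counted as used.
heavy = memory_used_percent(16 * GIB, 1 * GIB, 1 * GIB, 4 * GIB)

print(moderate, firing_severities(moderate))
print(heavy, firing_severities(heavy))
```

Note that at exactly 50% only the warning rule matches, because both rules use a strict `>` comparison, and that above 50% the warning and critical rules are true at the same time.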

  2 Install and configure Alertmanager

global:
  # WeCom (Enterprise WeChat) alert settings
  resolve_timeout: 5m
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_corp_id: 'ww41a2b13ef47aac58'
  wechat_api_secret: 'xxxxx'
  # QQ Mail alert settings
  smtp_from: xxx@qq.com
  smtp_auth_username: xx@qq.com
  smtp_auth_password: xxxx # must be obtained from QQ Mail
  smtp_require_tls: false
  smtp_smarthost: 'smtp.qq.com:465'
templates:
  - "/usr/local/alertmanager/template/*.tmpl"
route:
  receiver: 'default-receiver'
  group_wait: 10s
  group_interval: 30s
  repeat_interval: 1m
  group_by: ['team']
  routes:
  - group_by: ['test']
    group_wait: 10s
    group_interval: 30s
    repeat_interval: 1m
    receiver: 'wechat'
    match:
      team: test1
receivers:
- name: 'wechat'
  wechat_configs:
  - send_resolved: true
    message: '{{ template "wechat.default.message" .}}'
    to_party: 'xxxx'
    agent_id: "xxx" # must be obtained from WeCom
    api_secret: 'xxxxxxxx'
- name: 'default-receiver'
  email_configs:
  - to: 'xxxxxx@qq.com'
    send_resolved: true
    # html: '{{ template "wechat.default.message" .}}'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['env','team','instance','type','group','job','alertname']

  For how to obtain the WeCom credentials, see: https://www.cnblogs.com/miaocbin/p/13706164.html

  For how to obtain the QQ Mail credentials, see: https://blog.csdn.net/knight_zhou/article/details/105137581
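
The `inhibit_rules` entry in the configuration above is easy to misread, so here is a simplified Python model of how it decides whether a warning is muted. This is a sketch of the documented matching semantics, not Alertmanager's actual code; the alert names `MemoryHigh` and `MemoryHigh2` and the label values are hypothetical.

```python
INHIBIT_RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["env", "team", "instance", "type", "group", "job", "alertname"],
}


def matches(labels, matcher):
    """True when every label in the matcher has exactly the required value."""
    return all(labels.get(name) == value for name, value in matcher.items())


def is_inhibited(target, firing, rule=INHIBIT_RULE):
    """True when some firing source alert mutes the target alert under the rule."""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing:
        if source is target or not matches(source, rule["source_match"]):
            continue
        # Labels listed in `equal` must agree; a missing label counts as empty.
        if all(source.get(n, "") == target.get(n, "") for n in rule["equal"]):
            return True
    return False


warning = {"alertname": "MemoryHigh", "severity": "warning", "team": "C3",
           "instance": "172.16.5.3:9100", "job": "prometheus"}
critical_same_name = dict(warning, severity="critical")
critical_other_name = dict(critical_same_name, alertname="MemoryHigh2")

print(is_inhibited(warning, [warning, critical_same_name]))
print(is_inhibited(warning, [warning, critical_other_name]))
```

Because `alertname` is listed in `equal`, a critical alert only mutes a warning that carries the same alert name. Warning and critical rules defined under different alert names therefore do not inhibit each other with this rule.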

    3 The notification template

{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts.Firing -}}
{{- if eq $index 0 }}
========= Monitoring Alert =========
Alert status: {{ .Status }}
Alert severity: {{ .Labels.severity }}
Alert type: {{ $alert.Labels.alertname }}
Affected host: {{ $alert.Labels.instance }}
Alert summary: {{ $alert.Annotations.summary }}
Alert details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};
Trigger value: {{ .Annotations.value }}
Start time: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
========= = end =  =========
{{- end }}
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts.Resolved -}}
{{- if eq $index 0 }}
========= Alert Resolved =========
Alert type: {{ .Labels.alertname }}
Alert status: {{ .Status }}
Alert summary: {{ $alert.Annotations.summary }}
Alert details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};
Start time: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
Recovery time: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
Instance: {{ $alert.Labels.instance }}
{{- end }}
========= = end =  =========
{{- end }}
{{- end }}
{{- end }}
{{- end }}
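
The constant `28800e9` in the template is a duration in nanoseconds that shifts the timestamps by eight hours, and `"2006-01-02 15:04:05"` is Go's reference-time layout for year-month-day hour:minute:second. The short Python sketch below reproduces the same arithmetic, assuming the timestamps arrive in UTC; the function names and the sample timestamp are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

OFFSET_NS = 28800e9  # the constant used in the template, in nanoseconds


def offset_hours(ns=OFFSET_NS):
    """Convert the nanosecond offset into hours."""
    return ns / 1e9 / 3600


def format_alert_time(timestamp_utc):
    """Shift a UTC timestamp by the offset and format it like the template does."""
    shifted = timestamp_utc + timedelta(seconds=OFFSET_NS / 1e9)
    return shifted.strftime("%Y-%m-%d %H:%M:%S")


sample = datetime(2021, 1, 1, 20, 30, 0, tzinfo=timezone.utc)
print(offset_hours())
print(format_alert_time(sample))
```

For the sample above, 20:30 UTC on 1 January becomes 04:30 on 2 January, which is the UTC+8 wall-clock time the notification is meant to show.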

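  4 Install and deploy Grafana. Installing the latest version of Prometheus is recommended, then use the plugin. A fairly clean Grafana dashboard is attached below.

[Figure 1: Grafana dashboard]

   Import the dashboard template directly. For the import steps, see this blog post: https://www.cnblogs.com/wukc/p/14231042.html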
