企业级监控系统架构实战:Zabbix、Prometheus+Grafana多维度监控设计

一、企业级监控系统概述

1.1 监控系统的价值

监控系统是现代IT架构的“眼睛”,它能够:

  • 实时感知系统健康状态:及时发现性能瓶颈和故障
  • 预测性运维:通过趋势分析预测潜在问题
  • 决策支持:为容量规划和架构优化提供数据支撑
  • 故障快速定位:缩短MTTR(平均修复时间)

1.2 监控系统分类

传统监控方案:Zabbix

优势

  • 成熟稳定,功能全面
  • 开箱即用,配置相对简单
  • 支持丰富的可视化模板
  • 告警功能强大

适用场景

  • 传统IT基础设施监控
  • 主机、网络设备监控
  • 对稳定性要求较高的场景

云原生监控方案:Prometheus+Grafana

优势

  • 基于时序数据的强大查询语言
  • 适合云原生、容器化环境
  • 扩展性强,生态丰富
  • 与Kubernetes深度集成

适用场景

  • 容器化应用监控
  • 微服务架构监控
  • 大规模分布式系统

二、Zabbix监控架构设计

2.1 Zabbix架构组件

Zabbix采用经典的分层架构:

  • Zabbix Agent:部署在被监控主机上,采集数据
  • Zabbix Server:核心服务,负责数据收集、处理、告警
  • Zabbix Database:存储配置、历史数据
  • Zabbix Web:Web界面,提供可视化
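
以 Agent 侧为例,一个最小可用的配置只需指定 Server 地址与主机名(以下 IP 与主机名均为示例值):

# /etc/zabbix/zabbix_agentd.conf 最小配置示例
Server=192.168.1.5          # 允许哪台Zabbix Server被动拉取数据
ServerActive=192.168.1.5    # 主动模式下向哪台Server上报数据
Hostname=web-server-01      # 需与前端中登记的主机名完全一致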

2.2 Zabbix监控项配置

CPU监控指标

# Zabbix Agent配置示例
UserParameter=cpu.utilization[*],top -bn1 | grep "Cpu(s)" | sed "s/.*, *\([0-9.]*\)%* id.*/\1/" | awk '{print 100 - $1}'

关键监控指标

  • 整体CPU使用率
  • 各核心CPU使用率
  • CPU负载(Load Average)
  • 中断和上下文切换次数

内存监控指标

# 内存监控配置
UserParameter=memory.total,free -m | awk '/^Mem:/{print $2}'
UserParameter=memory.used,free -m | awk '/^Mem:/{print $3}'
UserParameter=memory.free,free -m | awk '/^Mem:/{print $4}'
UserParameter=memory.available,free -m | awk '/^Mem:/{print $7}'

关键监控指标

  • 总内存、已用内存、可用内存
  • 交换分区使用情况
  • Buffer和Cache使用情况
  • 内存使用率趋势

磁盘监控指标

# 磁盘监控配置
UserParameter=disk.total[*],df -h $1 | tail -1 | awk '{print $2}'
UserParameter=disk.used[*],df -h $1 | tail -1 | awk '{print $3}'
UserParameter=disk.available[*],df -h $1 | tail -1 | awk '{print $4}'
UserParameter=disk.io.read[*],iostat -x 1 2 | grep $1 | tail -1 | awk '{print $6}'
UserParameter=disk.io.write[*],iostat -x 1 2 | grep $1 | tail -1 | awk '{print $7}'

关键监控指标

  • 磁盘空间使用率
  • 磁盘I/O读写量
  • 磁盘I/O等待时间
  • INode使用情况

2.3 Zabbix模板配置

Linux系统模板配置

{
  "name": "Linux Server",
  "items": [
    {
      "name": "CPU utilization",
      "key": "system.cpu.util",
      "type": "0"
    },
    {
      "name": "Memory utilization",
      "key": "vm.memory.size[pused]",
      "type": "0"
    },
    {
      "name": "Disk space usage",
      "key": "vfs.fs.size[/,pused]",
      "type": "0"
    },
    {
      "name": "Disk I/O read",
      "key": "vfs.dev.read[/dev/sda]",
      "type": "0"
    },
    {
      "name": "Disk I/O write",
      "key": "vfs.dev.write[/dev/sda]",
      "type": "0"
    }
  ],
  "triggers": [
    {
      "name": "High CPU usage",
      "expression": "{Linux Server:system.cpu.util.avg(5m)}>80"
    },
    {
      "name": "High memory usage",
      "expression": "{Linux Server:vm.memory.size[pused].avg(5m)}>90"
    },
    {
      "name": "Low disk space",
      "expression": "{Linux Server:vfs.fs.size[/,pused].last()}>80"
    }
  ]
}

2.4 Zabbix告警配置

告警动作配置

# 告警动作示例
Actions:
  - name: "Critical System Alerts"
    conditions:
      - host: Linux Server
        severity: High
    operations:
      - type: send-message
        to: ops-team@company.com
        subject: "[ALERT] {TRIGGER.NAME}"
        message: |
          Alert: {TRIGGER.NAME}
          Host: {HOST.NAME}
          Status: {TRIGGER.STATUS}
          Time: {EVENT.DATE} {EVENT.TIME}
          Value: {ITEM.VALUE}
      - type: send-sms
        to: ["+1234567890"]
      - type: execute-remote-command
        command: /opt/scripts/auto-heal.sh

三、Prometheus监控架构设计

3.1 Prometheus核心概念

Prometheus数据模型

  • Metric:指标名称和标签键值对
  • Sample:时序数据点(时间戳+值)
  • Exporter:指标采集器
  • Scrape:定期拉取指标数据
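
结合一段 /metrics 端点的文本输出,可以更直观地理解该数据模型(指标名与数值为示意):

# HELP http_requests_total Total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027
http_requests_total{method="post",code="400"} 3

指标名加标签键值对唯一确定一条时间序列,Prometheus 每次抓取(Scrape)为该序列追加一个带时间戳的样本(Sample)。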

3.2 Prometheus配置

prometheus.yml核心配置

# Prometheus配置文件
global:
  scrape_interval: 15s      # 默认抓取间隔
  evaluation_interval: 15s  # 规则评估间隔
  external_labels:
    cluster: 'production'
    region: 'cn-beijing'

# 告警规则配置
rule_files:
  - "/etc/prometheus/rules/*.yml"

# 抓取配置
scrape_configs:
  # Job: 监控Prometheus自身
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Job: 监控Node Exporter
  - job_name: 'node-exporter'
    scrape_interval: 10s
    static_configs:
      - targets:
          - '192.168.1.10:9100'
          - '192.168.1.11:9100'
          - '192.168.1.12:9100'
        labels:
          env: 'production'
          role: 'application'

  # Job: 监控MySQL
  - job_name: 'mysql'
    static_configs:
      - targets: ['192.168.1.20:9104']
        labels:
          database: 'main-db'

  # Job: 监控Redis
  - job_name: 'redis'
    static_configs:
      - targets: ['192.168.1.30:9121']
        labels:
          cache: 'session-cache'

  # Job: 监控Kafka
  - job_name: 'kafka'
    static_configs:
      - targets: ['192.168.1.40:9308']
        labels:
          queue: 'message-queue'

  # Job: 监控应用服务
  - job_name: 'springboot-app'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets:
          - '192.168.1.100:8080'
          - '192.168.1.101:8080'
          - '192.168.1.102:8080'
        labels:
          app: 'user-service'
          version: 'v1.0'

  # Job: 黑盒监控 - 服务存活检测
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://www.example.com
          - http://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.1.50:9115

# Alertmanager配置
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'

# 远程存储配置
remote_write:
  - url: "http://thanos-receiver:10908/api/v1/receive"
    queue_config:
      max_samples_per_send: 1000
      max_shards: 200
      capacity: 10000

3.3 Node Exporter监控指标

系统指标采集

# 安装Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar -xzf node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64
./node_exporter

# 常用收集器配置(v1.x 起挂载点排除参数更名为 mount-points-exclude)
./node_exporter \
  --collector.systemd \
  --collector.textfile.directory=/var/lib/node_exporter/textfile_collector \
  --collector.filesystem.mount-points-exclude="^/(sys|proc|dev|host|etc)($|/)"
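
生产环境通常用 systemd 托管 Node Exporter 而非前台运行,以下是一个示意的 unit 文件(安装路径与运行用户为假设):

# /etc/systemd/system/node_exporter.service(路径、用户为示例)
[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter \
    --collector.filesystem.mount-points-exclude="^/(sys|proc|dev|host|etc)($|/)"
Restart=always

[Install]
WantedBy=multi-user.target

部署后执行 systemctl daemon-reload && systemctl enable --now node_exporter 即可随系统自启。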

Node Exporter提供的关键指标

# CPU指标
node_cpu_seconds_total{cpu="0",mode="idle"}
node_cpu_seconds_total{cpu="0",mode="user"}
node_cpu_seconds_total{cpu="0",mode="system"}

# 内存指标
node_memory_MemTotal_bytes
node_memory_MemAvailable_bytes
node_memory_MemFree_bytes
node_memory_Buffers_bytes
node_memory_Cached_bytes
node_memory_SwapTotal_bytes
node_memory_SwapFree_bytes

# 磁盘指标
node_disk_read_bytes_total{device="sda"}
node_disk_write_bytes_total{device="sda"}
node_disk_io_time_seconds_total{device="sda"}
node_disk_read_time_seconds_total{device="sda"}
node_disk_write_time_seconds_total{device="sda"}

# 文件系统指标
node_filesystem_size_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"}
node_filesystem_avail_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"}
node_filesystem_free_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"}

# 网络指标
node_network_receive_bytes_total{device="eth0"}
node_network_transmit_bytes_total{device="eth0"}
node_network_receive_packets_total{device="eth0"}
node_network_transmit_packets_total{device="eth0"}

# Load指标
node_load1
node_load5
node_load15

3.4 PromQL查询示例

CPU使用率查询

# CPU使用率(百分比,按主机聚合,保留 instance 标签)
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# 单核CPU使用率
100 - (rate(node_cpu_seconds_total{cpu="0",mode="idle"}[5m]) * 100)

# 各核心CPU使用率统计
sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (cpu)

# 应用进程CPU使用率
rate(process_cpu_seconds_total[5m]) * 100

内存使用率查询

# 总内存使用率
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# 内存已使用量(GB)
node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes

# 缓冲区使用率
node_memory_Buffers_bytes / node_memory_MemTotal_bytes * 100

# 缓存使用率
node_memory_Cached_bytes / node_memory_MemTotal_bytes * 100

# 交换分区使用率
(1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100

磁盘使用率查询

# 磁盘使用率(百分比)
(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100

# 磁盘剩余空间(GB)
node_filesystem_avail_bytes{mountpoint="/"} / 1024 / 1024 / 1024

# 磁盘I/O使用率(io_time 的增长速率即为磁盘繁忙时间占比)
rate(node_disk_io_time_seconds_total{device="sda"}[5m]) * 100

# 磁盘读取速度(MB/s)
rate(node_disk_read_bytes_total{device="sda"}[5m]) / 1024 / 1024

# 磁盘写入速度(MB/s)
rate(node_disk_write_bytes_total{device="sda"}[5m]) / 1024 / 1024

服务存活监控查询

# HTTP服务存活检测
probe_success{job="blackbox"}

# 服务响应时间
probe_http_duration_seconds{job="blackbox"}

# 服务状态码
probe_http_status_code{job="blackbox"}

# TCP连接存活检测
probe_success{job="tcp-probe"}

# 服务可用性百分比(最近5分钟)
avg_over_time(probe_success{job="blackbox"}[5m]) * 100
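
上述 probe_* 指标由 Blackbox Exporter 产生,其探测行为由 blackbox.yml 中的模块定义,模块名与 prometheus.yml 中 params.module 对应(以下为示意配置):

# blackbox.yml - 探测模块定义(示意)
modules:
  http_2xx:            # 对应抓取配置中的 module: [http_2xx]
    prober: http
    timeout: 5s
    http:
      method: GET
  tcp_connect:         # 供 tcp-probe 类 job 使用的TCP探活模块
    prober: tcp
    timeout: 5s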

3.5 Prometheus告警规则

告警规则文件配置

# alerts.yml - 告警规则配置
groups:
  - name: system_alerts
    interval: 30s
    rules:
      # CPU告警(按 instance 聚合,确保告警携带主机标签)
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU使用率过高"
          description: "主机 {{ $labels.instance }} CPU使用率超过80% (当前: {{ $value }}%)"

      - alert: CriticalCPUUsage
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "CPU使用率严重过高"
          description: "主机 {{ $labels.instance }} CPU使用率超过95% (当前: {{ $value }}%)"

      # 内存告警
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "内存使用率过高"
          description: "主机 {{ $labels.instance }} 内存使用率超过85% (当前: {{ $value }}%)"

      - alert: CriticalMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "内存使用率严重过高"
          description: "主机 {{ $labels.instance }} 内存使用率超过95% (当前: {{ $value }}%)"

      # 磁盘告警
      - alert: HighDiskUsage
        expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "磁盘使用率过高"
          description: "主机 {{ $labels.instance }} 磁盘使用率超过80% (当前: {{ $value }}%)"

      - alert: CriticalDiskUsage
        expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 90
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "磁盘使用率严重过高"
          description: "主机 {{ $labels.instance }} 磁盘使用率超过90% (当前: {{ $value }}%)"

      # 磁盘I/O告警(io_time 速率即繁忙时间占比)
      - alert: HighDiskIO
        expr: rate(node_disk_io_time_seconds_total[5m]) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "磁盘I/O使用率过高"
          description: "主机 {{ $labels.instance }} 磁盘 {{ $labels.device }} I/O使用率超过80%"

      # Load告警(以单核负载1.5为阈值,CPU核数按 instance 统计)
      - alert: HighLoadAverage
        expr: node_load1 / count by (instance) (node_cpu_seconds_total{mode="idle"}) > 1.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "系统负载过高"
          description: "主机 {{ $labels.instance }} 1分钟平均负载过高 (当前: {{ $value }})"

  - name: service_alerts
    interval: 30s
    rules:
      # 服务存活告警
      - alert: ServiceDown
        expr: probe_success{job="blackbox"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "服务不可用"
          description: "服务 {{ $labels.instance }} 不可用,探活检测失败"

      - alert: ServiceSlow
        expr: probe_http_duration_seconds{job="blackbox"} > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "服务响应缓慢"
          description: "服务 {{ $labels.instance }} 响应时间超过3秒 (当前: {{ $value }}s)"

      # HTTP状态码告警
      - alert: ServiceHTTPError
        expr: probe_http_status_code{job="blackbox"} != 200
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "服务返回错误码"
          description: "服务 {{ $labels.instance }} 返回HTTP状态码 {{ $value }}"

      # TCP连接告警
      - alert: TCPConnectionFailed
        expr: probe_success{job="tcp-probe"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "TCP连接失败"
          description: "TCP服务 {{ $labels.instance }} 连接失败"

3.6 Alertmanager配置

# alertmanager.yml - 告警管理器配置
global:
  resolve_timeout: 5m
  # 邮件配置
  smtp_smarthost: 'smtp.qq.com:587'
  smtp_from: 'monitor@example.com'
  smtp_auth_username: 'monitor@example.com'
  smtp_auth_password: 'your_password'

# 路由配置
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h

  routes:
    # 严重级别告警直接发送
    - match:
        severity: critical
      receiver: 'critical-alert'
      group_wait: 0s
      continue: true

    # 系统告警路由
    - match:
        alertname: HighCPUUsage
      receiver: 'system-team'
      group_by: ['instance']

    # 服务告警路由
    - match:
        alertname: ServiceDown
      receiver: 'ops-team'
      continue: true

# 抑制规则
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'instance']

# 接收者配置
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'ops-team@example.com'
        headers:
          Subject: '{{ .GroupLabels.alertname }}'
    webhook_configs:
      - url: 'http://webhook:8080/alert'
        send_resolved: true

  - name: 'critical-alert'
    email_configs:
      - to: 'ops-manager@example.com'
        headers:
          Subject: '[CRITICAL] {{ .GroupLabels.alertname }}'
    # 钉钉通知(通过机器人 webhook 对接)
    webhook_configs:
      - url: 'https://oapi.dingtalk.com/robot/send?access_token=xxx'
        send_resolved: true
    # 企业微信通知
    wechat_configs:
      - corp_id: 'xxx'
        api_secret: 'xxx'
        api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
        to_user: '@all'
        agent_id: 'xxx'
        send_resolved: true

  - name: 'system-team'
    email_configs:
      - to: 'system-team@example.com'

  - name: 'ops-team'
    email_configs:
      - to: 'ops-team@example.com'
    # Alertmanager 无原生短信通道,短信需通过 webhook 对接短信网关(地址为示例)
    webhook_configs:
      - url: 'http://sms-gateway:8080/send'

四、Grafana可视化仪表板

4.1 Grafana仪表板配置

CPU监控仪表板

{
  "dashboard": {
    "title": "CPU监控面板",
    "panels": [
      {
        "title": "CPU使用率",
        "targets": [
          {
            "expr": "100 - (avg(irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "CPU使用率"
          }
        ],
        "type": "graph"
      },
      {
        "title": "各核心CPU使用率",
        "targets": [
          {
            "expr": "sum(rate(node_cpu_seconds_total{mode!=\"idle\"}[5m])) by (cpu)",
            "legendFormat": "CPU {{cpu}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "系统负载",
        "targets": [
          {
            "expr": "node_load1",
            "legendFormat": "Load 1min"
          },
          {
            "expr": "node_load5",
            "legendFormat": "Load 5min"
          },
          {
            "expr": "node_load15",
            "legendFormat": "Load 15min"
          }
        ],
        "type": "graph"
      }
    ]
  }
}

内存监控仪表板

{
  "dashboard": {
    "title": "内存监控面板",
    "panels": [
      {
        "title": "内存使用率",
        "targets": [
          {
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
            "legendFormat": "内存使用率"
          }
        ],
        "type": "gauge",
        "thresholds": [
          {"value": 0, "color": "green"},
          {"value": 70, "color": "yellow"},
          {"value": 85, "color": "red"}
        ]
      },
      {
        "title": "内存详细使用",
        "targets": [
          {
            "expr": "node_memory_MemTotal_bytes - node_memory_MemFree_bytes",
            "legendFormat": "已使用"
          },
          {
            "expr": "node_memory_Buffers_bytes",
            "legendFormat": "Buffers"
          },
          {
            "expr": "node_memory_Cached_bytes",
            "legendFormat": "Cached"
          },
          {
            "expr": "node_memory_MemFree_bytes",
            "legendFormat": "空闲"
          }
        ],
        "type": "piechart"
      },
      {
        "title": "交换分区使用",
        "targets": [
          {
            "expr": "(1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100",
            "legendFormat": "Swap使用率"
          }
        ],
        "type": "graph"
      }
    ]
  }
}

磁盘监控仪表板

{
  "dashboard": {
    "title": "磁盘监控面板",
    "panels": [
      {
        "title": "磁盘使用率",
        "targets": [
          {
            "expr": "(1 - (node_filesystem_avail_bytes{mountpoint=\"/\"} / node_filesystem_size_bytes{mountpoint=\"/\"})) * 100",
            "legendFormat": "根分区"
          }
        ],
        "type": "gauge"
      },
      {
        "title": "磁盘I/O读写速度",
        "targets": [
          {
            "expr": "rate(node_disk_read_bytes_total{device=\"sda\"}[5m]) / 1024 / 1024",
            "legendFormat": "读取速度 MB/s"
          },
          {
            "expr": "rate(node_disk_write_bytes_total{device=\"sda\"}[5m]) / 1024 / 1024",
            "legendFormat": "写入速度 MB/s"
          }
        ],
        "type": "graph"
      },
      {
        "title": "磁盘I/O使用率",
        "targets": [
          {
            "expr": "rate(node_disk_io_time_seconds_total{device=\"sda\"}[5m]) * 100",
            "legendFormat": "I/O使用率"
          }
        ],
        "type": "graph"
      },
      {
        "title": "各分区使用情况",
        "targets": [
          {
            "expr": "(1 - (node_filesystem_avail_bytes{mountpoint!=\"\"} / node_filesystem_size_bytes{mountpoint!=\"\"})) * 100",
            "legendFormat": "{{mountpoint}}"
          }
        ],
        "type": "table"
      }
    ]
  }
}

服务存活监控仪表板

{
  "dashboard": {
    "title": "服务存活监控面板",
    "panels": [
      {
        "title": "服务存活状态",
        "targets": [
          {
            "expr": "probe_success{job=\"blackbox\"}",
            "legendFormat": "{{instance}}"
          }
        ],
        "type": "stat",
        "thresholds": [
          {"value": 0, "color": "red"},
          {"value": 1, "color": "green"}
        ]
      },
      {
        "title": "服务响应时间",
        "targets": [
          {
            "expr": "probe_http_duration_seconds{job=\"blackbox\"}",
            "legendFormat": "{{instance}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "HTTP状态码",
        "targets": [
          {
            "expr": "probe_http_status_code{job=\"blackbox\"}",
            "legendFormat": "{{instance}}"
          }
        ],
        "type": "table"
      },
      {
        "title": "服务可用性",
        "targets": [
          {
            "expr": "avg_over_time(probe_success{job=\"blackbox\"}[5m]) * 100",
            "legendFormat": "{{instance}}可用率"
          }
        ],
        "type": "gauge",
        "thresholds": [
          {"value": 0, "color": "red"},
          {"value": 95, "color": "yellow"},
          {"value": 99, "color": "green"}
        ]
      }
    ]
  }
}

4.2 综合监控仪表板

基础设施监控大盘

{
  "dashboard": {
    "title": "基础设施监控大盘",
    "rows": [
      {
        "title": "概览指标",
        "panels": [
          {
            "title": "服务器总览",
            "type": "stat",
            "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0}
          },
          {
            "title": "在线服务器",
            "type": "stat",
            "gridPos": {"h": 4, "w": 6, "x": 6, "y": 0}
          },
          {
            "title": "告警总数",
            "type": "stat",
            "gridPos": {"h": 4, "w": 6, "x": 12, "y": 0}
          }
        ]
      },
      {
        "title": "CPU监控",
        "panels": [
          {
            "title": "CPU使用率",
            "expr": "100 - (avg(irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "type": "graph",
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 4}
          },
          {
            "title": "Top CPU进程",
            "type": "table",
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 4}
          }
        ]
      },
      {
        "title": "内存监控",
        "panels": [
          {
            "title": "内存使用率",
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
            "type": "graph",
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 12}
          },
          {
            "title": "内存详细分布",
            "type": "piechart",
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 12}
          }
        ]
      },
      {
        "title": "磁盘监控",
        "panels": [
          {
            "title": "磁盘使用率",
            "type": "graph",
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 20}
          },
          {
            "title": "磁盘I/O",
            "type": "graph",
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 20}
          }
        ]
      },
      {
        "title": "服务监控",
        "panels": [
          {
            "title": "服务存活状态",
            "type": "stat",
            "gridPos": {"h": 8, "w": 24, "x": 0, "y": 28}
          }
        ]
      }
    ]
  }
}

五、Spring Boot应用集成监控

5.1 Spring Boot Actuator配置

# application.yml
management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
      base-path: /actuator
  endpoint:
    health:
      show-details: always
    metrics:
      enabled: true
  metrics:
    export:
      prometheus:
        enabled: true
        step: 10s
        descriptions: true
    tags:
      application: user-service
      environment: production
    distribution:
      percentiles:
        # 含点号的指标名作为Map键时需用中括号转义
        "[http.server.requests]": 0.5, 0.9, 0.95, 0.99
        "[jdbc.query.timing]": 0.5, 0.9, 0.95, 0.99

5.2 自定义业务指标

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.DistributionSummary;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;

@Component
@Slf4j
public class BusinessMetrics {

    private final Counter orderCounter;
    private final Counter paymentCounter;
    private final Counter errorCounter;
    // Micrometer 中的分布统计类型为 DistributionSummary
    private final DistributionSummary orderAmountSummary;
    private final Gauge activeUserGauge;

    public BusinessMetrics(MeterRegistry registry) {
        // 订单计数器
        this.orderCounter = Counter.builder("business.order.total")
                .description("总订单数")
                .tag("type", "order")
                .register(registry);

        // 支付计数器
        this.paymentCounter = Counter.builder("business.payment.total")
                .description("总支付数")
                .tag("type", "payment")
                .register(registry);

        // 错误计数器
        this.errorCounter = Counter.builder("business.error.total")
                .description("业务错误数")
                .tag("level", "error")
                .register(registry);

        // 订单金额统计
        this.orderAmountSummary = DistributionSummary.builder("business.order.amount")
                .description("订单金额统计")
                .register(registry);

        // 活跃用户数(Gauge 每次抓取时回调取值)
        this.activeUserGauge = Gauge.builder("business.user.active",
                        this, BusinessMetrics::getActiveUserCount)
                .description("活跃用户数")
                .register(registry);
    }

    public void recordOrder(double amount) {
        orderCounter.increment();
        orderAmountSummary.record(amount);
        log.info("记录订单指标: amount={}", amount);
    }

    public void recordPayment() {
        paymentCounter.increment();
        log.info("记录支付指标");
    }

    public void recordError(String errorType) {
        errorCounter.increment();
        log.error("记录错误指标: type={}", errorType);
    }

    private double getActiveUserCount() {
        // 从缓存或数据库获取活跃用户数(RedisUtil 为项目内工具类)
        return RedisUtil.getLong("active:user:count").orElse(0L).doubleValue();
    }
}
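
业务代码中的调用方式示意如下(OrderService、OrderRequest 均为假设的业务类,仅演示埋点位置):

// 调用示例:在下单成功后记录订单指标,异常时记录错误指标
@Service
@RequiredArgsConstructor
public class OrderService {

    private final BusinessMetrics businessMetrics;

    public void createOrder(OrderRequest request) {
        try {
            // ...下单业务逻辑(省略)
            businessMetrics.recordOrder(request.getAmount());
        } catch (Exception e) {
            businessMetrics.recordError("order-create");
            throw e;
        }
    }
}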

5.3 JVM监控指标

import java.lang.management.ManagementFactory;
import javax.annotation.PostConstruct;

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.Metrics;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;

@Component
@Slf4j
public class JVMMetrics {

    @PostConstruct
    public void init() {
        // JVM堆内存监控
        Gauge.builder("jvm.memory.heap.used", () ->
                        ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed())
                .description("JVM堆内存已使用")
                .register(Metrics.globalRegistry);

        // JVM堆内存最大
        Gauge.builder("jvm.memory.heap.max", () ->
                        ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getMax())
                .description("JVM堆内存最大")
                .register(Metrics.globalRegistry);

        // GC次数与耗时监控(getGarbageCollectorMXBeans 返回 List,直接遍历)
        ManagementFactory.getGarbageCollectorMXBeans()
                .forEach(gcBean -> {
                    Gauge.builder("jvm.gc.count", gcBean::getCollectionCount)
                            .description("GC次数")
                            .tag("name", gcBean.getName())
                            .register(Metrics.globalRegistry);

                    Gauge.builder("jvm.gc.time", gcBean::getCollectionTime)
                            .description("GC耗时")
                            .tag("name", gcBean.getName())
                            .register(Metrics.globalRegistry);
                });

        // 线程监控
        Gauge.builder("jvm.thread.count",
                        ManagementFactory.getThreadMXBean()::getThreadCount)
                .description("JVM线程数")
                .register(Metrics.globalRegistry);

        log.info("JVM监控指标初始化完成");
    }
}
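
需要说明的是,Micrometer 已内置等价的 JVM 指标绑定器,Spring Boot Actuator 默认会自动注册它们(输出 jvm_memory_used_bytes、jvm_gc_pause_seconds 等标准指标);若未使用自动配置,可按如下方式手工绑定(示意):

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.jvm.JvmGcMetrics;
import io.micrometer.core.instrument.binder.jvm.JvmMemoryMetrics;
import io.micrometer.core.instrument.binder.jvm.JvmThreadMetrics;

public class JvmBinderRegistrar {

    public static void bindAll(MeterRegistry registry) {
        new JvmMemoryMetrics().bindTo(registry);   // 堆/非堆内存
        new JvmGcMetrics().bindTo(registry);       // GC 次数与停顿
        new JvmThreadMetrics().bindTo(registry);   // 线程状态
    }
}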

六、高可用部署架构

6.1 Prometheus高可用部署

# docker-compose-prometheus-ha.yml
version: '3.8'

services:
  # Prometheus主实例
  prometheus-1:
    image: prom/prometheus:latest
    container_name: prometheus-1
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus-1.yml:/etc/prometheus/prometheus.yml
      - prometheus-1-data:/prometheus
    ports:
      - "9091:9090"
    networks:
      - prometheus-network

  # Prometheus备实例
  prometheus-2:
    image: prom/prometheus:latest
    container_name: prometheus-2
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus-2.yml:/etc/prometheus/prometheus.yml
      - prometheus-2-data:/prometheus
    ports:
      - "9092:9090"
    networks:
      - prometheus-network

  # Thanos Sidecar
  thanos-sidecar-1:
    image: thanosio/thanos:latest
    container_name: thanos-sidecar-1
    command:
      - sidecar
      - --tsdb.path=/prometheus
      # 容器网络内需使用服务名而非 localhost
      - --prometheus.url=http://prometheus-1:9090
      - --objstore.config-file=/etc/thanos/storage.yaml
    volumes:
      - ./storage.yaml:/etc/thanos/storage.yaml
      - prometheus-1-data:/prometheus
    depends_on:
      - prometheus-1
    networks:
      - prometheus-network

  thanos-sidecar-2:
    image: thanosio/thanos:latest
    container_name: thanos-sidecar-2
    command:
      - sidecar
      - --tsdb.path=/prometheus
      - --prometheus.url=http://prometheus-2:9090
      - --objstore.config-file=/etc/thanos/storage.yaml
    volumes:
      - ./storage.yaml:/etc/thanos/storage.yaml
      - prometheus-2-data:/prometheus
    depends_on:
      - prometheus-2
    networks:
      - prometheus-network

  # Thanos Query
  thanos-query:
    image: thanosio/thanos:latest
    container_name: thanos-query
    command:
      - query
      - --query.auto-downsampling
      - --store=thanos-sidecar-1:10901
      - --store=thanos-sidecar-2:10901
      - --store=thanos-store-gateway:10901
    ports:
      - "10902:10902"
    depends_on:
      - thanos-sidecar-1
      - thanos-sidecar-2
    networks:
      - prometheus-network

  # Alertmanager集群
  alertmanager-1:
    image: prom/alertmanager:latest
    container_name: alertmanager-1
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--cluster.listen-address=0.0.0.0:9094'
      - '--cluster.peer=alertmanager-2:9094'
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager-1-data:/alertmanager
    ports:
      - "9093:9093"
    networks:
      - prometheus-network

  alertmanager-2:
    image: prom/alertmanager:latest
    container_name: alertmanager-2
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--cluster.listen-address=0.0.0.0:9094'
      - '--cluster.peer=alertmanager-1:9094'
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager-2-data:/alertmanager
    networks:
      - prometheus-network

  # Grafana
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_SERVER_ROOT_URL=http://localhost:3000
    volumes:
      - grafana-data:/var/lib/grafana
    ports:
      - "3000:3000"
    depends_on:
      - prometheus-1
      - prometheus-2
    networks:
      - prometheus-network

volumes:
  prometheus-1-data:
  prometheus-2-data:
  alertmanager-1-data:
  alertmanager-2-data:
  grafana-data:

networks:
  prometheus-network:
    driver: bridge

6.2 Kubernetes部署配置

# prometheus-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:latest
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: prometheus-config
              mountPath: /etc/prometheus
            - name: prometheus-storage
              mountPath: /prometheus
          resources:
            requests:
              memory: "2Gi"
              cpu: "1000m"
            limits:
              memory: "4Gi"
              cpu: "2000m"
      volumes:
        - name: prometheus-config
          configMap:
            name: prometheus-config
        - name: prometheus-storage
          persistentVolumeClaim:
            claimName: prometheus-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
spec:
  type: LoadBalancer
  ports:
    - port: 9090
      targetPort: 9090
  selector:
    app: prometheus
---
# prometheus-service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  endpoints:
    - port: metrics
      interval: 30s
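
上面的 ServiceMonitor 通过标签 app: node-exporter 选择目标 Service,对应的 Node Exporter 通常以 DaemonSet 方式部署在每个节点上,可参考如下示意(镜像版本为假设值):

# node-exporter-daemonset.yaml(示意)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    app: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true      # 直接使用宿主机网络,便于采集主机指标
      hostPID: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.6.1
          ports:
            - name: metrics
              containerPort: 9100
---
apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    app: node-exporter     # 与 ServiceMonitor 的 selector 匹配
spec:
  ports:
    - name: metrics        # 端口名与 ServiceMonitor 的 endpoints.port 对应
      port: 9100
  selector:
    app: node-exporter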

七、监控最佳实践

7.1 监控指标体系设计

黄金指标(4个黄金信号)

1. 延迟(Latency)
   - 请求处理时间
   - 数据库查询时间
   - 缓存响应时间

2. 流量(Traffic)
   - QPS(每秒查询数)
   - TPS(每秒事务数)
   - 并发连接数

3. 错误(Errors)
   - 错误率
   - 5xx状态码数量
   - 超时次数

4. 饱和度(Saturation)
   - CPU使用率
   - 内存使用率
   - 磁盘I/O使用率
   - 队列长度
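
以前文 Spring Boot 应用暴露的指标为例,四个黄金信号大致可用如下 PromQL 表达(P99 查询需开启 percentiles-histogram 才会产生 _bucket 序列,job 名沿用前文配置):

# 延迟: P99 请求耗时(基于 Actuator 的 http_server_requests 直方图)
histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket{job="springboot-app"}[5m])) by (le))

# 流量: QPS
sum(rate(http_server_requests_seconds_count{job="springboot-app"}[5m]))

# 错误: 5xx 错误率
sum(rate(http_server_requests_seconds_count{job="springboot-app",status=~"5.."}[5m]))
  / sum(rate(http_server_requests_seconds_count{job="springboot-app"}[5m]))

# 饱和度: CPU 使用率(复用前文表达式)
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)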

业务指标设计

业务指标分类:
  核心业务指标:
    - 订单量、订单金额
    - 支付成功率
    - 用户活跃度
    - 转化率

  关键业务指标:
    - 注册用户数
    - 日活/月活
    - GMV(成交总额)
    - ARPU(单用户价值)

  应用性能指标:
    - 接口响应时间(P50/P90/P99)
    - 接口错误率
    - 数据库连接池使用率
    - 缓存命中率

7.2 告警规则设计原则

告警原则:
1. 告警要有意义,避免告警风暴
2. 告警要有可操作性,明确处理步骤
3. 告警要分级,区分严重程度
4. 告警要收敛,避免重复告警
5. 告警要有恢复机制

告警级别定义

告警级别:
  Critical(严重):
    - 服务完全不可用
    - 数据丢失风险
    - 安全漏洞
    处理时间: 立即处理

  Warning(警告):
    - 服务性能下降
    - 资源使用率过高
    - 错误率上升
    处理时间: 4小时内

  Info(信息):
    - 正常运维事件
    - 配置变更通知
    处理时间: 24小时内
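
该分级可直接映射到 Alertmanager 的路由策略上,例如按 severity 区分通知频率(示意片段,接收者名称为假设值):

# 按告警级别区分重复通知间隔(示意)
route:
  routes:
    - match:
        severity: critical
      receiver: 'oncall'        # 假设的值班接收者
      repeat_interval: 30m      # 严重告警高频重复提醒
    - match:
        severity: warning
      receiver: 'ops-team'
      repeat_interval: 4h       # 与"4小时内处理"的要求对应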

7.3 监控数据存储策略

存储周期规划

存储策略:
  原始数据(Raw):
    - 存储周期: 7天
    - 采集频率: 15秒
    - 用途: 详细故障分析

  5分钟数据(5m):
    - 存储周期: 30天
    - 采集频率: 5分钟
    - 用途: 趋势分析

  1小时数据(1h):
    - 存储周期: 90天
    - 采集频率: 1小时
    - 用途: 长期趋势
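
需要注意,Prometheus 本身不做自动降采样,上述分层通常借助 Recording Rules 预聚合(或交由 Thanos Compactor 完成),以下是一个预聚合规则的示意(规则名为假设值):

# downsample-rules.yml - 预聚合规则示意
groups:
  - name: downsample_5m
    interval: 5m
    rules:
      # 按主机预聚合CPU使用率,供长周期查询使用
      - record: instance:node_cpu_utilization:avg5m
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      # 按主机预聚合内存使用率
      - record: instance:node_memory_utilization:ratio
        expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)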

八、监控系统落地案例

8.1 电商系统监控案例

电商系统监控架构:
  应用层监控:
    - Spring Boot Actuator /actuator/metrics
    - 自定义业务指标
    - 接口耗时监控

  中间件监控:
    - Redis: redis_exporter
    - MySQL: mysqld_exporter
    - Kafka: kafka_exporter
    - RocketMQ: rocketmq_exporter

  基础设施监控:
    - Node Exporter: CPU/内存/磁盘
    - 网络流量监控(Node Exporter 的 network 指标)
    - JMX Exporter: JVM指标

  业务监控:
    - 订单量、支付量
    - 库存变化
    - 优惠券使用情况
    - 秒杀活动指标

8.2 微服务监控案例

微服务监控方案:
  服务发现:
    - Consul服务注册
    - Kubernetes Service Discovery
    - 自动发现Service

  分布式追踪:
    - Jaeger: 请求链路追踪
    - Zipkin: Span追踪
    - SkyWalking: APM监控

  日志聚合:
    - ELK Stack
    - Loki + Grafana
    - Fluentd日志收集

  指标监控:
    - Prometheus采集
    - Grafana可视化
    - Alertmanager告警

九、总结

企业级监控系统是现代IT架构的重要组成部分,本文深入探讨了:

核心要点

  1. 双平台选择:Zabbix用于传统基础设施,Prometheus用于云原生应用
  2. 多维度监控:从系统层到业务层全覆盖
  3. 高可用架构:多节点部署、数据冗余、故障转移
  4. 智能化运维:预测性告警、自动化响应

技术栈

  • 监控引擎:Prometheus + Zabbix
  • 数据采集:Node Exporter + 各种Exporter
  • 可视化:Grafana + Zabbix Dashboard
  • 告警:Alertmanager + Zabbix Action
  • 存储:TSDB + MySQL

实践建议

  1. 从小规模开始,逐步扩展
  2. 建立完善的监控指标体系
  3. 制定清晰的告警策略
  4. 定期回顾和优化监控方案

通过合理的监控架构设计,企业可以实现对系统的全面掌控,提升系统稳定性和服务质量。