Service Downtime Monitoring Architecture in Practice: Multi-Layer Detection, Automatic Recovery, and Intelligent Alerting

1. Overview of Service Downtime Monitoring

1.1 The Impact of Service Downtime

Service downtime can have serious consequences:

  • Business interruption: users cannot access the service, leading to lost business
  • Data loss: data that was being processed may be lost
  • Cascading failures: an outage can spread to dependent services
  • Reputation damage: user trust and the company's image suffer
  • SLA breaches: violating service-level agreements may create compensation liability

1.2 Types of Service Downtime

Hardware failures

  • Server crashes
  • Network equipment failures
  • Disk failures
  • Memory failures

Software failures

  • Application crashes
  • Memory leaks leading to OOM
  • Deadlocks blocking the service
  • Database connection problems

Resource exhaustion

  • CPU usage at 100%
  • Memory exhausted
  • Disk space full
  • Too many network connections

External dependency failures

  • Database connection failures
  • Third-party APIs unavailable
  • Message queue backlogs
  • Cache service anomalies

2. Multi-Layer Monitoring Architecture Design

2.1 Monitoring Layers

Layer 1: Infrastructure monitoring

Targets:
- Server liveness
- CPU / memory / disk status
- Network connectivity
- Process liveness

Tools:
- Node Exporter
- Process Exporter
- SNMP monitoring
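
As a rough illustration of what this layer collects, the sketch below gathers the same host-level signals locally; it assumes the psutil package is installed and uses a placeholder process name, neither of which comes from the monitoring stack described later in this article.

# infra_check.py - a minimal sketch of layer-1 signals (requires psutil; process name is a placeholder)
import psutil

def infrastructure_snapshot(process_name: str = "java") -> dict:
    """Collect basic host-level signals: CPU, memory, disk, and process liveness."""
    process_alive = any(
        p.info["name"] == process_name
        for p in psutil.process_iter(attrs=["name"])
    )
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
        "process_alive": process_alive,
    }

if __name__ == "__main__":
    print(infrastructure_snapshot())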

Layer 2: Application service monitoring

Targets:
- Service port listening
- HTTP health check endpoint
- Process state
- Log anomalies

Tools:
- Blackbox Exporter
- Custom health checks
- Spring Boot Actuator health
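
A minimal sketch of layer-2 checks follows, combining a TCP port probe with an HTTP health-endpoint probe; the host, port, and health path are placeholders rather than values taken from the configuration later in this article.

# app_check.py - layer-2 liveness sketch: TCP port + HTTP health endpoint
import socket

import requests

def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def http_healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        return requests.get(url, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False

if __name__ == "__main__":
    print("port:", tcp_port_open("localhost", 8080))
    print("health:", http_healthy("http://localhost:8080/actuator/health"))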

Layer 3: Business logic monitoring

Targets:
- API availability
- Business process health
- Business metric anomalies
- Data consistency

Tools:
- Custom liveness probe scripts
- Business metric monitoring
- Distributed tracing

Layer 4: User experience monitoring

Targets:
- User request success rate
- Page load time
- Error rate
- Real user experience

Tools:
- RUM (Real User Monitoring)
- APM tools
- Log analysis

2.2 Black-Box Monitoring Design

Blackbox Exporter configuration

# blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      method: GET
      follow_redirects: true
      preferred_ip_protocol: "ip4"
      ip_protocol_fallback: false

  http_post_2xx:
    prober: http
    timeout: 5s
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{"action":"health_check"}'
      valid_status_codes: [200]

  tcp_connect:
    prober: tcp
    timeout: 5s

  icmp:
    prober: icmp
    timeout: 5s

  grpc_healthcheck:
    prober: grpc
    timeout: 5s
    grpc:
      service: "service.HealthCheck"
      tls: false
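
Each module can be exercised against the exporter's /probe endpoint before it is wired into Prometheus. The sketch below does that with a plain HTTP request; the exporter address and the target URL are assumptions.

# probe_test.py - manually exercise a Blackbox Exporter module via its /probe endpoint
import requests

EXPORTER = "http://blackbox-exporter:9115"   # assumed exporter address

def run_probe(module: str, target: str) -> bool:
    """Call /probe and report whether probe_success was 1."""
    resp = requests.get(
        f"{EXPORTER}/probe",
        params={"module": module, "target": target},
        timeout=10,
    )
    resp.raise_for_status()
    # The response body is Prometheus text exposition; look for the probe_success sample
    for line in resp.text.splitlines():
        if line.startswith("probe_success"):
            return line.split()[-1] == "1"
    return False

if __name__ == "__main__":
    print(run_probe("http_2xx", "http://api.example.com/health"))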

Prometheus scrape configuration

# prometheus.yml
scrape_configs:
  # Black-box HTTP probing
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://api.example.com/health
        labels:
          service: user-service
          env: production

      - targets:
          - http://api.example.com/api/health
        labels:
          service: order-service
          env: production

      - targets:
          - http://www.example.com
        labels:
          service: web-frontend
          env: production

    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

  # TCP port probing
  - job_name: 'blackbox-tcp'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - mysql.example.com:3306
          - redis.example.com:6379
          - kafka.example.com:9092
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

  # ICMP ping probing
  - job_name: 'blackbox-icmp'
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
          - 192.168.1.10
          - 192.168.1.11
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

3. Heartbeat Detection

3.1 How Heartbeat Detection Works

Heartbeat detection is an active liveness mechanism: each service periodically sends a heartbeat signal to the monitoring center.

Heartbeat-based monitoring architecture

Heartbeat architecture:
  Heartbeat senders:
    - Application services send heartbeats
    - Heartbeat interval: 30 seconds
    - Heartbeat payload: service ID, timestamp, status information

  Heartbeat receivers:
    - Redis for heartbeat storage
    - Kafka message queue
    - Database records

  Heartbeat processors:
    - Detect heartbeat timeouts
    - Determine service state
    - Trigger alerting
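
The sender side boils down to writing a key with a TTL somewhat longer than the heartbeat interval. The short Python sketch below illustrates that idea with the same heartbeat:<service>:<instance> key scheme used by the Spring Boot implementation in the next section; the Redis address and the example service/instance names are assumptions.

# heartbeat_sender_sketch.py - minimal sender-side sketch (Redis address is an assumption)
import json
import time

import redis

r = redis.Redis(host="localhost", port=6379)

INTERVAL = 30          # seconds between heartbeats
TTL = 4 * INTERVAL     # key survives a few missed heartbeats so the detector can flag it

def send_heartbeat(service: str, instance: str) -> None:
    payload = {
        "serviceName": service,
        "instanceId": instance,
        "status": "UP",
        "timestamp": int(time.time() * 1000),
    }
    # Same key scheme as the Java sender in section 3.2: heartbeat:<service>:<instance>
    r.set(f"heartbeat:{service}:{instance}", json.dumps(payload), ex=TTL)

if __name__ == "__main__":
    while True:
        send_heartbeat("user-service", "10.0.0.1:8080")
        time.sleep(INTERVAL)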

3.2 Heartbeat Implementation

Spring Boot heartbeat sender

@Component
@Slf4j
public class HeartbeatSender {

    @Autowired
    private RedisTemplate<String, String> redisTemplate;

    @Autowired
    private ApplicationContext applicationContext;

    @Autowired
    private Environment environment;

    @Value("${heartbeat.interval:30000}")
    private long heartbeatInterval;

    @Value("${spring.application.name}")
    private String applicationName;

    @PostConstruct
    public void startHeartbeat() {
        Thread heartbeatThread = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    sendHeartbeat();
                    Thread.sleep(heartbeatInterval);
                } catch (InterruptedException e) {
                    log.info("Heartbeat thread interrupted");
                    Thread.currentThread().interrupt();
                    break;
                } catch (Exception e) {
                    log.error("Failed to send heartbeat", e);
                }
            }
        });
        heartbeatThread.setDaemon(true);
        heartbeatThread.setName("heartbeat-sender");
        heartbeatThread.start();
        log.info("Heartbeat sender started, interval: {}ms", heartbeatInterval);
    }

    private void sendHeartbeat() {
        try {
            String instanceId = getInstanceId();
            HeartbeatInfo heartbeat = HeartbeatInfo.builder()
                    .serviceName(applicationName)
                    .instanceId(instanceId)
                    .status("UP")
                    .timestamp(System.currentTimeMillis())
                    .hostInfo(getHostInfo())
                    .healthInfo(getHealthInfo())
                    .build();

            // Store in Redis; the TTL must outlive the detector's timeout (90s by default)
            // so that stale entries can still be inspected by HeartbeatDetector below
            String key = String.format("heartbeat:%s:%s", applicationName, instanceId);
            String value = JSON.toJSONString(heartbeat);

            redisTemplate.opsForValue().set(key, value, 120, TimeUnit.SECONDS);

            // Optionally also publish to Kafka
            // kafkaTemplate.send("heartbeat-topic", instanceId, value);

            log.debug("Heartbeat sent: instanceId={}", instanceId);
        } catch (Exception e) {
            log.error("Error while sending heartbeat", e);
        }
    }

    private String getInstanceId() {
        // Instance ID: host address + port (could also come from Spring Cloud or a generated UUID)
        String port = environment.getProperty("server.port", "8080");
        try {
            return InetAddress.getLocalHost().getHostAddress() + ":" + port;
        } catch (UnknownHostException e) {
            return "unknown-host:" + port;
        }
    }

    private HostInfo getHostInfo() throws UnknownHostException {
        RuntimeMXBean runtimeBean = ManagementFactory.getRuntimeMXBean();
        // getProcessCpuLoad() lives on the com.sun.management extension of the OS MXBean
        com.sun.management.OperatingSystemMXBean osBean =
                (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();

        return HostInfo.builder()
                .hostname(InetAddress.getLocalHost().getHostName())
                .uptime(runtimeBean.getUptime())
                .cpuLoad(osBean.getProcessCpuLoad())
                .memoryUsed(memoryBean.getHeapMemoryUsage().getUsed())
                .build();
    }

    private HealthInfo getHealthInfo() {
        // Reuse the aggregated Actuator health information
        HealthEndpoint healthEndpoint = applicationContext.getBean(HealthEndpoint.class);
        Health health = healthEndpoint.health();

        return HealthInfo.builder()
                .status(health.getStatus().getCode())
                .details(health.getDetails())
                .build();
    }
}

Heartbeat detector

@Component
@Slf4j
public class HeartbeatDetector {

    @Autowired
    private RedisTemplate<String, String> redisTemplate;

    @Autowired
    private AlertManager alertManager;

    @Value("${heartbeat.timeout:90000}")
    private long heartbeatTimeout;

    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);

    @PostConstruct
    public void startDetection() {
        // Check all heartbeats every 10 seconds
        scheduler.scheduleAtFixedRate(this::detectHeartbeat, 0, 10, TimeUnit.SECONDS);
        log.info("Heartbeat detector started");
    }

    private void detectHeartbeat() {
        try {
            // Note: the sender's Redis TTL must be longer than heartbeat.timeout,
            // otherwise stale entries expire before they can be flagged here
            Set<String> keys = redisTemplate.keys("heartbeat:*");
            if (keys == null || keys.isEmpty()) {
                return;
            }

            for (String key : keys) {
                String value = redisTemplate.opsForValue().get(key);
                if (value == null) {
                    continue;
                }

                HeartbeatInfo heartbeat = JSON.parseObject(value, HeartbeatInfo.class);
                long currentTime = System.currentTimeMillis();
                long timeDiff = currentTime - heartbeat.getTimestamp();

                // Has the heartbeat timed out?
                if (timeDiff > heartbeatTimeout) {
                    log.warn("Heartbeat timeout: service={}, instanceId={}, elapsed={}ms",
                            heartbeat.getServiceName(),
                            heartbeat.getInstanceId(),
                            timeDiff);

                    // Mark the service instance as down
                    markServiceDown(heartbeat);

                    // Send a downtime alert
                    sendDownAlert(heartbeat);
                }
            }
        } catch (Exception e) {
            log.error("Heartbeat detection failed", e);
        }
    }

    private void markServiceDown(HeartbeatInfo heartbeat) {
        String downKey = String.format("service:down:%s:%s",
                heartbeat.getServiceName(),
                heartbeat.getInstanceId());

        // Record the DOWN state for 10 minutes
        redisTemplate.opsForValue().set(downKey, "DOWN", 10, TimeUnit.MINUTES);
    }

    private void sendDownAlert(HeartbeatInfo heartbeat) {
        AlertInfo alert = AlertInfo.builder()
                .alertType("SERVICE_DOWN")
                .severity("CRITICAL")
                .serviceName(heartbeat.getServiceName())
                .instanceId(heartbeat.getInstanceId())
                .description(String.format("Service %s instance %s missed its heartbeat and is considered down",
                        heartbeat.getServiceName(),
                        heartbeat.getInstanceId()))
                .timestamp(System.currentTimeMillis())
                .metadata(heartbeat)
                .build();

        alertManager.sendAlert(alert);
    }
}

3.3 Heartbeat Best Practices

Heartbeat interval settings

Recommended heartbeat intervals:
  Critical services: 10-30 seconds
  - Payment service
  - Trading service
  - Core business services

  Standard services: 30-60 seconds
  - User service
  - Order service
  - API gateway

  Low-frequency services: 60-120 seconds
  - Reporting service
  - Data analytics service
  - Back-office/admin services

Heartbeat timeout settings

Timeout escalation rules:
- 3 consecutive missed heartbeats -> minor alert
- 5 consecutive missed heartbeats -> major alert
- 10 consecutive missed heartbeats -> critical alert + automatic recovery
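
A sketch of how this escalation could be counted on the detector side is shown below; the thresholds mirror the rules above, while the notify() function is a placeholder for whatever alerting channel is actually used.

# escalation.py - count consecutive missed heartbeats per instance (notify() is a placeholder)
from collections import defaultdict

THRESHOLDS = {3: "minor", 5: "major", 10: "critical"}  # consecutive misses -> severity

missed = defaultdict(int)

def notify(instance: str, severity: str) -> None:
    print(f"[{severity}] heartbeat missed for {instance}")

def record_check(instance: str, heartbeat_seen: bool) -> None:
    """Call once per detection cycle for every known instance."""
    if heartbeat_seen:
        missed[instance] = 0
        return
    missed[instance] += 1
    severity = THRESHOLDS.get(missed[instance])
    if severity:
        notify(instance, severity)
        # The "critical" level is also where automatic recovery (section 6) would be triggered.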

4. Health Check Endpoint Design

4.1 Spring Boot Health Checks

# application.yml
management:
  endpoint:
    health:
      show-details: always
      show-components: always
      probes:
        enabled: true
  endpoints:
    web:
      exposure:
        include: health,info,metrics
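
With probes enabled as above, the liveness and readiness groups are exposed under /actuator/health/liveness and /actuator/health/readiness in addition to the aggregated endpoint. The small sketch below polls them and prints the per-component status; the base URL is an assumption.

# actuator_poll.py - poll Spring Boot Actuator health groups (base URL is an assumption)
import requests

BASE = "http://localhost:8080/actuator/health"

def show(group: str = "") -> None:
    url = f"{BASE}/{group}" if group else BASE
    resp = requests.get(url, timeout=5)   # Actuator answers 503 when DOWN, so don't raise on status
    body = resp.json()
    print(group or "overall", "->", body.get("status"))
    # "components" is populated because show-components/show-details are set to always
    for name, component in body.get("components", {}).items():
        print(f"  {name}: {component.get('status')}")

if __name__ == "__main__":
    for group in ("", "liveness", "readiness"):
        show(group)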

Custom health indicator

@Component
public class ServiceHealthIndicator implements HealthIndicator {

    @Autowired
    private RedisTemplate<String, Object> redisTemplate;

    @Autowired
    private DataSource dataSource;

    @Autowired
    private Environment environment;

    @Override
    public Health health() {
        Health.Builder builder = new Health.Builder();

        // Check the Redis connection
        boolean redisHealthy = checkRedis();
        builder.withDetail("redis", redisHealthy ? "UP" : "DOWN");

        // Check the database connection
        boolean dbHealthy = checkDatabase();
        builder.withDetail("database", dbHealthy ? "UP" : "DOWN");

        // Check memory usage
        MemoryStatus memoryStatus = checkMemory();
        builder.withDetail("memory", memoryStatus);

        // Check disk space
        DiskStatus diskStatus = checkDisk();
        builder.withDetail("disk", diskStatus);

        // Aggregate result
        if (redisHealthy && dbHealthy &&
                memoryStatus.isHealthy() && diskStatus.isHealthy()) {
            return builder.up().build();
        } else {
            return builder.down().build();
        }
    }

    private boolean checkRedis() {
        try {
            redisTemplate.opsForValue().set("health:check", "ok", 10, TimeUnit.SECONDS);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    private boolean checkDatabase() {
        // try-with-resources so the connection is always returned to the pool
        try (Connection connection = dataSource.getConnection()) {
            return connection.isValid(2);
        } catch (SQLException e) {
            return false;
        }
    }

    private MemoryStatus checkMemory() {
        MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heapUsage = memoryBean.getHeapMemoryUsage();

        long max = heapUsage.getMax();
        long used = heapUsage.getUsed();
        double usagePercent = (double) used / max * 100;

        return MemoryStatus.builder()
                .used(used)
                .max(max)
                .usagePercent(usagePercent)
                .isHealthy(usagePercent < 90)
                .build();
    }

    private DiskStatus checkDisk() {
        try {
            Path path = Paths.get(environment.getProperty("java.io.tmpdir", "/tmp"));
            FileStore store = Files.getFileStore(path);

            long total = store.getTotalSpace();
            long usable = store.getUsableSpace();
            long used = total - usable;
            double usagePercent = (double) used / total * 100;

            return DiskStatus.builder()
                    .total(total)
                    .used(used)
                    .usable(usable)
                    .usagePercent(usagePercent)
                    .isHealthy(usagePercent < 85)
                    .build();
        } catch (IOException e) {
            // Treat an unreadable file store as unhealthy
            return DiskStatus.builder().isHealthy(false).build();
        }
    }
}

4.2 Kubernetes Health Checks

apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: user-service
  template:
    metadata:
      labels:
        app: user-service
    spec:
      containers:
        - name: user-service
          image: user-service:latest
          ports:
            - containerPort: 8080

          # Liveness probe
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 30
            timeoutSeconds: 5
            failureThreshold: 3
            successThreshold: 1

          # Readiness probe
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
            successThreshold: 1

          # Startup probe (allows up to 30 x 10s = 300s for the application to start)
          startupProbe:
            httpGet:
              path: /actuator/health
              port: 8080
            initialDelaySeconds: 0
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 30
            successThreshold: 1

5. Alert Rule Design

5.1 Prometheus Alert Rules

# service-down-alerts.yml
groups:
  - name: service_down_alerts
    interval: 30s
    rules:
      # Service unavailable
      - alert: ServiceDown
        expr: probe_success{job="blackbox-http"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service unavailable"
          description: "Service {{ $labels.instance }} is unavailable; health checks have been failing for more than 1 minute"
          runbook_url: "https://wiki.example.com/runbook/service-down"

      # Slow responses
      - alert: ServiceSlow
        expr: probe_http_duration_seconds{job="blackbox-http"} > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Service responding slowly"
          description: "Service {{ $labels.instance }} response time exceeds 3 seconds (current: {{ $value }}s)"

      # Unexpected HTTP status code
      - alert: ServiceHTTPError
        expr: probe_http_status_code{job="blackbox-http"} != 200
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service returned an error status code"
          description: "Service {{ $labels.instance }} returned HTTP status code {{ $value }}"

      # TCP connection failure
      - alert: TCPConnectionFailed
        expr: probe_success{job="blackbox-tcp"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "TCP connection failed"
          description: "TCP connection to {{ $labels.instance }} failed"

      # Heartbeat timeout
      - alert: HeartbeatTimeout
        expr: (time() - heartbeat_last_time{job="heartbeat"}) > 120
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service heartbeat timed out"
          description: "Service {{ $labels.service }} instance {{ $labels.instance }} heartbeat timed out"

      # Multiple instances down at the same time
      - alert: MultipleInstancesDown
        expr: count(probe_success{job="blackbox-http"} == 0) > 2
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Multiple service instances down"
          description: "{{ $value }} service instances are down at the same time; this may be a cluster-level failure"

      # Availability drop
      - alert: ServiceAvailabilityDrop
        expr: (avg_over_time(probe_success{job="blackbox-http"}[15m]) * 100) < 99
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Service availability dropped"
          description: "Service {{ $labels.instance }} 15-minute availability is below 99% (current: {{ $value }}%)"

5.2 Alert Routing

# alertmanager.yml
route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'severity']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 12h

  routes:
    # Service-down alerts: page the ops team immediately
    - match:
        alertname: ServiceDown
      receiver: 'ops-team'
      group_wait: 0s
      continue: false

    # Heartbeat timeouts: notify the dev team
    - match:
        alertname: HeartbeatTimeout
      receiver: 'dev-team'
      group_wait: 30s
      continue: true

    # Multiple instances down: treat as a critical, cluster-level alert
    - match:
        alertname: MultipleInstancesDown
      receiver: 'critical-alert'
      group_wait: 0s
      continue: true

inhibit_rules:
  # When multiple instances are down, suppress the per-instance alerts
  - source_match:
      alertname: 'MultipleInstancesDown'
    target_match:
      alertname: 'ServiceDown'
    equal: ['service', 'cluster']

receivers:
  # Fallback receiver referenced by the top-level route
  - name: 'default'
    email_configs:
      - to: 'monitoring@example.com'
        send_resolved: true

  - name: 'ops-team'
    email_configs:
      - to: 'ops-team@example.com'
        send_resolved: true
        headers:
          Subject: '[CRITICAL] Service down'
    webhook_configs:
      - url: 'https://oapi.dingtalk.com/robot/send?access_token=xxx'
        send_resolved: true

  - name: 'dev-team'
    email_configs:
      - to: 'dev-team@example.com'
        send_resolved: true

  - name: 'critical-alert'
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
        headers:
          Subject: '[P0] Service cluster down'
    # Alertmanager has no native SMS receiver; SMS (e.g. to 13800138000) is normally
    # delivered through a webhook-based gateway such as the one below
    webhook_configs:
      - url: 'https://oapi.dingtalk.com/robot/send?access_token=critical'

6. Automatic Failure Recovery

6.1 Recovery Strategies

Failure recovery strategies:
  Strategy 1: automatic restart
  - Detect that the service is down
  - Attempt to restart the service up to 3 times
  - Wait 30 seconds between attempts
  - Escalate the alert after 3 failed attempts

  Strategy 2: traffic failover (see the sketch after this list)
  - Shift traffic to standby instances
  - Mark the original instance as unavailable
  - Notify the operations team

  Strategy 3: elastic scale-out
  - Detect that the service is under excessive load
  - Automatically start new instances
  - Load-balance traffic onto the new instances
  - Remove the old instances
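
Strategies 1 and 3 are implemented by the script in section 6.2. As a rough sketch of strategy 2 in a Kubernetes environment, the snippet below relabels a failing pod so that it drops out of the Service's endpoints; it assumes the Service selector includes a serving=true label, which is not part of the manifests shown in this article.

# traffic_switch.py - sketch of strategy 2: take a failing pod out of rotation
# Assumes the Kubernetes Service selector includes serving=true (an assumption).
import subprocess

def remove_from_rotation(pod_name: str, namespace: str = "default") -> bool:
    """Relabel the pod so the Service selector no longer matches it."""
    result = subprocess.run(
        ["kubectl", "label", "pod", pod_name, "serving=false",
         "--overwrite", f"--namespace={namespace}"],
        capture_output=True, text=True, timeout=30,
    )
    return result.returncode == 0

if __name__ == "__main__":
    ok = remove_from_rotation("user-service-7d4f9c-abcde")   # pod name is a placeholder
    print("removed from rotation" if ok else "failed to relabel pod")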

6.2 Auto-Recovery Implementation

#!/usr/bin/env python
# auto_recovery.py - automatic failure recovery script

import os
import sys
import time
import subprocess
import logging

import requests

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)


class ServiceRecovery:
    def __init__(self):
        self.max_retry = 3
        self.retry_interval = 30
        self.health_check_url = os.getenv('HEALTH_CHECK_URL', 'http://localhost:8080/actuator/health')

    def check_service_health(self) -> bool:
        """Check the service health endpoint."""
        try:
            response = requests.get(self.health_check_url, timeout=5)
            return response.status_code == 200
        except Exception as e:
            logger.error(f"Health check failed: {e}")
            return False

    def restart_service(self) -> bool:
        """Restart the service via systemd."""
        try:
            result = subprocess.run(
                ['systemctl', 'restart', 'user-service'],
                capture_output=True,
                text=True,
                timeout=60
            )

            if result.returncode == 0:
                logger.info("Service restarted successfully")
                return True
            else:
                logger.error(f"Service restart failed: {result.stderr}")
                return False
        except Exception as e:
            logger.error(f"Error running restart command: {e}")
            return False

    def scale_up_service(self, service_name: str, replicas: int) -> bool:
        """Scale out the service on Kubernetes."""
        try:
            result = subprocess.run(
                ['kubectl', 'scale', 'deployment', service_name,
                 f'--replicas={replicas}', '--namespace=default'],
                capture_output=True,
                text=True,
                timeout=60
            )

            if result.returncode == 0:
                logger.info(f"Scale-out succeeded: {service_name} -> {replicas}")
                return True
            else:
                logger.error(f"Scale-out failed: {result.stderr}")
                return False
        except Exception as e:
            logger.error(f"Error running scale command: {e}")
            return False

    def trigger_alert(self, message: str):
        """Send an alert to humans."""
        # DingTalk webhook
        webhook_url = os.getenv('DINGTALK_WEBHOOK')
        if webhook_url:
            payload = {
                "msgtype": "text",
                "text": {
                    "content": f"[Auto-recovery failed]\n{message}"
                }
            }
            requests.post(webhook_url, json=payload)

        # Optionally also notify by email
        # email.send(to='ops@example.com', subject='Auto-recovery failed', body=message)

    def auto_recover(self, service_name: str = None) -> bool:
        """Run the automatic recovery workflow."""
        logger.info("Starting automatic recovery workflow")

        # Check whether the service is healthy
        if self.check_service_health():
            logger.info("Service is healthy; nothing to recover")
            return True

        logger.warning("Service unhealthy; attempting recovery")

        # Try restarting the service
        for attempt in range(1, self.max_retry + 1):
            logger.info(f"Restart attempt {attempt}")

            if self.restart_service():
                # Give the service time to start
                time.sleep(10)

                # Re-check health
                if self.check_service_health():
                    logger.info("Service recovered")
                    return True

            # Wait before retrying
            if attempt < self.max_retry:
                logger.info(f"Waiting {self.retry_interval}s before retrying")
                time.sleep(self.retry_interval)

        # Restarts failed; try scaling out
        if service_name:
            logger.info("Restart failed; attempting to scale out")
            if self.scale_up_service(service_name, 2):
                logger.info("Scale-out succeeded; service should recover")
                return True

        # All recovery strategies failed
        logger.error("Automatic recovery failed; escalating to humans")
        self.trigger_alert(f"Automatic recovery of service {service_name} failed; manual intervention required")
        return False


if __name__ == '__main__':
    recovery = ServiceRecovery()
    service_name = sys.argv[1] if len(sys.argv) > 1 else 'user-service'
    success = recovery.auto_recover(service_name)
    sys.exit(0 if success else 1)

6.3 Integrating Auto-Recovery

Alertmanager receiver for auto-recovery

# alertmanager.yml
receivers:
  - name: 'ops-team'
    # ... other settings ...

  - name: 'auto-recover'
    # Auto-recovery receiver
    webhook_configs:
      - url: 'http://auto-recovery-service:8080/recover'
        send_resolved: false
        http_config:
          basic_auth:
            username: 'recovery'
            password: 'password'

Alertmanager routing for auto-recovery

route:
  routes:
    # Try automatic recovery first
    - match:
        alertname: ServiceDown
      receiver: 'auto-recover'
      group_wait: 0s
      continue: true

    # Escalate if automatic recovery fails
    - match:
        alertname: AutoRecoveryFailed
      receiver: 'ops-team-critical'
      group_wait: 0s
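
The receiving side of the auto-recover webhook is not shown above. A minimal sketch of what it might look like follows, reusing the ServiceRecovery class from section 6.2; the Flask dependency, the import path, and the omission of the basic-auth check from the Alertmanager config are all assumptions made for brevity.

# recover_webhook.py - minimal sketch of the /recover endpoint called by Alertmanager
# (Flask assumed; basic-auth verification omitted for brevity)
from flask import Flask, jsonify, request

from auto_recovery import ServiceRecovery  # the script from section 6.2, assumed importable

app = Flask(__name__)
recovery = ServiceRecovery()

@app.route("/recover", methods=["POST"])
def recover():
    # Alertmanager posts a JSON payload with an "alerts" list; each alert carries
    # its labels, including the "service" label attached in prometheus.yml.
    payload = request.get_json(force=True)
    results = {}
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue
        service = alert.get("labels", {}).get("service", "unknown")
        results[service] = recovery.auto_recover(service)
    return jsonify(results)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)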

7. Best Practices for Service Liveness Monitoring

7.1 Metric Design

Metric priorities:
  P0 metrics (critical):
  - Service liveness (0/1)
  - Service response time
  - Service error rate
  - Heartbeat timeout state

  P1 metrics (important):
  - CPU / memory utilization
  - Request QPS
  - Database connection count
  - Cache hit ratio

  P2 metrics (informational):
  - Log error count
  - Slow query count
  - Business metrics
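
The P0 liveness and availability figures can be pulled directly from Prometheus. The sketch below runs instant queries against the HTTP API for the probe_success series produced by the blackbox jobs configured earlier; the Prometheus address is an assumption.

# availability.py - query Prometheus for liveness and 15-minute availability of probed targets
import requests

PROMETHEUS = "http://prometheus:9090"   # assumed Prometheus address

def query(expr: str) -> list:
    """Run an instant query against the Prometheus HTTP API."""
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": expr}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

if __name__ == "__main__":
    # Current liveness (P0): 1 = up, 0 = down
    for sample in query('probe_success{job="blackbox-http"}'):
        print(sample["metric"]["instance"], "up" if sample["value"][1] == "1" else "DOWN")

    # 15-minute availability, matching the ServiceAvailabilityDrop alert rule
    for sample in query('avg_over_time(probe_success{job="blackbox-http"}[15m]) * 100'):
        print(sample["metric"]["instance"], f"{float(sample['value'][1]):.2f}%")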

7.2 Incident Severity Levels

Incident severity levels:
  Level 1 (P0 - critical):
  - Service completely unavailable
  - Risk of data loss
  Response time: immediately
  Notification: phone call + SMS + DingTalk

  Level 2 (P1 - major):
  - Service performance degraded by 50%
  - Some features unavailable
  Response time: within 15 minutes
  Notification: DingTalk + email

  Level 3 (P2 - minor):
  - Service performance degraded by 10%
  - Occasional errors
  Response time: within 1 hour
  Notification: email

7.3 Monitoring Dashboards

Grafana service monitoring dashboard

{
  "dashboard": {
    "title": "Service Liveness Dashboard",
    "panels": [
      {
        "title": "Service Liveness Status",
        "targets": [
          {
            "expr": "probe_success{job=\"blackbox-http\"}",
            "legendFormat": "{{instance}}"
          }
        ],
        "type": "stat",
        "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
        "thresholds": [
          {"value": 0, "color": "red"},
          {"value": 1, "color": "green"}
        ]
      },
      {
        "title": "Service Availability",
        "targets": [
          {
            "expr": "avg_over_time(probe_success{job=\"blackbox-http\"}[1h]) * 100",
            "legendFormat": "{{instance}}"
          }
        ],
        "type": "graph",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 4}
      },
      {
        "title": "Service Response Time",
        "targets": [
          {
            "expr": "probe_http_duration_seconds{job=\"blackbox-http\"}",
            "legendFormat": "{{instance}}"
          }
        ],
        "type": "graph",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 4}
      },
      {
        "title": "Firing Service Alerts",
        "targets": [
          {
            "expr": "ALERTS{job=\"blackbox-http\",alertstate=\"firing\"}",
            "legendFormat": "{{alertname}} - {{instance}}"
          }
        ],
        "type": "table",
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": 12}
      }
    ]
  }
}

8. High-Availability Monitoring Case Studies

8.1 E-commerce System Monitoring

E-commerce monitoring architecture:
  Services:
  - Web frontend (3 instances)
  - API gateway (2 instances)
  - User service (3 instances)
  - Order service (5 instances)
  - Payment service (5 instances)
  - Inventory service (3 instances)

  Monitoring approach:
  - Blackbox probes: all HTTP services
  - Heartbeat detection: all microservices
  - Health checks: /actuator/health
  - Auto-recovery: restart + scale-out

  Alerting policy:
  - P0: payment/order service down (immediate phone call)
  - P1: user/inventory service down (handle within 15 minutes)
  - P2: other service anomalies (handle within 1 hour)

8.2 Microservice Monitoring Practices

Microservice monitoring best practices:
1. Monitor at multiple layers:
   - Infrastructure monitoring (server level)
   - Service monitoring (application level)
   - Business monitoring (feature level)

2. Detect along multiple dimensions:
   - Black-box monitoring (external view)
   - White-box monitoring (internal metrics)
   - Heartbeat detection (active reporting)

3. Respond quickly:
   - Notify on alerts immediately
   - Recover from failures automatically
   - Keep humans available for escalation

4. Improve continuously:
   - Review alerts regularly
   - Tune alert thresholds
   - Refine recovery strategies

9. Summary

Service downtime monitoring is essential for keeping systems running reliably. This article covered:

Key points

  1. Multi-layer monitoring: full coverage from infrastructure up to business logic
  2. Multi-dimensional detection: black-box probes, heartbeat detection, and health checks combined
  3. Intelligent alerting: severity levels, alert routing, and alert inhibition
  4. Automatic recovery: automatic restarts, traffic failover, and elastic scale-out

Technology stack

  • Black-box monitoring: Blackbox Exporter
  • Heartbeat detection: Redis + Kafka
  • Health checks: Spring Boot Actuator
  • Alert management: Prometheus + Alertmanager
  • Automatic recovery: Python scripts + Kubernetes

Recommendations

  1. Choose sensible probe frequencies: avoid checking so often that monitoring itself hurts performance
  2. Build reliable notification channels: make sure critical alerts actually reach people
  3. Define clear recovery procedures: automate first, with humans as the backstop
  4. Rehearse failure scenarios regularly: keep the team's incident response sharp

With thorough monitoring and fast response mechanisms in place, an organization can minimize the damage caused by service downtime and improve both system reliability and user experience.