Episode 320: Server Troubleshooting Architecture in Practice: Diagnosing Inaccessibility, Layered Troubleshooting, and System-Level Solutions
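Server Troubleshooting Architecture in Practice: Diagnosing Inaccessibility and Layered Troubleshooting

1. Overview of Server Inaccessibility

1.1 Common Inaccessibility Scenarios

A server can become unreachable for many reasons. They fall into the following categories: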
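```text
Server inaccessibility scenarios:
  Completely unreachable:
    - Server is down
    - Network outage
    - Firewall blocks all traffic

  SSH connection fails:
    - SSH service not started
    - Port has been changed
    - Incorrect firewall rules

  Web service unreachable:
    - Application service not started
    - Port not listening
    - Reverse proxy misconfigured

  API calls fail:
    - Service process crashed
    - Database connection failed
    - Dependent service unavailable
```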
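1.2 Troubleshooting Workflow

Troubleshooting principles:

Work layer by layer, from the bottom up
Confirm network connectivity first
Then check service status
Finally, analyze the application logs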
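The bottom-up order can be sketched as a small helper that runs one check per layer and stops at the first failure. This is only an illustration: the function name `first_failing_layer` and the `label=command` argument format are assumptions made for this sketch, not something the original workflow prescribes.

```bash
#!/bin/bash
# Run checks from the lowest layer upward and report the first one that fails.
# Each argument has the form "label=command" (a hypothetical convention).
first_failing_layer() {
    for entry in "$@"; do
        label=${entry%%=*}   # text before the first "="
        cmd=${entry#*=}      # text after the first "="
        if ! sh -c "$cmd" > /dev/null 2>&1; then
            echo "$label"
            return 1
        fi
    done
    echo "none"
}

# Example call (addresses and service names are placeholders):
#   first_failing_layer \
#       "network=ping -c 2 -W 1 192.168.1.100" \
#       "service=systemctl is-active --quiet ssh" \
#       "port=ss -tln | grep -q ':22 '"
```

Stopping at the first failure keeps the output focused on the lowest broken layer, which is usually where the investigation should start.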
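2. Network Layer Troubleshooting

2.1 Connectivity Tests

Ping tests

```bash
# Test the local loopback (verifies the local TCP/IP stack)
ping -c 4 127.0.0.1

# Test the server's own address
ping -c 4 192.168.1.100

# Test the gateway
ping -c 4 192.168.1.1

# Test an external address
ping -c 4 8.8.8.8

# Send 100 packets to measure packet loss
ping -c 100 192.168.1.100
```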
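Interpreting ping results

```text
Ping result analysis:
  Request timed out:
    Cause:  network unreachable, firewall blocking, server down
    Action: check the network configuration, check the firewall

  High packet loss:
    Cause:  network congestion, poor link quality
    Action: check network quality, contact the ISP

  High latency:
    Cause:  network congestion, long link distance
    Action: optimize the network path, relocate the server
```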
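The three outcomes above can be told apart mechanically from the summary lines that `ping` prints. The following is a minimal sketch: the function name `classify_ping` and the default thresholds (5% loss, 100 ms average round trip) are assumptions chosen for illustration and should be tuned to the network in question.

```bash
#!/bin/bash
# Classify ping summary output read from stdin.
# $1 = maximum acceptable loss in percent (assumed default: 5)
# $2 = maximum acceptable average RTT in ms (assumed default: 100)
classify_ping() {
    awk -v max_loss="${1:-5}" -v max_rtt="${2:-100}" '
        /packet loss/ {
            # The loss figure is the field that ends in "%"
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) { loss = $i; sub(/%/, "", loss) }
        }
        /min\/avg\/max/ {
            # e.g. "rtt min/avg/max/mdev = 0.045/0.052/0.061/0.006 ms"
            split($0, halves, " = ")
            split(halves[2], rtt, "/")
            avg = rtt[2]
        }
        END {
            if (loss == "")               print "unparsed"
            else if (loss + 0 >= 100)     print "timeout"
            else if (loss + 0 > max_loss) print "high-loss"
            else if (avg + 0 > max_rtt)   print "high-latency"
            else                          print "ok"
        }'
}

# Example call:
#   ping -c 100 192.168.1.100 | classify_ping
```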
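2.2 Network Interface Checks

Interface status

```bash
# List all network interfaces
ip addr show
ifconfig              # legacy tool; requires net-tools

# Show a specific interface
ip addr show eth0

# Show interface statistics (errors, dropped packets)
ip -s link show eth0

# Show the routing table
ip route show
route -n              # legacy tool; requires net-tools

# Show the ARP / neighbor table
arp -a
ip neigh show
```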
Network interface diagnostic script

```bash
#!/bin/bash
# Network interface diagnostics

echo "=== Network interface status ==="
ip addr show

echo ""
echo "=== Routing table ==="
ip route show

echo ""
echo "=== ARP table ==="
ip neigh show

echo ""
echo "=== Socket statistics ==="
ss -s

echo ""
echo "=== Connectivity test ==="
# Loopback, gateway, and an external address; adjust to your network
for ip in 127.0.0.1 192.168.1.1 8.8.8.8; do
    if ping -c 2 -W 1 "$ip" > /dev/null 2>&1; then
        echo "✓ $ip is reachable"
    else
        echo "✗ $ip is not reachable"
    fi
done
```
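The connectivity loop above uses a fixed gateway address. As a possible refinement, the gateway can be read from the routing table instead. The helper below is a sketch (the name `default_gateway` is made up here) that extracts the next hop of the first `default` route from `ip route show` output passed on stdin.

```bash
#!/bin/bash
# Print the next-hop address of the first default route found on stdin.
# Prints nothing if there is no "default via <address>" entry.
default_gateway() {
    awk '$1 == "default" {
        for (i = 2; i < NF; i++)
            if ($i == "via") { print $(i + 1); exit }
    }'
}

# Example call:
#   gw=$(ip route show | default_gateway)
#   [ -n "$gw" ] && ping -c 2 -W 1 "$gw"
```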
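3. Service Layer Troubleshooting

3.1 SSH Service Troubleshooting

Being unable to connect over SSH is one of the most common server access problems.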
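SSH service check steps

```bash
# 1. Check the SSH service status
systemctl status ssh        # Debian/Ubuntu
systemctl status sshd       # RHEL/CentOS

# 2. Check that the SSH port is listening
netstat -tuln | grep :22
ss -tuln | grep :22

# 3. Check the SSH configuration
grep -E "^(Port|PermitRootLogin)" /etc/ssh/sshd_config

# 4. Follow the SSH logs
tail -f /var/log/auth.log
journalctl -u ssh -f

# 5. Test a local SSH connection
ssh localhost

# 6. Check the firewall rules
sudo iptables -L -n
sudo firewall-cmd --list-all

# 7. Check the SELinux mode (where SELinux is installed)
getenforce
```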
SSH troubleshooting script

```bash
#!/bin/bash
# SSH troubleshooting script

echo "=== SSH Service Troubleshooting ==="
echo ""

# 1. Service status (the unit is "ssh" on Debian/Ubuntu, "sshd" on RHEL/CentOS)
echo "1. SSH service status:"
if systemctl is-active --quiet ssh || systemctl is-active --quiet sshd; then
    echo "  ✓ SSH service is running"
else
    echo "  ✗ SSH service is not running"
    systemctl status ssh
fi
echo ""

# 2. Listening port (the trailing space avoids matching :220, :2222, ...)
echo "2. SSH port listening:"
if netstat -tuln | grep -q ':22 '; then
    echo "  ✓ SSH port 22 is listening"
    netstat -tuln | grep ':22 '
else
    echo "  ✗ SSH port is not listening"
    echo "  Check the Port setting in the SSH configuration"
fi
echo ""

# 3. Firewall
echo "3. Firewall check:"
if command -v firewall-cmd &> /dev/null; then
    if firewall-cmd --query-service=ssh > /dev/null 2>&1; then
        echo "  ✓ SSH service is allowed by the firewall"
    else
        echo "  ✗ SSH service is not allowed by the firewall"
        echo "  Run: sudo firewall-cmd --add-service=ssh --permanent"
    fi
elif command -v iptables &> /dev/null; then
    if iptables -L -n | grep -qE 'ACCEPT.*dpt:22( |$)'; then
        echo "  ✓ SSH port 22 is accepted by iptables"
    else
        echo "  ✗ SSH port 22 may be blocked by iptables"
    fi
fi
echo ""

# 4. SSH configuration
echo "4. SSH configuration check:"
SSH_PORT=$(grep "^Port" /etc/ssh/sshd_config | awk '{print $2}')
SSH_PORT=${SSH_PORT:-22}
echo "  SSH port: $SSH_PORT"

PERMIT_ROOT=$(grep "^PermitRootLogin" /etc/ssh/sshd_config)
echo "  Root login setting: $PERMIT_ROOT"
echo ""

# 5. Recent log entries (Debian/Ubuntu log path)
echo "5. Recent SSH connection logs:"
tail -20 /var/log/auth.log 2>/dev/null | grep -i ssh || echo "  No log entries, or the log file does not exist"
echo ""

# Suggested fixes
echo "=== Recommended Solutions ==="
echo "If SSH cannot connect:"
echo "1. Check network connectivity: ping <server IP>"
echo "2. Check the SSH service: systemctl restart ssh"
echo "3. Check the firewall: sudo firewall-cmd --reload"
echo "4. Check the SSH configuration: sudo vi /etc/ssh/sshd_config"
echo "5. Review the logs: sudo journalctl -u ssh -n 50"
```
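The script reads the port with a plain `grep "^Port"`. A slightly more tolerant reader is sketched below; it accepts leading whitespace and any keyword capitalization (sshd_config keywords are case-insensitive) and falls back to 22. The name `sshd_port` is an assumption of this sketch. It reports only the first `Port` line and does not follow `Include` files, so on a live server `sshd -T` remains the authoritative way to see the effective settings.

```bash
#!/bin/bash
# Print the first Port value in an sshd_config-style file, or 22 if none is set.
# $1 = path to the config file (assumed default: /etc/ssh/sshd_config)
sshd_port() {
    port=$(awk 'tolower($1) == "port" { print $2; exit }' \
        "${1:-/etc/ssh/sshd_config}" 2>/dev/null)
    echo "${port:-22}"
}

# Example call:
#   ss -tln | grep -q ":$(sshd_port) " && echo "sshd port is listening"
```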
3.2 Web Service Troubleshooting

Web service check steps

```bash
# 1. Check the web server status
systemctl status nginx
systemctl status apache2    # Debian/Ubuntu
systemctl status httpd      # RHEL/CentOS

# 2. Check that the ports are listening
netstat -tuln | grep -E ":(80|443)"
ss -tuln | grep -E ":(80|443)"

# 3. Test access locally
curl -I http://localhost
curl -I https://localhost

# 4. Check the firewall
sudo iptables -L -n | grep -E ":(80|443)"
sudo firewall-cmd --list-ports

# 5. Follow the error logs
tail -f /var/log/nginx/error.log
tail -f /var/log/apache2/error.log

# 6. Check the processes
ps aux | grep nginx
ps aux | grep apache
```
Web service troubleshooting script

```bash
#!/bin/bash
# Web service troubleshooting script
# Usage: $0 [port]   (defaults to 80)

SERVICE_PORT=${1:-80}
SERVICE_NAME="nginx"

echo "=== Web Service Troubleshooting ==="
echo ""

# 1. Service status
echo "1. Service status:"
if systemctl is-active --quiet "$SERVICE_NAME"; then
    echo "  ✓ $SERVICE_NAME is running"
else
    echo "  ✗ $SERVICE_NAME is not running"
    echo "  Start it with: systemctl start $SERVICE_NAME"
fi
echo ""

# 2. Listening port (the trailing space keeps :80 from matching :8080)
echo "2. Port listening status:"
if ss -tuln | grep -q ":$SERVICE_PORT "; then
    echo "  ✓ Port $SERVICE_PORT is listening"
    ss -tuln | grep ":$SERVICE_PORT "
else
    echo "  ✗ Port $SERVICE_PORT is not listening"
fi
echo ""

# 3. HTTP response
echo "3. HTTP response test:"
if curl -s -o /dev/null -w "%{http_code}" "http://localhost:$SERVICE_PORT" | grep -qE '^(200|301|302)$'; then
    echo "  ✓ Web service responds normally"
    curl -I "http://localhost:$SERVICE_PORT" 2>/dev/null | head -3
else
    echo "  ✗ Web service does not respond"
fi
echo ""

# 4. Firewall
echo "4. Firewall rules:"
if command -v firewall-cmd &> /dev/null; then
    if firewall-cmd --query-port="$SERVICE_PORT/tcp" > /dev/null 2>&1; then
        echo "  ✓ Port is open in the firewall"
    else
        echo "  ✗ Port is not open in the firewall"
        echo "  Add a rule: sudo firewall-cmd --add-port=$SERVICE_PORT/tcp --permanent"
    fi
elif command -v iptables &> /dev/null; then
    if iptables -L -n | grep -qE "dpt:$SERVICE_PORT( |\$)"; then
        echo "  ✓ Port has an iptables rule"
    else
        echo "  ✗ Port may be blocked by iptables"
    fi
fi
echo ""

# 5. Error log
echo "5. Recent error log entries:"
if [ -f "/var/log/$SERVICE_NAME/error.log" ]; then
    echo "  Latest errors:"
    tail -5 "/var/log/$SERVICE_NAME/error.log" 2>/dev/null
else
    echo "  Log file does not exist or the path is wrong"
fi
```
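The HTTP test in the script treats only 200, 301, and 302 as healthy. When a different code comes back, the code itself narrows down the layer. The sketch below maps the value printed by `curl -w '%{http_code}'` to a coarse category. The function name and the category labels are illustrative assumptions. The one curl-specific fact it relies on is that curl prints `000` when no HTTP response was received at all.

```bash
#!/bin/bash
# Map an HTTP status code (as printed by curl -w '%{http_code}') to a category.
classify_http_status() {
    case "$1" in
        000)         echo "no-connection" ;;   # no HTTP response: service down, port closed, or blocked
        2??|3??)     echo "ok" ;;
        502|503|504) echo "upstream-error" ;;  # proxy is up, but the backend behind it is not answering properly
        4??)         echo "client-error" ;;    # server is reachable; check the URL, auth, or vhost config
        5??)         echo "server-error" ;;    # application failure; check the application logs
        *)           echo "unknown" ;;
    esac
}

# Example call:
#   code=$(curl -s -o /dev/null -w '%{http_code}' "http://localhost:80")
#   classify_http_status "$code"
```

A result of `upstream-error` points at the reverse-proxy scenario from section 1.1: the web server itself answers, so the next step is the backend service rather than the firewall.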
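4. Application Layer Troubleshooting

4.1 Process Checks

Process status checks

```bash
# Find Java processes
ps aux | grep java
jps -l

# Find Python processes
ps aux | grep python

# Find Node.js processes
ps aux | grep node

# Find processes owned by a specific user
ps aux | grep username

# Find the process using a port
lsof -i :8080
ss -tulpn | grep :8080

# Show the process tree
pstree -p
```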
Process troubleshooting script

```bash
#!/bin/bash
# Process check script
# Usage: $0 <process name>

APP_NAME="$1"

echo "=== Process Check ==="
echo ""

if [ -z "$APP_NAME" ]; then
    echo "Usage: $0 <process name>"
    echo "Example: $0 java"
    exit 1
fi

# Find matching processes, excluding grep and this script itself
# (the script's own command line contains the search term)
echo "1. Looking for $APP_NAME processes:"
PROCESSES=$(ps aux | grep "$APP_NAME" | grep -v grep | awk -v self="$$" '$2 != self')

if [ -n "$PROCESSES" ]; then
    echo "$PROCESSES"
    echo ""

    # Extract the PIDs (second column of ps aux)
    PIDS=$(echo "$PROCESSES" | awk '{print $2}')

    echo "2. Process details:"
    for PID in $PIDS; do
        echo "PID: $PID"
        echo "  - Command line: $(ps -p "$PID" -o cmd=)"
        echo "  - Memory usage: $(ps -p "$PID" -o rss=) KB"
        echo "  - CPU usage: $(ps -p "$PID" -o %cpu=) %"
        echo ""
    done

    echo "3. Listening ports:"
    for PID in $PIDS; do
        SOCKETS=$(lsof -p "$PID" 2>/dev/null | grep LISTEN)
        if [ -n "$SOCKETS" ]; then
            echo "Ports listened on by PID $PID:"
            echo "$SOCKETS" | awk '{print "  - " $1 " " $9}'
        fi
    done
else
    echo "✗ No $APP_NAME process found"
    echo ""
    echo "Possible causes:"
    echo "1. The process was never started"
    echo "2. The process has crashed"
    echo "3. The process name used for the search is wrong"
fi
```
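4.2 Log Analysis

Application log checks

```bash
# Follow the log in real time
tail -f /var/log/application.log

# Show the last 100 lines
tail -n 100 /var/log/application.log

# Find error lines
grep -i error /var/log/application.log

# Find entries for a specific date
grep "2024-11-18" /var/log/application.log

# Count error lines
grep -i error /var/log/application.log | wc -l

# Show the most recent errors
tail -n 1000 /var/log/application.log | grep -i error | tail -n 20
```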
Log analysis script

```bash
#!/bin/bash
# Log analysis script
# Usage: $0 [log file] [time range: 1h|1d|1w]

LOG_FILE="${1:-/var/log/application.log}"
TIME_RANGE="${2:-1h}"

if [ ! -f "$LOG_FILE" ]; then
    echo "Log file does not exist: $LOG_FILE"
    exit 1
fi

echo "=== Log Analysis ==="
echo "File: $LOG_FILE"
echo "Time range: $TIME_RANGE"
echo ""

# Map the time range to a relative offset (a form that GNU `date -d` accepts).
# Note: the statistics below still scan the whole file.
case $TIME_RANGE in
    1h) SINCE="-1 hour" ;;
    1d) SINCE="-1 day" ;;
    1w) SINCE="-1 week" ;;
    *)  SINCE="-1 day" ;;
esac

# Total number of error lines
echo "1. Error count:"
ERROR_COUNT=$(grep -ci error "$LOG_FILE")
echo "  Total errors: $ERROR_COUNT"
echo ""

# Most recent errors
echo "2. Most recent errors (last 10):"
grep -i error "$LOG_FILE" | tail -10
echo ""

# Distribution by the word following ERROR (the "I" flag requires GNU sed)
echo "3. Error type distribution:"
grep -i error "$LOG_FILE" | sed 's/.*ERROR/ERROR/i' | cut -d' ' -f2 | sort | uniq -c | sort -rn | head -10
echo ""

# Errors per minute, assuming "YYYY-MM-DD HH:MM" timestamps
echo "4. Error time distribution:"
grep -i error "$LOG_FILE" | grep -oE "[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}" | sort | uniq -c | tail -20
```
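The script maps the time range to a relative offset, but its statistics cover the whole file. One way to actually restrict the analysis to a window is sketched below. It assumes every log entry starts with a `YYYY-MM-DD HH:MM:SS` timestamp, the same shape the script's own time-distribution step looks for; the function name `lines_since` is made up for this example. Because that timestamp format sorts lexicographically, a plain string comparison is enough.

```bash
#!/bin/bash
# Print only the lines on stdin whose leading timestamp is at or after the cutoff.
# $1 = cutoff in "YYYY-MM-DD HH:MM:SS" form.
# Lines that do not start with a date (e.g. stack-trace continuations) are dropped.
lines_since() {
    awk -v cutoff="$1" '
        /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] / && substr($0, 1, 19) >= cutoff
    '
}

# Example call (date -d is a GNU date option):
#   cutoff=$(date -d "-1 hour" "+%Y-%m-%d %H:%M:%S")
#   lines_since "$cutoff" < /var/log/application.log | grep -ci error
```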
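5. Port and Firewall Troubleshooting

5.1 Port Checks

Listening-port checks

```bash
# List all listening ports
netstat -tuln
ss -tuln

# Check a specific port
netstat -tuln | grep :8080
ss -tuln | grep :8080

# Find the process using a port
lsof -i :8080
fuser 8080/tcp

# List the files and sockets opened by a process (replace PID)
lsof -p PID
ss -tupn | grep PID

# Test a remote port
nc -zv 192.168.1.100 8080
telnet 192.168.1.100 8080
```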
Port check script

```bash
#!/bin/bash
# Port check script
# Usage: $0 [host]   (defaults to localhost)

HOST="${1:-localhost}"
PORTS=("22" "80" "443" "8080" "9090")

echo "=== Port Check ==="
echo "Host: $HOST"
echo ""

for port in "${PORTS[@]}"; do
    # -z: scan without sending data, -w 3: three-second timeout
    if nc -zv -w 3 "$HOST" "$port" > /dev/null 2>&1; then
        echo "✓ Port $port is open"
    else
        echo "✗ Port $port is closed or blocked"
    fi
done
```
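The script above depends on `nc`, which minimal server images do not always ship. As a possible fallback, bash can open a TCP connection itself through its `/dev/tcp/<host>/<port>` redirection. The sketch below assumes bash was built with that feature (the common case on mainstream distributions) and that `timeout` from coreutils is available; the function name `port_open` is hypothetical.

```bash
#!/bin/bash
# Return 0 if a TCP connection to host:port succeeds within the time limit.
# $1 = host, $2 = port, $3 = timeout in seconds (assumed default: 3)
port_open() {
    # Inside the child shell, $0 is the host and $1 is the port.
    timeout "${3:-3}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example call:
#   if port_open 192.168.1.100 8080; then echo "open"; else echo "closed or blocked"; fi
```

Like `nc -z`, this only proves that the TCP handshake completes. It says nothing about whether the application behind the port answers correctly.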
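5.2 Firewall Checks

iptables

```bash
# List the rules
sudo iptables -L -n

# List the rules with packet and byte counters
sudo iptables -L -n -v

# List the NAT table
sudo iptables -t nat -L -n

# List the INPUT chain with rule numbers
sudo iptables -L INPUT -n --line-numbers

# Allow a port temporarily (inserted at the top of the INPUT chain)
sudo iptables -I INPUT -p tcp --dport 8080 -j ACCEPT

# Save the rules (path used by iptables-persistent on Debian/Ubuntu)
sudo iptables-save | sudo tee /etc/iptables/rules.v4
```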
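firewalld

```bash
# Check whether firewalld is running
sudo firewall-cmd --state

# Show the full configuration of the active zone
sudo firewall-cmd --list-all

# List the open ports
sudo firewall-cmd --list-ports

# List the allowed services
sudo firewall-cmd --list-services

# Open a port temporarily (runtime only; lost on reload or reboot)
sudo firewall-cmd --add-port=8080/tcp

# Open a port permanently, then reload to apply it
sudo firewall-cmd --add-port=8080/tcp --permanent
sudo firewall-cmd --reload

# Allow a service permanently
sudo firewall-cmd --add-service=http --permanent
sudo firewall-cmd --reload
```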
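A port opened without `--permanent` exists only in the runtime configuration and disappears on the next `--reload` or reboot, which is an easy way for a working service to become unreachable again later. The sketch below compares the two port lists to find such entries. The function name `runtime_only_ports` is an assumption of this example, and it only compares literal `port/proto` entries, so port ranges and ports opened through services are not covered.

```bash
#!/bin/bash
# Print ports that are open at runtime but missing from the permanent configuration.
# $1 = output of: firewall-cmd --list-ports
# $2 = output of: firewall-cmd --permanent --list-ports
runtime_only_ports() {
    for p in $1; do               # intentionally unquoted: split on whitespace
        case " $2 " in
            *" $p "*) ;;          # also saved permanently
            *) echo "$p" ;;       # would be lost on reload or reboot
        esac
    done
}

# Example call:
#   runtime_only_ports "$(sudo firewall-cmd --list-ports)" \
#                      "$(sudo firewall-cmd --permanent --list-ports)"
```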
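6. Database Connection Troubleshooting

6.1 Database Connection Checks

MySQL connection checks

```bash
# Check the MySQL service status
systemctl status mysql

# Check that the MySQL port is listening
netstat -tuln | grep :3306

# Test a connection to a remote MySQL server
mysql -h 192.168.1.20 -u root -p -e "SELECT 1"

# Show the current connections
mysql -u root -p -e "SHOW PROCESSLIST;"

# Show server status counters
mysql -u root -p -e "SHOW STATUS;"

# Check whether the slow query log is enabled
mysql -u root -p -e "SHOW VARIABLES LIKE 'slow_query_log';"
```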
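Redis connection checks

```bash
# Check the Redis service status
systemctl status redis

# Test the local connection (a healthy server replies PONG)
redis-cli ping

# Test a remote connection
redis-cli -h 192.168.1.30 ping

# Show server information
redis-cli info

# Show client connection information
redis-cli info clients
```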
6.2 Database Connection Script

```bash
#!/bin/bash
# Database connection check script
# Hosts and credentials below are placeholders; replace them with your own.

echo "=== Database Connection Check ==="
echo ""

# MySQL
echo "1. MySQL connection:"
if command -v mysql &> /dev/null; then
    # Local connection
    if mysql -h 127.0.0.1 -u root -p'password' -e "SELECT 1" > /dev/null 2>&1; then
        echo "  ✓ MySQL local connection OK"
    else
        echo "  ✗ MySQL local connection failed"
    fi

    # Remote connection
    if mysql -h 192.168.1.20 -u root -p'password' -e "SELECT 1" > /dev/null 2>&1; then
        echo "  ✓ MySQL remote connection OK"
    else
        echo "  ✗ MySQL remote connection failed"
    fi
else
    echo "  MySQL client is not installed"
fi
echo ""

# Redis
echo "2. Redis connection:"
if command -v redis-cli &> /dev/null; then
    # Local connection
    if redis-cli -h 127.0.0.1 ping > /dev/null 2>&1; then
        echo "  ✓ Redis local connection OK"
    else
        echo "  ✗ Redis local connection failed"
    fi

    # Remote connection
    if redis-cli -h 192.168.1.30 ping > /dev/null 2>&1; then
        echo "  ✓ Redis remote connection OK"
    else
        echo "  ✗ Redis remote connection failed"
    fi
else
    echo "  Redis client is not installed"
fi
echo ""

# MongoDB (uses the legacy "mongo" shell; newer releases ship "mongosh" instead)
echo "3. MongoDB connection:"
if command -v mongo &> /dev/null; then
    if mongo --quiet --eval "db.stats()" > /dev/null 2>&1; then
        echo "  ✓ MongoDB local connection OK"
    else
        echo "  ✗ MongoDB local connection failed"
    fi
else
    echo "  MongoDB client is not installed"
fi
```
7. Complete Troubleshooting Script

7.1 One-Stop Troubleshooting Script

```bash
#!/bin/bash
# One-stop server troubleshooting script
# Usage: $0 [server IP]   (defaults to this host's first IP address)

echo "=========================================="
echo "Server Troubleshooting Diagnostics"
echo "=========================================="
echo ""

SERVER_IP="${1}"
if [ -z "$SERVER_IP" ]; then
    SERVER_IP=$(hostname -I | awk '{print $1}')
fi

echo "Target server: $SERVER_IP"
echo ""

# 1. Network layer
echo "=== 1. Network Layer Checks ==="
echo ""
echo "1.1 Ping test:"
if ping -c 4 -W 2 "$SERVER_IP" > /dev/null 2>&1; then
    echo "  ✓ Network connectivity OK"
    ping -c 4 "$SERVER_IP" | tail -2
else
    echo "  ✗ Network unreachable"
fi
echo ""

echo "1.2 Gateway test:"
# Take the next hop of the first default route only
GATEWAY=$(ip route | awk '/^default/ {print $3; exit}')
if [ -n "$GATEWAY" ] && ping -c 2 "$GATEWAY" > /dev/null 2>&1; then
    echo "  ✓ Gateway $GATEWAY is reachable"
else
    echo "  ✗ Gateway $GATEWAY is not reachable"
fi
echo ""

# 2. Network interfaces
echo "=== 2. Network Interface Checks ==="
# Strip "@..." suffixes such as eth0@if5 so the name can be passed to ip addr
INTERFACES=$(ip link show | grep -E "^[0-9]+:" | awk -F': ' '{print $2}' | cut -d@ -f1)
for IFACE in $INTERFACES; do
    echo "Interface: $IFACE"
    ip addr show "$IFACE" | grep -E "inet |state " | sed 's/^/  /'
done
echo ""

# 3. Service layer
echo "=== 3. Service Layer Checks ==="
echo ""
echo "3.1 SSH service:"
if systemctl is-active --quiet ssh || systemctl is-active --quiet sshd; then
    echo "  ✓ SSH service is running"
else
    echo "  ✗ SSH service is not running"
fi

if ss -tuln | grep -q ':22 '; then
    echo "  ✓ SSH port 22 is listening"
else
    echo "  ✗ SSH port 22 is not listening"
fi
echo ""

echo "3.2 Web service:"
if systemctl is-active --quiet nginx; then
    echo "  ✓ Nginx is running"
elif systemctl is-active --quiet apache2; then
    echo "  ✓ Apache is running"
else
    echo "  ✗ No web service is running"
fi

if ss -tuln | grep -q ':80 '; then
    echo "  ✓ HTTP port 80 is listening"
else
    echo "  ✗ HTTP port 80 is not listening"
fi

if ss -tuln | grep -q ':443 '; then
    echo "  ✓ HTTPS port 443 is listening"
else
    echo "  ✗ HTTPS port 443 is not listening"
fi
echo ""

# 4. Firewall
echo "=== 4. Firewall Checks ==="
if command -v firewall-cmd &> /dev/null; then
    echo "Firewalld state:"
    firewall-cmd --state
    firewall-cmd --list-all | grep -E "ports:|services:" | sed 's/^/  /'
elif command -v iptables &> /dev/null; then
    echo "Iptables rule count:"
    # Count appended rules only; chain policy lines are excluded
    iptables -S | grep -c '^-A' | awk '{print "  " $1 " rules"}'
fi
echo ""

# 5. System resources
echo "=== 5. System Resource Checks ==="
echo ""
echo "5.1 CPU usage:"
# Extract the idle percentage from top, then report 100 - idle as usage
top -bn1 | grep "Cpu(s)" | sed 's/.*, *\([0-9.]*\)%* id.*/\1/' | awk '{print "  CPU usage: " (100 - $1) "%"}'
echo ""

echo "5.2 Memory usage:"
free -m | awk '/^Mem:/ {printf "  Total: %s MB\n  Used: %s MB (%.0f%%)\n", $2, $3, $3 / $2 * 100}'
echo ""

echo "5.3 Disk usage:"
df -h | grep -E "^/dev/" | awk '{print "  " $1 ": " $3 " / " $2 " (" $5 ")"}'
echo ""

# 6. Processes
echo "=== 6. Process Checks ==="
echo ""
echo "6.1 Key services running:"
for service in nginx apache2 mysql redis; do
    # pgrep matches by pattern, so "mysql" also matches mysqld
    if pgrep "$service" > /dev/null; then
        echo "  ✓ $service process is running"
    else
        echo "  ✗ $service process is not running"
    fi
done
echo ""

# 7. Logs
echo "=== 7. Log Checks ==="
echo ""
echo "7.1 System log errors:"
journalctl --no-pager -p err -n 5 | tail -5
echo ""

echo "=========================================="
echo "Diagnostics complete"
echo "=========================================="
```
8. Real-World Troubleshooting Cases

8.1 Case 1: SSH Connection Fails

Symptom

```text
$ ssh root@192.168.1.100
ssh: connect to host 192.168.1.100 port 22: Connection refused
```
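Troubleshooting steps

```bash
# 1. Check network connectivity
ping -c 4 192.168.1.100

# 2. Check whether the port is reachable
nc -zv 192.168.1.100 22

# 3. Check the SSH service (on the server, e.g. through a console)
systemctl status ssh

# 4. If the service is stopped, start it
systemctl start ssh

# 5. Enable it at boot
systemctl enable ssh

# 6. Verify
ssh root@192.168.1.100
```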
8.2 Case 2: Web Service Unreachable

Symptom

```text
$ curl http://www.example.com
curl: (7) Failed to connect to www.example.com port 80: Connection timed out
```
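Troubleshooting steps

```bash
# 1. Check the Nginx service
systemctl status nginx

# 2. If the service is stopped, start it
systemctl start nginx

# 3. Check that the port is listening
netstat -tuln | grep :80

# 4. Validate the configuration
nginx -t

# 5. Fix the configuration if the test reports errors
vim /etc/nginx/nginx.conf

# 6. Reload the configuration
nginx -s reload

# 7. Verify locally
curl http://localhost
```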
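The two cases fail with different client messages, and the difference is itself a hint about where to look. "Connection refused" means the host answered and actively rejected the connection, which usually means nothing is listening on that port. "Connection timed out" means no answer came back at all, which more often points at a firewall drop or a network problem than at a stopped service. The sketch below maps common client error text to a first place to look; the function name and the category labels are assumptions made for illustration.

```bash
#!/bin/bash
# Map a client-side connection error message to the layer worth checking first.
diagnose_connect_error() {
    case "$1" in
        *"Connection refused"*)
            echo "service" ;;               # host reachable, port closed: check the service and its port
        *"Connection timed out"*|*"Operation timed out"*)
            echo "firewall-or-network" ;;   # packets dropped: check firewall rules and routing
        *"No route to host"*)
            echo "routing" ;;               # no path to the address: check routes and the host itself
        *"Could not resolve"*|*"Name or service not known"*)
            echo "dns" ;;                   # name lookup failed before any connection was attempted
        *"Permission denied"*)
            echo "auth" ;;                  # connection works; credentials or login policy rejected
        *)
            echo "unknown" ;;
    esac
}

# Example call:
#   diagnose_connect_error "$(ssh -o BatchMode=yes root@192.168.1.100 true 2>&1)"
```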
9. Failure Prevention and Monitoring

9.1 Automated Monitoring Script

```bash
#!/bin/bash
# Service monitoring script: checks services, disk, and memory once a minute

while true; do
    # Check key services and restart any that have stopped
    for service in nginx mysql redis; do
        if ! systemctl is-active --quiet "$service"; then
            echo "[$(date)] WARNING: $service service is stopped"
            systemctl restart "$service"
        fi
    done

    # Check disk usage of the root filesystem
    DISK_USAGE=$(df / | awk 'NR==2 {print $5}' | sed 's/%//')
    if [ "$DISK_USAGE" -gt 85 ]; then
        echo "[$(date)] WARNING: disk usage ${DISK_USAGE}%"
    fi

    # Check free memory (the "free" column does not count reclaimable cache)
    MEM_FREE=$(free | grep Mem | awk '{printf "%.0f", $4/$2*100}')
    if [ "$MEM_FREE" -lt 10 ]; then
        echo "[$(date)] WARNING: free memory ${MEM_FREE}%"
    fi

    sleep 60
done
```
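For memory alerts, the kernel's `MemAvailable` estimate is often a better basis than the raw free figure, because Linux deliberately uses otherwise idle memory for page cache that it can reclaim on demand. The sketch below computes the available percentage from `/proc/meminfo`-formatted input. The function name `mem_available_pct` is hypothetical, and the `MemAvailable` field is assumed to be present (it is on kernels 3.14 and later).

```bash
#!/bin/bash
# Print MemAvailable as an integer percentage of MemTotal.
# Reads /proc/meminfo-formatted text from stdin.
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }'
}

# Example call:
#   pct=$(mem_available_pct < /proc/meminfo)
#   [ "$pct" -lt 10 ] && echo "[$(date)] WARNING: available memory ${pct}%"
```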
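10. Summary

Troubleshooting an unreachable server calls for a systematic method. This article covered:

Key points

Layered troubleshooting: locate the fault layer by layer, from the network layer to the transport layer to the application layer
Tool combination: ping, traceroute, netstat, ss, curl, and others
Automation scripts: pinpoint problems quickly
Monitoring and alerting: prevent failures

Troubleshooting approach

Network layer: ping, traceroute, routing table
Transport layer: listening ports, firewall
Application layer: service status, processes, logs

Practical recommendations

Build a comprehensive monitoring system
Define a standardized troubleshooting procedure
Write automated diagnostic scripts
Run failure drills regularly

With a systematic approach, you can locate problems quickly and restore service.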