Episode 258: CPU Spike? Pinpoint It in 3 Seconds with Arthas. Architecture in Practice: Arthas Diagnosis, Performance Analysis, and Enterprise-Grade Rapid Fault Localization
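Preface

CPU spikes are among the most common performance problems in production. Traditional troubleshooting often means restarting the application, generating heap dumps, and combing through logs, which is slow and disrupts the business. Arthas, the open-source Java diagnostic tool from Alibaba, attaches to a running JVM and can locate the cause of a CPU spike without a restart. This article moves from basic Arthas usage to advanced diagnosis, and from CPU analysis to performance optimization, to lay out a complete approach to rapid CPU fault localization in enterprise environments.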
1. Arthas Diagnostic Architecture Design

1.1 Arthas Diagnostic Architecture

1.2 Arthas Core Feature Architecture
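2. Arthas Quick Installation and Configuration

2.1 Installing and Deploying Arthas

2.1.1 Online Installation

```bash
# Option 1: download arthas-boot with curl
curl -O https://arthas.aliyun.com/arthas-boot.jar

# Option 2: download arthas-boot with wget
wget https://arthas.aliyun.com/arthas-boot.jar

# Option 3: fetch the full package through Maven
mvn dependency:get -Dartifact=com.taobao.arthas:arthas-packaging:3.6.7

# Start Arthas and choose the target Java process from the list it prints
java -jar arthas-boot.jar
```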
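2.1.2 Offline Installation

```bash
# Download the full package on a machine with network access
wget https://arthas.aliyun.com/download/arthas-packaging-3.6.7-bin.zip

# Unpack it on the target machine
unzip arthas-packaging-3.6.7-bin.zip
cd arthas-packaging-3.6.7

# Install locally
./install-local.sh

# Start Arthas
./as.sh
```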
2.1.3 Installation in a Docker Environment

```dockerfile
# Dockerfile
FROM openjdk:8-jdk-alpine

# Download arthas-boot into the image
RUN wget https://arthas.aliyun.com/arthas-boot.jar -O /opt/arthas-boot.jar

WORKDIR /opt

# Start Arthas when the container starts
CMD ["java", "-jar", "arthas-boot.jar"]
```

```yaml
# docker-compose.yml
version: '3.8'
services:
  arthas:
    image: arthas:latest
    container_name: arthas-diagnostic
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /proc:/host/proc:ro
    environment:
      - JAVA_HOME=/usr/lib/jvm/java-8-openjdk
    command: java -jar /opt/arthas-boot.jar
```
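2.2 Tuning the Arthas Configuration

The two listings below show the layout this article uses. Check every key against the arthas.properties file shipped with your Arthas version before relying on it; that file documents settings such as arthas.telnetPort, arthas.httpPort and arthas.sessionTimeout.

2.2.1 Basic Configuration

```properties
# Arthas home directory
arthas.home=/opt/arthas

# Logging
arthas.log.level=INFO
arthas.log.file=/var/log/arthas/arthas.log
arthas.log.max.size=100MB
arthas.log.max.days=7

# Command behaviour
arthas.command.timeout=30000
arthas.command.history.max=1000
arthas.command.auto.completion=true

# Command aliases
arthas.command.alias.thread=t
arthas.command.alias.monitor=m
arthas.command.alias.watch=w
```

2.2.2 Performance Configuration

```properties
# Sampling
arthas.sample.rate=0.1

# Monitoring limits
arthas.monitor.max.methods=1000
arthas.monitor.max.time=300000

# Resource limits
arthas.memory.limit=512MB
arthas.cpu.limit=50%

# Network
arthas.network.timeout=10000
arthas.retry.count=3
```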
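3. Arthas Core Commands in Detail

3.1 Thread Analysis Commands

3.1.1 The thread Command

```bash
# Show all threads
thread

# Show the 3 busiest threads together with their stacks
thread -n 3

# Find the thread that is blocking other threads
thread -b

# Show the stack of the thread with ID 1
thread 1

# Set the CPU sampling interval to 1000 ms
thread -i 1000

# Filter by thread state
thread --state BLOCKED
thread --state WAITING
thread --state RUNNABLE

# DEADLOCK is not a java.lang.Thread.State value, so it cannot be passed to --state.
# To investigate a suspected deadlock, use thread -b to find the thread holding the contended lock.

# Show the 3 busiest threads, sampled over 2000 ms
thread -n 3 -i 2000
```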
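To make the idea behind `thread -n` and `thread -i` concrete, here is a minimal sketch of ranking threads by the CPU time they consume during a sampling window, using only the JDK's `ThreadMXBean`. It is an illustration of the technique, not Arthas's actual implementation; the class `TopCpuThreads`, its `Entry` type and the `sample` method are hypothetical names chosen for this example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: ranks live threads by CPU time used during a sampling window. */
final class TopCpuThreads {

    /** One ranked entry: thread id, name, and CPU nanoseconds used inside the window. */
    static final class Entry {
        final long id;
        final String name;
        final long cpuNanos;

        Entry(long id, String name, long cpuNanos) {
            this.id = id;
            this.name = name;
            this.cpuNanos = cpuNanos;
        }
    }

    /** Samples per-thread CPU time twice, intervalMillis apart, and returns the topN largest deltas. */
    static List<Entry> sample(int topN, long intervalMillis) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<Entry> ranked = new ArrayList<>();
        if (!mx.isThreadCpuTimeSupported()) {
            return ranked; // this JVM cannot report per-thread CPU time
        }

        // First sample: CPU time per thread id (-1 means the value is unavailable).
        Map<Long, Long> before = new HashMap<>();
        for (long id : mx.getAllThreadIds()) {
            before.put(id, mx.getThreadCpuTime(id));
        }

        try {
            Thread.sleep(intervalMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt flag and use the shorter window
        }

        // Second sample: keep threads that were alive and measurable at both points.
        for (long id : mx.getAllThreadIds()) {
            Long start = before.get(id);
            long end = mx.getThreadCpuTime(id);
            ThreadInfo info = mx.getThreadInfo(id);
            if (start == null || start < 0 || end < 0 || info == null) {
                continue;
            }
            ranked.add(new Entry(id, info.getThreadName(), end - start));
        }

        ranked.sort(Comparator.comparingLong((Entry e) -> e.cpuNanos).reversed());
        return new ArrayList<>(ranked.subList(0, Math.min(topN, ranked.size())));
    }

    public static void main(String[] args) {
        for (Entry e : sample(3, 200)) {
            System.out.printf("id=%d name=%s cpu=%.1f ms%n", e.id, e.name, e.cpuNanos / 1e6);
        }
    }
}
```

A longer interval smooths out short bursts, which is the same trade-off as raising the value passed to `thread -i`.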
3.1.2 Thread Analysis in Practice

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sample workloads to practise the thread command against
public class ThreadAnalysisExample {

    private final ExecutorService executor = Executors.newFixedThreadPool(10);
    private final Object lock = new Object();

    // CPU-intensive task: appears near the top of `thread -n`
    public void cpuIntensiveTask() {
        executor.submit(() -> {
            while (true) {
                // Simulate a CPU-heavy computation
                long result = 0;
                for (int i = 0; i < 1000000; i++) {
                    result += Math.sqrt(i);
                }

                try {
                    Thread.sleep(100);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        });
    }

    // Blocking task: holds the lock for 10 s, so concurrent callers show up as BLOCKED
    public void blockingTask() {
        executor.submit(() -> {
            synchronized (lock) {
                try {
                    Thread.sleep(10000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
    }

    // Deadlock task: two threads take the same two locks in opposite order
    public void deadlockTask() {
        Object lock1 = new Object();
        Object lock2 = new Object();

        executor.submit(() -> {
            synchronized (lock1) {
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                synchronized (lock2) {
                    // never reached once the deadlock has formed
                }
            }
        });

        executor.submit(() -> {
            synchronized (lock2) {
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                synchronized (lock1) {
                    // never reached once the deadlock has formed
                }
            }
        });
    }
}
```
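The lock-ordering deadlock in `deadlockTask` can also be detected from inside the JVM. The sketch below uses the JDK's `ThreadMXBean` deadlock detection, which is a standard-library technique and not an Arthas API. `DeadlockProbe` and its methods are hypothetical names; the latch is added only so the deadlock forms deterministically instead of relying on sleeps.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

/** Hypothetical helper: provokes and detects a lock-ordering deadlock. */
final class DeadlockProbe {

    /** Returns the ids of deadlocked threads, or an empty array when there is no deadlock. */
    static long[] deadlockedThreadIds() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // findDeadlockedThreads also covers java.util.concurrent locks where the JVM supports it.
        long[] ids = mx.isSynchronizerUsageSupported()
                ? mx.findDeadlockedThreads()
                : mx.findMonitorDeadlockedThreads();
        return ids == null ? new long[0] : ids;
    }

    /** Starts two daemon threads that take the same two locks in opposite order. */
    static void provokeDeadlock() {
        Object lock1 = new Object();
        Object lock2 = new Object();
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
        startDaemon("deadlock-a", lock1, lock2, bothHoldFirstLock);
        startDaemon("deadlock-b", lock2, lock1, bothHoldFirstLock);
    }

    private static void startDaemon(String name, Object first, Object second, CountDownLatch latch) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                latch.countDown();
                try {
                    latch.await(); // wait until the other thread holds its first lock too
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                synchronized (second) {
                    // never reached: the other thread holds this lock and waits for ours
                }
            }
        }, name);
        t.setDaemon(true); // daemon threads do not keep the JVM alive
        t.start();
    }

    /** Polls until a deadlock is reported or the timeout elapses. */
    static long[] awaitDeadlock(long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        long[] ids = deadlockedThreadIds();
        while (ids.length == 0 && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(20);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            ids = deadlockedThreadIds();
        }
        return ids;
    }

    public static void main(String[] args) {
        provokeDeadlock();
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : mx.getThreadInfo(awaitDeadlock(2000))) {
            if (info != null) {
                System.out.println(info.getThreadName() + " waits for " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
    }
}
```

In a live process the same situation is what `thread -b` is meant to surface: the thread that owns a monitor other threads are waiting on.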
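3.2 Performance Analysis Commands

3.2.1 The profiler Command

```bash
# Start sampling (the default event is cpu)
profiler start

# Set the sampling interval in nanoseconds (1000000 ns = 1 ms)
profiler start --interval 1000000

# Choose the event to sample
profiler start --event cpu
profiler start --event alloc
profiler start --event lock

# Stop sampling and write the result
profiler stop

# Show the profiler status
profiler status

# Choose the output format
profiler stop --format html
profiler stop --format jfr

# Whether svg and text are accepted depends on the async-profiler version bundled with your Arthas release
profiler stop --format svg
profiler stop --format text

# Write the result to a specific file
profiler stop --file /tmp/profile.html

# Show help
profiler --help
```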
3.2.2 The monitor Command

```bash
# Report call statistics for getUserById every 5 seconds
monitor -c 5 com.example.service.UserService getUserById

# Evaluate the condition before the method is invoked (-b)
monitor -c 5 -b com.example.service.UserService getUserById 'params[0] != null'

# Only count invocations that took longer than 1000 ms
# (the condition is the third positional argument, not a --condition or --express option)
monitor -c 5 com.example.service.UserService getUserById '#cost > 1000'
```
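Each reporting cycle of `monitor` boils down to a handful of aggregates: total calls, successes, failures, average response time and failure rate. The following sketch shows that arithmetic in isolation. It is an assumption-laden illustration rather than Arthas code; `MonitorWindow` and its methods are hypothetical names, and the output format is made up for this example.

```java
import java.util.Locale;

/** Hypothetical aggregate for one monitor cycle: counts calls and averages their cost. */
final class MonitorWindow {
    private long total;
    private long success;
    private long fail;
    private double costSumMs;

    /** Records one finished invocation and whether it returned normally. */
    synchronized void record(double costMs, boolean succeeded) {
        total++;
        if (succeeded) {
            success++;
        } else {
            fail++;
        }
        costSumMs += costMs;
    }

    /** Average response time in milliseconds, 0 when nothing was recorded. */
    synchronized double avgRtMs() {
        return total == 0 ? 0.0 : costSumMs / total;
    }

    /** Fraction of invocations that failed, 0 when nothing was recorded. */
    synchronized double failRate() {
        return total == 0 ? 0.0 : (double) fail / total;
    }

    /** Formats the current cycle as one row and starts a new cycle. */
    synchronized String snapshotAndReset() {
        String row = String.format(Locale.ROOT,
                "total=%d success=%d fail=%d avg-rt(ms)=%.2f fail-rate=%.2f%%",
                total, success, fail, avgRtMs(), failRate() * 100);
        total = 0;
        success = 0;
        fail = 0;
        costSumMs = 0;
        return row;
    }

    public static void main(String[] args) {
        MonitorWindow window = new MonitorWindow();
        window.record(10.0, true);
        window.record(30.0, false);
        System.out.println(window.snapshotAndReset());
    }
}
```

Because the numbers are reset every cycle, a short `-c` value reacts quickly to a regression but produces noisier averages.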
3.2.3 The watch Command

```bash
# Observe getUserById with the default expression
watch com.example.service.UserService getUserById

# Print parameters and return value
watch com.example.service.UserService getUserById '{params,returnObj}'

# Print parameters, return value and cost in milliseconds
watch com.example.service.UserService getUserById '{params,returnObj,#cost}'

# Print parameters, return value and any thrown exception
watch com.example.service.UserService getUserById '{params,returnObj,throwExp}'

# Only print invocations that took longer than 1000 ms
watch com.example.service.UserService getUserById '{params,returnObj}' '#cost > 1000'

# Stop after 10 matching invocations
watch com.example.service.UserService getUserById '{params,returnObj}' -n 10

# Combine the cost filter with the invocation limit
watch com.example.service.UserService getUserById '{params,returnObj}' '#cost > 1000' -n 10
```
3.3 Advanced Diagnostic Commands

3.3.1 The trace Command

```bash
# Trace the calls made inside getUserById and the time spent in each
trace com.example.service.UserService getUserById

# Only show invocations that took longer than 1000 ms
# (the argument after the method name is a condition, not an output expression)
trace com.example.service.UserService getUserById '#cost > 1000'

# Stop after 10 matching invocations
trace com.example.service.UserService getUserById -n 10

# Combine the cost filter with the invocation limit
trace com.example.service.UserService getUserById '#cost > 1000' -n 10
```
3.3.2 The stack Command

```bash
# Print the call stack that leads into getUserById
stack com.example.service.UserService getUserById

# Only print stacks for invocations that took longer than 1000 ms
# (the argument after the method name is a condition, not an output expression)
stack com.example.service.UserService getUserById '#cost > 1000'

# Stop after 10 matching invocations
stack com.example.service.UserService getUserById -n 10

# Combine the cost filter with the invocation limit
stack com.example.service.UserService getUserById '#cost > 1000' -n 10
```
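The commands in this section all target `com.example.service.UserService getUserById`, which the article does not define. For trying them out locally, a stand-in target along the following lines may help. Everything in it is hypothetical: the class has no package, so the class pattern to pass to Arthas would be `UserService` rather than `com.example.service.UserService`, and the delays are chosen only so that a `#cost > 1000` filter has something to match.

```java
/** Hypothetical stand-in for the UserService used in the command examples. */
final class UserService {

    /** Returns a fake user name; ids divisible by 10 are slow, non-positive ids are rejected. */
    String getUserById(long id) {
        if (id <= 0) {
            throw new IllegalArgumentException("id must be positive: " + id);
        }
        long delayMillis = (id % 10 == 0) ? 1200 : 20; // slow path exceeds the 1000 ms filter
        try {
            Thread.sleep(delayMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "user-" + id;
    }
}

/** Keeps calling the service so there is live traffic to monitor, watch, trace and stack. */
final class UserServiceDemo {
    public static void main(String[] args) {
        UserService service = new UserService();
        for (long i = 1; ; i++) {
            try {
                service.getUserById(i % 25); // 0 triggers the exception path, 10 and 20 the slow path
            } catch (IllegalArgumentException expected) {
                // counted as a failed invocation by monitor
            }
        }
    }
}
```

Run `UserServiceDemo`, attach with `java -jar arthas-boot.jar`, and stop the demo with Ctrl+C when finished.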
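4. Rapid Localization of CPU Spikes

4.1 CPU Spike Localization Workflow
graph TD
A[CPU spike alert] --> B[Attach Arthas]
B --> C[Check thread states]
C --> D{Abnormal thread found?}
D -->|Yes| E[Analyze thread stack]
D -->|No| F[Start profiling]
E --> G[Locate the problem method]
F --> H[Generate flame graph]
G --> I[Analyze root cause]
H --> I
I --> J[Plan the fix]
J --> K[Apply the optimization]
K --> L[Verify the result]
L --> M[Problem resolved]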
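4.2 Locating a CPU Problem in 3 Seconds: Hands-On

4.2.1 Quick Connection and Diagnosis

```bash
# 1. Attach to the target Java process
java -jar arthas-boot.jar

# 2. List the busiest threads
#    (thread -b serves a different purpose: it finds a thread that is blocking others)
thread -n 3

# 3. Print the stack of the suspicious thread
thread [thread ID]
```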
4.2.2 Case Study: CPU-Intensive Tasks

```java
import org.springframework.stereotype.Service;

// Deliberately problematic code used to reproduce CPU spikes
@Service
public class CpuIntensiveService {

    // CPU-heavy computation that repeats with only a 10 ms pause
    public void cpuIntensiveCalculation() {
        while (true) {
            double result = 0;
            for (int i = 0; i < 1000000; i++) {
                result += Math.sqrt(i) * Math.sin(i) * Math.cos(i);
            }

            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
    }

    // Busy loop with no exit condition and no pause
    public void infiniteLoop() {
        while (true) {
            int count = 0;
            for (int i = 0; i < Integer.MAX_VALUE; i++) {
                count++;
                if (count % 1000000 == 0) {
                    System.out.println("Count: " + count);
                }
            }
        }
    }

    // Naive recursion: the number of calls grows exponentially with n
    public long recursiveCalculation(int n) {
        if (n <= 1) {
            return 1;
        }
        return recursiveCalculation(n - 1) + recursiveCalculation(n - 2);
    }
}
```
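4.2.3 Arthas Diagnostic Commands

```bash
# 1. Attach to the application
java -jar arthas-boot.jar

# 2. List the busiest threads
thread -n 3

# 3. Print the stack of a specific thread, for example thread ID 1
thread 1

# 4. Collect call statistics for the suspect method every 5 seconds
monitor -c 5 com.example.service.CpuIntensiveService cpuIntensiveCalculation

# 5. Observe parameters, return value and cost
watch com.example.service.CpuIntensiveService cpuIntensiveCalculation '{params,returnObj,#cost}'

# Note: monitor and watch report when an invocation finishes. A method that loops forever,
# like cpuIntensiveCalculation, produces no rows; rely on the thread stack or the profiler for it.
```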
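4.3 Performance Analysis in Practice

4.3.1 Flame Graph Analysis

```bash
# 1. In the Arthas console, start CPU sampling
profiler start

# 2. Let the application run under load for about 30 seconds

# 3. Stop sampling and write the flame graph
profiler stop --format html --file /tmp/cpu_profile.html

# 4. From a local shell, open the report in a browser (macOS example)
open /tmp/cpu_profile.html
```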
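4.3.2 Method Call Analysis

```bash
# Collect call statistics for the method every 5 seconds
monitor -c 5 com.example.service.CpuIntensiveService cpuIntensiveCalculation

# Print parameters, return value and cost for invocations slower than 1000 ms
watch com.example.service.CpuIntensiveService cpuIntensiveCalculation '{params,returnObj,#cost}' '#cost > 1000'
```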
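4.4 Root Cause Analysis

4.4.1 Code-Level Analysis

```java
import java.util.ArrayList;
import java.util.List;

// Typical code patterns behind a CPU spike
public class ProblematicCode {

    // Problem 1: infinite loop with no exit condition and no pause
    public void infiniteLoop() {
        while (true) {
            doSomething();
        }
    }

    // Problem 2: inefficient algorithm, 10000 x 10000 = 100 million calls
    public void inefficientAlgorithm() {
        for (int i = 0; i < 10000; i++) {
            for (int j = 0; j < 10000; j++) {
                calculateSomething(i, j);
            }
        }
    }

    // Problem 3: heavy object creation that drives frequent GC
    public void frequentGC() {
        List<String> list = new ArrayList<>();
        for (int i = 0; i < 1000000; i++) {
            list.add("String " + i);
        }
    }

    // Problem 4: blocking call that ties up the calling thread for 10 seconds
    public void blockingOperation() {
        try {
            Thread.sleep(10000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Placeholders for the real business logic
    private void doSomething() { }

    private void calculateSomething(int i, int j) { }
}
```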
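4.4.2 Optimization Approaches

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Counterparts to the problem patterns above
public class OptimizedCode {

    // Deadline for optimizedLoop (placeholder: 5 seconds after construction)
    private final long someTimestamp = System.currentTimeMillis() + 5_000;

    // Fix 1: give the loop an exit condition
    public void optimizedLoop() {
        boolean shouldContinue = true;
        while (shouldContinue) {
            doSomething();
            if (System.currentTimeMillis() > someTimestamp) {
                shouldContinue = false;
            }
        }
    }

    // Fix 2: replace the nested loop with a single pass and cache computed results
    public void efficientAlgorithm() {
        Map<Integer, Integer> cache = new HashMap<>();
        for (int i = 0; i < 10000; i++) {
            Integer result = cache.get(i);
            if (result == null) {
                result = calculateSomething(i);
                cache.put(i, result);
            }
        }
    }

    // Fix 3: reuse one StringBuilder instead of creating a new String per iteration
    public void reducedObjectCreation() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000000; i++) {
            sb.append("String ").append(i);
        }
    }

    // Fix 4: move slow work off the calling thread
    public void asyncOperation() {
        CompletableFuture.runAsync(() -> {
            doTimeConsumingWork();
        });
    }

    // Placeholders for the real business logic
    private void doSomething() { }

    private Integer calculateSomething(int i) { return i; }

    private void doTimeConsumingWork() { }
}
```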
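The `recursiveCalculation` method from the case study in 4.2.2 is another candidate for the same treatment: its call count grows exponentially with `n`, while the value it returns can be computed in a single loop. The sketch below keeps the recurrence exactly as written there (1 for `n <= 1`, otherwise the sum of the two previous values); `SeriesCalculator` and its method names are hypothetical.

```java
/** Hypothetical comparison of the naive recursion with a linear-time rewrite. */
final class SeriesCalculator {

    /** Same recurrence as recursiveCalculation in 4.2.2; the number of calls grows exponentially. */
    static long recursive(int n) {
        if (n <= 1) {
            return 1;
        }
        return recursive(n - 1) + recursive(n - 2);
    }

    /** Same values in O(n) time and O(1) memory; the result overflows long for n above 91. */
    static long iterative(int n) {
        if (n <= 1) {
            return 1;
        }
        long prev = 1; // value for n - 2
        long curr = 1; // value for n - 1
        for (int i = 2; i <= n; i++) {
            long next = prev + curr;
            prev = curr;
            curr = next;
        }
        return curr;
    }

    public static void main(String[] args) {
        System.out.println(iterative(10)); // prints 89
    }
}
```

In a flame graph the recursive version shows up as a deep tower of identical frames, which is usually the cue to look for a rewrite like this.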
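5. Advanced Arthas Features

5.1 Batch Operations and Scripts

5.1.1 Running Commands in Batch

```bash
# Create the batch command file.
# monitor and watch are given -n so they terminate; otherwise the batch would never finish.
# The profiler stops itself after 30 seconds because a batch file cannot pause between commands.
cat > batch_commands.txt << 'EOF'
thread -n 3
thread -b
monitor -c 5 -n 3 com.example.service.UserService getUserById
watch com.example.service.UserService getUserById '{params,returnObj,#cost}' -n 5
profiler start --duration 30 --file /tmp/profile.html
EOF

# Run the batch file against the target JVM (replace <pid> with its process ID)
java -jar arthas-boot.jar -f batch_commands.txt <pid>
```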
5.1.2 Scripted Diagnosis

The script below is pseudocode for the diagnostic logic. The `thread()` binding and its `id` and `cpuUsage` fields are hypothetical, and the `-f` option of arthas-boot reads Arthas commands rather than JavaScript, so the script cannot be passed to Arthas as-is.

```javascript
// diagnose.js: find the thread with the highest CPU usage and print its stack
function diagnose() {
    // Fetch all threads
    var threads = thread();
    console.log("Total threads: " + threads.length);

    // Find the thread with the highest CPU usage
    var maxCpuThread = null;
    var maxCpuUsage = 0;

    for (var i = 0; i < threads.length; i++) {
        var t = threads[i]; // named t so it does not shadow the thread() function
        if (t.cpuUsage > maxCpuUsage) {
            maxCpuUsage = t.cpuUsage;
            maxCpuThread = t;
        }
    }

    if (maxCpuThread === null) {
        console.log("No thread is using CPU");
        return null;
    }

    console.log("Max CPU usage thread: " + maxCpuThread.id + ", CPU: " + maxCpuUsage + "%");

    // Fetch the stack of that thread
    var stack = thread(maxCpuThread.id);
    console.log("Thread stack: " + stack);

    return {
        maxCpuThread: maxCpuThread,
        maxCpuUsage: maxCpuUsage,
        stack: stack
    };
}

var result = diagnose();
```
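5.2 Developing Custom Commands

5.2.1 A Custom Command Class

```java
import com.taobao.arthas.core.shell.command.AnnotatedCommand;
import com.taobao.arthas.core.shell.command.CommandProcess;
import com.taobao.middleware.cli.annotations.Description;
import com.taobao.middleware.cli.annotations.Name;
import com.taobao.middleware.cli.annotations.Option;

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Date;
import java.util.List;

// Custom command that writes a report of RUNNABLE threads
@Name("cpu-analyzer")
@Description("CPU analysis command")
public class CpuAnalyzerCommand extends AnnotatedCommand {

    private int time = 30;
    private String file = "/tmp/cpu_analysis.txt";

    // Options are declared on setters; the description is a separate annotation
    @Option(shortName = "t", longName = "time")
    @Description("Analysis duration in seconds")
    public void setTime(int time) {
        this.time = time;
    }

    @Option(shortName = "f", longName = "file")
    @Description("Output file")
    public void setFile(String file) {
        this.file = file;
    }

    @Override
    public void process(CommandProcess process) {
        try {
            process.write("Starting CPU analysis...\n");

            List<ThreadInfo> threads = getThreadInfo();
            CpuAnalysisResult result = analyzeCpuUsage(threads);
            generateReport(result, file);

            process.write("CPU analysis completed. Report saved to: " + file + "\n");
        } catch (Exception e) {
            process.write("Error: " + e.getMessage() + "\n");
        } finally {
            process.end(); // always release the console
        }
    }

    private List<ThreadInfo> getThreadInfo() {
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        return Arrays.asList(threadBean.getThreadInfo(threadBean.getAllThreadIds()));
    }

    private CpuAnalysisResult analyzeCpuUsage(List<ThreadInfo> threads) {
        CpuAnalysisResult result = new CpuAnalysisResult();
        for (ThreadInfo thread : threads) {
            // Entries are null for threads that exited between the two JMX calls
            if (thread != null && thread.getThreadState() == Thread.State.RUNNABLE) {
                result.addRunnableThread(thread);
            }
        }
        return result;
    }

    private void generateReport(CpuAnalysisResult result, String file) {
        try (PrintWriter writer = new PrintWriter(new FileWriter(file))) {
            writer.println("CPU Analysis Report");
            writer.println("==================");
            writer.println("Analysis Time: " + new Date());
            writer.println("Runnable Threads: " + result.getRunnableThreads().size());

            for (ThreadInfo thread : result.getRunnableThreads()) {
                writer.println("Thread: " + thread.getThreadName() +
                        ", State: " + thread.getThreadState());
            }
        } catch (IOException e) {
            throw new RuntimeException("Failed to generate report", e);
        }
    }

    // Holder for the analysis result (not defined in the original listing)
    static class CpuAnalysisResult {
        private final List<ThreadInfo> runnableThreads = new ArrayList<>();

        void addRunnableThread(ThreadInfo thread) {
            runnableThreads.add(thread);
        }

        List<ThreadInfo> getRunnableThreads() {
            return runnableThreads;
        }
    }
}
```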
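5.2.2 Registering the Command

```java
import javax.annotation.PostConstruct;
import org.springframework.stereotype.Component;

// Registers the custom command at startup.
// CommandRegistry is a placeholder used by this article, not part of the published Arthas API;
// in arthas-core the built-in commands are listed in BuiltinCommandPack.
@Component
public class CustomCommandRegistry {

    @PostConstruct
    public void registerCustomCommands() {
        CommandRegistry.getInstance().registerCommand(new CpuAnalyzerCommand());
    }
}
```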
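5.3 Integrating with Monitoring Systems

5.3.1 Integration with Prometheus

```java
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Publishes thread-state counts and process CPU usage through Micrometer
@Component
public class ArthasPrometheusIntegration {

    private static final Logger log = LoggerFactory.getLogger(ArthasPrometheusIntegration.class);

    // One mutable holder per state; the gauges read these on every scrape
    private final Map<Thread.State, AtomicLong> threadStateCount = new EnumMap<>(Thread.State.class);

    // Gauges are registered once; registering them again on every run would leave stale values
    public ArthasPrometheusIntegration(MeterRegistry meterRegistry) {
        for (Thread.State state : Thread.State.values()) {
            AtomicLong holder = new AtomicLong();
            threadStateCount.put(state, holder);
            Gauge.builder("arthas.threads.state", holder, AtomicLong::doubleValue)
                    .description("Thread count by state")
                    .tag("state", state.name())
                    .register(meterRegistry);
        }
        Gauge.builder("arthas.cpu.usage", this, ArthasPrometheusIntegration::getCpuUsage)
                .description("Process CPU usage in percent")
                .register(meterRegistry);
    }

    @Scheduled(fixedRate = 30000)
    public void collectArthasMetrics() {
        try {
            ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
            ThreadInfo[] threads = threadBean.getThreadInfo(threadBean.getAllThreadIds());

            Map<Thread.State, Long> counts = new EnumMap<>(Thread.State.class);
            for (ThreadInfo info : threads) {
                if (info != null) { // null for threads that have already exited
                    counts.merge(info.getThreadState(), 1L, Long::sum);
                }
            }
            for (Map.Entry<Thread.State, AtomicLong> entry : threadStateCount.entrySet()) {
                entry.getValue().set(counts.getOrDefault(entry.getKey(), 0L));
            }
        } catch (Exception e) {
            log.error("Failed to collect Arthas metrics", e);
        }
    }

    private double getCpuUsage() {
        OperatingSystemMXBean osBean = ManagementFactory.getOperatingSystemMXBean();
        // getProcessCpuLoad is only available on the com.sun.management extension
        if (osBean instanceof com.sun.management.OperatingSystemMXBean) {
            return ((com.sun.management.OperatingSystemMXBean) osBean).getProcessCpuLoad() * 100;
        }
        return Double.NaN;
    }
}
```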
5.3.2 Integration with Grafana

```json
{
  "dashboard": {
    "title": "Arthas CPU Monitoring",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "arthas_cpu_usage",
            "legendFormat": "CPU usage"
          }
        ]
      },
      {
        "title": "Thread Count by State",
        "type": "graph",
        "targets": [
          {
            "expr": "arthas_threads_state",
            "legendFormat": "{{state}}"
          }
        ]
      }
    ]
  }
}
```
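6. Arthas in the Enterprise

6.1 Production Deployment

6.1.1 Security Configuration

The listing shows the controls to put in place: authentication, IP restriction, command restriction, session timeout and auditing. The key names follow this article's layout; check them against the arthas.properties reference for your version, which provides arthas.username and arthas.password for authentication.

```properties
# Authentication (replace the sample password with a strong secret)
arthas.security.enabled=true
arthas.security.username=admin
arthas.security.password=password123

# IP allowlist
arthas.security.ip.whitelist=192.168.1.0/24,10.0.0.0/8

# Command restrictions
arthas.security.command.allow=thread,monitor,watch
arthas.security.command.deny=ognl,sc,sm

# Session timeout in milliseconds
arthas.security.session.timeout=3600000

# Audit log
arthas.security.audit.enabled=true
arthas.security.audit.log.file=/var/log/arthas/audit.log
```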
6.1.2 High-Availability Deployment

```yaml
# Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: arthas-diagnostic
spec:
  replicas: 3
  selector:
    matchLabels:
      app: arthas-diagnostic
  template:
    metadata:
      labels:
        app: arthas-diagnostic
    spec:
      containers:
        - name: arthas
          image: arthas:latest
          ports:
            - containerPort: 8080
          env:
            - name: ARTHAS_HOME
              value: /opt/arthas
            - name: JAVA_OPTS
              value: "-Xms512m -Xmx1g"
          volumeMounts:
            - name: arthas-config
              mountPath: /opt/arthas/conf
            - name: arthas-logs
              mountPath: /var/log/arthas
      volumes:
        - name: arthas-config
          configMap:
            name: arthas-config
        - name: arthas-logs
          persistentVolumeClaim:
            claimName: arthas-logs-pvc
```
6.2 Monitoring and Alerting Integration

6.2.1 Alert Rule Configuration

```yaml
# Prometheus alert rules
groups:
  - name: arthas_alerts
    rules:
      - alert: HighCPUUsage
        expr: arthas_cpu_usage > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage too high"
          description: "CPU usage is above 80%, current value: {{ $value }}%"

      # java.lang.Thread.State has no DEADLOCK value, so this rule only fires if the exporter
      # publishes a custom state="DEADLOCK" series, for example from ThreadMXBean.findDeadlockedThreads()
      - alert: ThreadDeadlock
        expr: arthas_threads_state{state="DEADLOCK"} > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Thread deadlock detected"
          description: "{{ $value }} deadlocked threads detected"

      - alert: HighThreadCount
        expr: sum(arthas_threads_state) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Too many threads"
          description: "Thread count is above 1000, current value: {{ $value }}"
```
6.2.2 Automatic Diagnosis Script

```bash
#!/bin/bash
# Automatic diagnosis: when CPU usage is high, collect thread and profiler data with Arthas

ARTHAS_HOME="/opt/arthas"
LOG_FILE="/var/log/arthas/auto_diagnose.log"
TARGET_PID=""

# Append a timestamped message to the log
log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
}

# Return 1 when system CPU usage is above 80%, 0 otherwise
check_cpu_usage() {
    local cpu_usage
    cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
    if (( $(echo "$cpu_usage > 80" | bc -l) )); then
        log "CPU usage too high: $cpu_usage%"
        return 1
    fi
    return 0
}

# Run one Arthas command against the target JVM and log its output
run_arthas() {
    java -jar "$ARTHAS_HOME/arthas-boot.jar" -c "$1" "$TARGET_PID" >> "$LOG_FILE" 2>&1
}

# Collect thread information and a 30-second CPU profile
start_arthas_diagnosis() {
    TARGET_PID=$1
    log "Starting Arthas diagnosis for PID $TARGET_PID"

    run_arthas "thread -n 3"
    run_arthas "thread -b"
    run_arthas "profiler start"
    sleep 30
    run_arthas "profiler stop --format html --file /tmp/cpu_profile.html"

    log "Arthas diagnosis finished"
}

# Shut down the Arthas server that stays attached to the target JVM
cleanup() {
    if [ -n "$TARGET_PID" ]; then
        run_arthas "stop"
        TARGET_PID=""
    fi
}

main() {
    log "Automatic diagnosis started"

    if ! check_cpu_usage; then
        # Find the Java process (replace your-app with the application name)
        local java_pid
        java_pid=$(pgrep -f "java.*your-app" | head -n 1)
        if [ -n "$java_pid" ]; then
            start_arthas_diagnosis "$java_pid"
        else
            log "Java process not found"
        fi
    else
        log "CPU usage is normal"
    fi

    cleanup
    log "Automatic diagnosis ended"
}

# Clean up even if the script is interrupted
trap cleanup EXIT
main
```
6.3 Best Practices Summary

6.3.1 Usage Recommendations
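6.3.2 Incident Handling Workflow
graph TD
A[Performance problem detected] --> B[Attach Arthas]
B --> C[Quick diagnosis]
C --> D[Locate the problem]
D --> E[Analyze root cause]
E --> F[Plan the fix]
F --> G[Apply the optimization]
G --> H[Verify the result]
H --> I[Problem resolved]
I --> J[Review lessons learned]
J --> K[Record the knowledge]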
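7. Summary

Arthas is a powerful tool for diagnosing Java applications. Once attached, it can point to the thread behind a CPU spike within seconds, which greatly shortens troubleshooting. Studying its commands systematically and combining them with enterprise best practices makes it possible to build a complete performance-diagnosis practice and keep systems running stably.

7.1 Key Points
Quick connection: know how to install Arthas and attach to a process quickly
Core commands: be fluent with thread, profiler, monitor, watch and the other core commands
Performance analysis: learn to use flame graphs, call-path tracing and other advanced features
Enterprise use: know how to secure Arthas in production and integrate it with monitoring
Best practices: establish a complete fault diagnosis and handling workflow

7.2 Best Practices
3-second localization: use thread -n to find the busiest threads, and thread -b to find a thread that is blocking others
In-depth analysis: use the profiler command to generate flame graphs
Real-time monitoring: use monitor and watch to observe methods while the application runs
Batch operations: use scripts for batch diagnosis and automated handling
Knowledge building: maintain a diagnosis knowledge base and accumulate incident-handling experience

With Arthas, CPU spikes can be located and resolved quickly, which improves system performance and stability and gives the business a dependable foundation.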