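A Complete Guide to Troubleshooting Kubernetes Core Components

Fault Classification and Diagnostic Workflow

Fault taxonomy

```mermaid
graph TB
    subgraph "Kubernetes fault categories"
        A[API Server issues] --> B[etcd issues]
        B --> C[Container Runtime issues]
        C --> D[Network issues]
        D --> E[Storage issues]
        E --> F[Configuration issues]
        F --> A
    end

    subgraph "Diagnostic workflow"
        G[Collect information] --> H[Analyze logs]
        H --> I[Check status]
        I --> J[Locate the root cause]
        J --> K[Plan the fix]
        K --> L[Apply the fix]
        L --> M[Verify the result]
    end
```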
General Diagnostic Toolkit

Comprehensive diagnostic script

```bash
#!/bin/bash
# Kubernetes cluster diagnostic tool

echo "🩺 Kubernetes Diagnostic Tool"
echo "=========================="

# Color definitions
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # reset

# Run a check command quietly and print a pass/fail mark.
check_status() {
    local service=$1
    local cmd=$2

    echo -n "Checking $service: "
    if eval "$cmd" >/dev/null 2>&1; then
        echo -e "${GREEN}✓${NC}"
        return 0
    else
        echo -e "${RED}✗${NC}"
        return 1
    fi
}

# Print a section title followed by the command's full output.
show_details() {
    local title=$1
    local cmd=$2

    echo -e "\n${BLUE}=== $title ===${NC}"
    eval "$cmd"
}

echo "1. Basic connectivity tests"
echo "---------------"
check_status "kubectl connectivity" "kubectl cluster-info"
check_status "API Server health" "kubectl get --raw='/healthz'"
# Pass only when no node reports NotReady. A plain `grep -v NotReady`
# always succeeds because the header row never contains that string.
check_status "Node status" "! kubectl get nodes --no-headers | grep -q NotReady"
# Pass only when every kube-system Pod is Running, Completed, or Succeeded.
check_status "System Pod status" "! kubectl get pods -n kube-system --no-headers | grep -Evq 'Running|Completed|Succeeded'"

echo -e "\n2. Component details"
echo "---------------"
show_details "API Server status" "kubectl get pods -n kube-system -l component=kube-apiserver"
show_details "etcd status" "kubectl get pods -n kube-system -l component=etcd"
show_details "Controller Manager status" "kubectl get pods -n kube-system -l component=kube-controller-manager"
show_details "Scheduler status" "kubectl get pods -n kube-system -l component=kube-scheduler"

echo -e "\n3. Resource usage"
echo "---------------"
# kubectl top requires metrics-server to be installed in the cluster.
show_details "Node resource usage" "kubectl top nodes"
show_details "Pod resource usage" "kubectl top pods -A | head -20"

echo -e "\n4. Events"
echo "-----------"
# -A lists events from all namespaces, not only the current one.
show_details "Recent events" "kubectl get events -A --sort-by=.metadata.creationTimestamp | tail -20"

echo -e "\nDiagnostics complete!"
```
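One pitfall in status checks of this kind is worth spelling out: a test such as `kubectl get nodes | grep -v NotReady` succeeds whenever any line lacks the string `NotReady`, and the header row always does. A minimal sketch of a stricter check follows. The helper name `count_not_ready` is hypothetical, and the sketch assumes its input is the output of `kubectl get nodes --no-headers`, with the node status in the second column.

```bash
#!/bin/bash
# Count nodes whose STATUS is not "Ready" (hypothetical helper, illustration only).
# Input: output of `kubectl get nodes --no-headers` on stdin.
# STATUS can be compound, e.g. "Ready,SchedulingDisabled" for a cordoned node,
# so only the first comma-separated element is compared.
count_not_ready() {
    awk '{ split($2, s, ","); if (s[1] != "Ready") n++ } END { print n + 0 }'
}

# Typical use on a live cluster (requires kubectl access):
#   kubectl get nodes --no-headers | count_not_ready
```

Because the function only reads stdin, it can be exercised against saved command output without touching a cluster.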
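API Server Troubleshooting

Diagnosing common problems

1. API Server fails to start

Identifying the symptoms:

```bash
# Check the API Server Pod (works only while the API Server still answers)
kubectl get pods -n kube-system -l component=kube-apiserver

# Check the kubelet, which runs the API Server as a static Pod
systemctl status kubelet
journalctl -u kubelet -f

# Read the API Server logs
kubectl logs -n kube-system <apiserver-pod-name>

# If the API Server is down, kubectl cannot reach it; use crictl on the control-plane node
sudo crictl ps -a --name kube-apiserver
sudo crictl logs <container-id>
```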
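Common causes and solutions:

| Problem | Symptom | Solution |
| --- | --- | --- |
| Certificate problem | TLS handshake failure | Check the certificate validity period and file permissions |
| etcd connection failure | connection refused | Verify the etcd service status |
| Port conflict | address already in use | Check which process occupies the port |
| Configuration error | invalid configuration | Validate the configuration file syntax |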
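The symptom strings listed above can be matched mechanically when scanning logs. The sketch below is illustrative only: `classify_apiserver_error` is a hypothetical helper, the category names are invented for this example, and a real log line may match a pattern for unrelated reasons (for instance, `connection refused` from a dependency other than etcd).

```bash
#!/bin/bash
# Map one log line to a coarse problem category (hypothetical helper).
# The patterns are the symptom strings from the list above; the category
# names on the right are assumptions made for this sketch.
classify_apiserver_error() {
    case "$1" in
        *"TLS handshake"*)           echo "certificate" ;;
        *"connection refused"*)      echo "etcd-connection" ;;
        *"address already in use"*)  echo "port-conflict" ;;
        *"invalid configuration"*)   echo "configuration" ;;
        *)                           echo "unknown" ;;
    esac
}

# Example: classify every line of a saved log file.
#   while IFS= read -r line; do classify_apiserver_error "$line"; done < apiserver.log
```

A first-match `case` statement keeps the rule order explicit, so a line containing several symptoms is assigned to the earliest category.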
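Detailed troubleshooting steps:

```bash
# 1. Inspect the certificate (validity period, SANs) and verify it against the CA
sudo openssl x509 -in /etc/kubernetes/pki/apiserver.crt -text -noout
sudo openssl verify -CAfile /etc/kubernetes/pki/ca.crt /etc/kubernetes/pki/apiserver.crt

# 2. Check the etcd connection
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health

# 3. Check whether the API Server port (6443 by default) is already in use
sudo ss -lntp | grep 6443

# 4. Review the configuration
# kubeadm cluster configuration (needs a reachable API Server; this replaces
# `kubeadm config view`, which newer kubeadm releases no longer provide)
kubectl -n kube-system get configmap kubeadm-config -o yaml
# kube-apiserver has no dry-run mode, so inspect the static Pod manifest
# and look for flag errors in the container logs
sudo cat /etc/kubernetes/manifests/kube-apiserver.yaml
```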
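2. API Server responds slowly

Performance diagnostics:

```bash
# API Server request latency and request counts
kubectl get --raw /metrics | grep -E "(apiserver_request_duration|apiserver_request_total)"

# Client-side request latency and request counts
kubectl get --raw /metrics | grep -E "(rest_client_request_duration|rest_client_requests_total)"

# Latency of the API Server's requests to etcd
kubectl get --raw /metrics | grep -E "etcd_request_duration"
```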
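The raw metrics output is long, so a single summary number can be a useful first signal. A Prometheus histogram exposes `_sum` and `_count` series, and their ratio is the mean observation. The sketch below computes that mean for `apiserver_request_duration_seconds` across all label sets; `avg_request_latency` is a hypothetical helper, and it assumes each sample line ends with the value (no trailing timestamp).

```bash
#!/bin/bash
# Mean request latency in seconds from histogram _sum and _count series.
# Input: Prometheus text exposition format on stdin.
# Assumption: the sample value is the last whitespace-separated field.
avg_request_latency() {
    awk '
        /^apiserver_request_duration_seconds_sum/   { sum += $NF }
        /^apiserver_request_duration_seconds_count/ { cnt += $NF }
        END {
            if (cnt > 0) printf "%.4f\n", sum / cnt
            else         print "n/a"
        }'
}

# Typical use on a live cluster (requires kubectl access):
#   kubectl get --raw /metrics | avg_request_latency
```

A mean hides tail latency, so this is a coarse health signal rather than a substitute for the quantiles that the `_bucket` series allow.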
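Tuning suggestions:

```yaml
# Fragment of the API Server static Pod manifest
# (/etc/kubernetes/manifests/kube-apiserver.yaml); only tuning flags are shown.
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    # Upper bounds on concurrent read and write requests
    - --max-requests-inflight=3000
    - --max-mutating-requests-inflight=1000
    - --request-timeout=300s
    # Format is resource[.group]#size; a bare number is not accepted.
    # The resource and size here are illustrative values.
    - --watch-cache-sizes=pods#1000
```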
etcd Troubleshooting

Diagnostic script walkthrough

```bash
#!/bin/bash
# etcd troubleshooting script

ETCD_ENDPOINTS="https://127.0.0.1:2379"
ETCD_CACERT="/etc/kubernetes/pki/etcd/ca.crt"
ETCD_CERT="/etc/kubernetes/pki/etcd/server.crt"
ETCD_KEY="/etc/kubernetes/pki/etcd/server.key"

# Run etcdctl (v3 API) with the endpoint and TLS flags shared by every check.
etcdctl_tls() {
    ETCDCTL_API=3 etcdctl "$@" \
        --endpoints="$ETCD_ENDPOINTS" \
        --cacert="$ETCD_CACERT" \
        --cert="$ETCD_CERT" \
        --key="$ETCD_KEY"
}

echo "🔧 etcd Troubleshooting Tool"
echo "=================="

echo "1. etcd health"
echo "---------------"
etcdctl_tls endpoint health

echo -e "\n2. Cluster members"
echo "---------------"
etcdctl_tls member list -w table

echo -e "\n3. Performance check"
echo "----------"
# check perf writes test data and puts load on the cluster; avoid it at peak times.
etcdctl_tls check perf

echo -e "\n4. Database status"
echo "----------"
etcdctl_tls endpoint status -w table

echo -e "\n5. Alarms"
echo "----------"
etcdctl_tls alarm list

echo -e "\netcd diagnostics complete!"
```
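Common etcd problems

1. Insufficient storage space

Symptoms:

```bash
# List active alarms (a NOSPACE alarm means the storage quota is exhausted)
ETCDCTL_API=3 etcdctl alarm list

# Check the database size
ETCDCTL_API=3 etcdctl endpoint status -w table
```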
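When the quota is exhausted, `etcdctl alarm list` prints a line of the form `memberID:<id> alarm:NOSPACE`. A small sketch of detecting that condition in a script follows; `has_nospace_alarm` is a hypothetical helper that reads the alarm listing from stdin.

```bash
#!/bin/bash
# Succeed when the alarm listing on stdin contains a NOSPACE alarm.
# Hypothetical helper; expects the output of `etcdctl alarm list`.
has_nospace_alarm() {
    grep -q 'alarm:NOSPACE'
}

# Typical use (add --endpoints and TLS flags as your cluster requires):
#   if ETCDCTL_API=3 etcdctl alarm list | has_nospace_alarm; then
#       echo "etcd is out of space: compact, defragment, then disarm"
#   fi
```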
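Solution:

```bash
# TLS and endpoint flags are omitted for brevity; add them as your cluster requires.

# 1. Get the current revision
rev=$(ETCDCTL_API=3 etcdctl endpoint status --write-out="json" | \
  jq '.[0].Status.header.revision')

# 2. Compact away history older than that revision
ETCDCTL_API=3 etcdctl compact "$rev"

# 3. Defragment to release the freed space
ETCDCTL_API=3 etcdctl defrag --cluster

# 4. Clear the alarm
ETCDCTL_API=3 etcdctl alarm disarm

# 5. Raise the storage quota (etcd startup flag; 8589934592 bytes = 8 GiB)
etcd --quota-backend-bytes=8589934592
```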
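Compaction is easier to schedule before the alarm fires. The sketch below turns a database size and a quota into a usage percentage and a yes/no answer. Both function names and the 80% default threshold are assumptions made for this example; the 2 GiB fallback is etcd's default backend quota.

```bash
#!/bin/bash
# Percentage of the backend quota currently used (integer, rounded down).
#   $1: database size in bytes (the dbSize value from `endpoint status`)
#   $2: quota in bytes; defaults to 2147483648 (2 GiB, etcd's default quota)
db_usage_percent() {
    quota=${2:-2147483648}
    echo $(( $1 * 100 / quota ))
}

# Succeed when usage is at or above the threshold.
#   $3: threshold in percent; the default of 80 is an arbitrary example value
needs_compaction() {
    [ "$(db_usage_percent "$1" "${2:-}")" -ge "${3:-80}" ]
}

# Example:
#   needs_compaction 1932735283 && echo "etcd database is nearly full"
```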
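2. Cluster split-brain

Diagnostic script:

```bash
#!/bin/bash
# etcd split-brain check

echo "Checking the etcd cluster for split-brain..."

# TLS material; adjust the paths to match your cluster
ETCD_CACERT="/etc/kubernetes/pki/etcd/ca.crt"
ETCD_CERT="/etc/kubernetes/pki/etcd/server.crt"
ETCD_KEY="/etc/kubernetes/pki/etcd/server.key"

ENDPOINTS=(
    "https://10.0.0.1:2379"
    "https://10.0.0.2:2379"
    "https://10.0.0.3:2379"
)

for endpoint in "${ENDPOINTS[@]}"; do
    echo "Checking member: $endpoint"

    # Query each member separately so that every member reports its own view
    if status=$(ETCDCTL_API=3 etcdctl endpoint status \
        --endpoints="$endpoint" \
        --cacert="$ETCD_CACERT" \
        --cert="$ETCD_CERT" \
        --key="$ETCD_KEY" \
        --write-out=json 2>/dev/null); then
        leader=$(echo "$status" | jq -r '.[0].Status.leader')
        # raftTerm is the Raft term; header.revision is the key-value revision
        term=$(echo "$status" | jq -r '.[0].Status.raftTerm')
        echo "  Leader: $leader, Term: $term"
    else
        echo "  Member unreachable"
    fi
done
```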
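Reading the per-member output by eye works for three members but not much beyond that. The sketch below reduces the same information to a single verdict: whether every reachable member names the same leader. `check_leader_agreement` is a hypothetical helper, and its input format (`endpoint leader_id raft_term` per line) is an assumption of this example rather than anything etcdctl prints directly.

```bash
#!/bin/bash
# Report whether all members agree on one leader (hypothetical helper).
# Input on stdin: one line per reachable member, "endpoint leader_id raft_term".
# Output: "consistent", "disagreement", or "no-data".
check_leader_agreement() {
    awk '
        { seen[$2] = 1 }
        END {
            n = 0
            for (id in seen) n++
            if (n == 0)      print "no-data"
            else if (n == 1) print "consistent"
            else             print "disagreement"
        }'
}

# Example with collected values:
#   printf '%s\n' "m1 8e9e05c52164694d 7" "m2 8e9e05c52164694d 7" | check_leader_agreement
```

A brief disagreement during a leader election is normal, so a verdict of `disagreement` is only meaningful if it persists across repeated runs.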
Container Runtime Troubleshooting

containerd diagnostics

```bash
#!/bin/bash
# Container Runtime troubleshooting script (run on the affected node)

echo "🐳 Container Runtime Diagnostic Tool"
echo "============================"

echo "1. containerd service status"
echo "-------------------"
systemctl is-active containerd
systemctl status containerd --no-pager -l

echo -e "\n2. CRI plugin status"
echo "-------------"
crictl info

echo -e "\n3. Containers"
echo "-------------"
crictl ps -a | head -10

echo -e "\n4. Images"
echo "----------"
crictl images | head -10

echo -e "\n5. Network plugin status"
echo "--------------"
ls -la /etc/cni/net.d/
cat /etc/cni/net.d/* | head -20

echo -e "\ncontainerd diagnostics complete!"
```
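Troubleshooting container start failures

Diagnostic flow for common problems:

```mermaid
flowchart TD
    A[Container fails to start] --> B{Check image}
    B -->|Image problem| C[Image pull failed]
    B -->|Image OK| D{Check resources}
    D -->|Insufficient resources| E[CPU/memory limits]
    D -->|Resources sufficient| F{Check configuration}
    F -->|Configuration error| G[Environment variables/mounts]
    F -->|Configuration OK| H{Check network}
    H -->|Network problem| I[CNI configuration]
    H -->|Network OK| J[Runtime problem]
```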
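The status or event reason that Kubernetes reports for a Pod often points to a branch of this flow directly. The mapping below is a rough sketch, not an exhaustive or authoritative one: `suggest_check` is a hypothetical helper, the branch names are invented for this example, and assigning each reason to a branch is a judgment call.

```bash
#!/bin/bash
# Suggest which branch of the diagnostic flow to examine first, given a
# reason string from `kubectl get pods` or `kubectl describe pod`.
# Hypothetical helper; the reason-to-branch mapping is an assumption.
suggest_check() {
    case "$1" in
        ErrImagePull|ImagePullBackOff|InvalidImageName) echo "image" ;;
        OOMKilled)                                      echo "resources" ;;
        CreateContainerConfigError)                     echo "configuration" ;;
        FailedCreatePodSandBox)                         echo "network" ;;
        *)                                              echo "runtime" ;;
    esac
}

# Example:
#   suggest_check ImagePullBackOff
```

Reasons that match no rule fall through to the runtime branch, which mirrors the flowchart's final step.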
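Detailed troubleshooting commands:

```bash
# 1. Inspect Pod status and events
kubectl describe pod <pod-name>

# 2. Read the container logs
kubectl logs <pod-name> -c <container-name>

# 3. Inspect the image (run on the node)
crictl inspecti <image-id>

# 4. Inspect the container configuration (run on the node)
crictl inspect <container-id>

# 5. Open a shell in the container (the image must include /bin/sh)
kubectl exec -it <pod-name> -- /bin/sh

# 6. Check node resources
kubectl top node
kubectl describe node <node-name>
```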
Network Troubleshooting

Pod network diagnostics

```bash
#!/bin/bash
# Pod network diagnostic script
# Arguments: <pod-name> [namespace]

POD_NAME=${1:?usage: $0 <pod-name> [namespace]}
NAMESPACE=${2:-default}

echo "🌐 Pod network diagnostics: $POD_NAME"
echo "========================="

echo "1. Pod network information"
echo "-------------"
kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o wide

echo -e "\n2. Services in the namespace"
echo "-------------"
kubectl get svc -n "$NAMESPACE"

echo -e "\n3. DNS resolution test"
echo "-------------"
# Requires nslookup in the container image
kubectl exec "$POD_NAME" -n "$NAMESPACE" -- nslookup kubernetes.default.svc.cluster.local

echo -e "\n4. Network connectivity"
echo "-------------"
# Requires ping in the container image and outbound ICMP to be allowed
kubectl exec "$POD_NAME" -n "$NAMESPACE" -- ping -c 3 8.8.8.8

echo -e "\n5. CNI plugin status"
echo "-------------"
# These paths are local, so run this part on the node that hosts the Pod
ls -la /etc/cni/net.d/
cat /etc/cni/net.d/10-*.conf
```
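The final step of the script lists the CNI directory but leaves interpretation to the reader. A node with no CNI configuration file cannot set up Pod networking, so that condition is worth testing explicitly. In the sketch below, `has_cni_config` is a hypothetical helper; it assumes the runtime loads files ending in `.conf`, `.conflist`, or `.json` from the directory.

```bash
#!/bin/bash
# Succeed when the directory holds at least one CNI configuration file.
#   $1: directory to inspect; defaults to /etc/cni/net.d
# Hypothetical helper; the extension list is an assumption of this sketch.
has_cni_config() {
    dir=${1:-/etc/cni/net.d}
    for f in "$dir"/*.conf "$dir"/*.conflist "$dir"/*.json; do
        # An unmatched glob stays literal and fails the -f test.
        if [ -f "$f" ]; then
            return 0
        fi
    done
    return 1
}

# Example (run on the node):
#   has_cni_config || echo "No CNI configuration found in /etc/cni/net.d"
```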
Monitoring and Alerting

Prometheus monitoring configuration

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s

    rule_files:
      - "/etc/prometheus/rules/*.yml"

    scrape_configs:
    # API Server. The https scheme and service account credentials assume
    # that Prometheus runs inside the cluster.
    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

    # etcd
    - job_name: 'kubernetes-etcd'
      static_configs:
      - targets: ['10.0.0.1:2379', '10.0.0.2:2379', '10.0.0.3:2379']
      scheme: https
      tls_config:
        ca_file: /etc/ssl/etcd/ca.pem
        cert_file: /etc/ssl/etcd/client.pem
        key_file: /etc/ssl/etcd/client-key.pem

    # kubelet (served over https, same in-cluster credentials as above)
    - job_name: 'kubernetes-kubelet'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
```
Alerting rules

```yaml
groups:
- name: kubernetes.alerts
  rules:
  # API Server alerts
  - alert: KubernetesAPIServerDown
    expr: up{job="kubernetes-apiservers"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Kubernetes API Server is down"
      description: "Kubernetes API Server has been down for more than 5 minutes"

  # histogram_quantile needs per-bucket rates aggregated by le;
  # applying it to the raw cumulative counters gives meaningless results.
  - alert: KubernetesAPIServerHighRequestLatency
    expr: histogram_quantile(0.99, sum by (le) (rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m]))) > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High API Server request latency"

  # etcd alerts
  # This expression fires once per unreachable member, not only when the whole cluster is down.
  - alert: EtcdClusterDown
    expr: up{job="kubernetes-etcd"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "etcd member is down"

  - alert: EtcdHighNumberOfLeaderChanges
    expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High number of leader changes"

  # Node alerts (kube_node_status_condition is exported by kube-state-metrics)
  - alert: KubernetesNodeNotReady
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Kubernetes node not ready"
```
Preventive Maintenance

Health check script

```bash
#!/bin/bash
# Kubernetes cluster health check script

LOG_FILE="/var/log/kubernetes-health-check.log"

# Append a timestamped message to the log file.
log_message() {
    echo "$(date): $1" >> "$LOG_FILE"
}

check_cluster_health() {
    log_message "Starting cluster health check..."

    # API Server
    if kubectl cluster-info >/dev/null 2>&1; then
        log_message "✓ API Server healthy"
    else
        log_message "✗ API Server unhealthy"
        return 1
    fi

    # Nodes
    not_ready_nodes=$(kubectl get nodes --no-headers | grep -c NotReady)
    if [ "$not_ready_nodes" -eq 0 ]; then
        log_message "✓ All nodes ready"
    else
        log_message "✗ $not_ready_nodes node(s) not ready"
    fi

    # System Pods
    failed_pods=$(kubectl get pods -n kube-system --no-headers | grep -cE "(Error|CrashLoopBackOff|Failed)")
    if [ "$failed_pods" -eq 0 ]; then
        log_message "✓ System Pods healthy"
    else
        log_message "✗ $failed_pods system Pod(s) unhealthy"
    fi
}

check_performance() {
    log_message "Starting performance check..."

    # API Server latency: log the first sample line, skipping "#" comment lines
    latency=$(kubectl get --raw /metrics 2>/dev/null | \
        grep '^apiserver_request_duration_seconds' | head -1)
    log_message "API Server latency metric: $latency"

    # etcd performance (add --endpoints and TLS flags as needed;
    # check perf generates load on the cluster)
    etcd_pass=$(ETCDCTL_API=3 etcdctl check perf 2>/dev/null | grep -c "PASS")
    if [ "$etcd_pass" -gt 0 ]; then
        log_message "✓ etcd performance normal"
    else
        log_message "✗ etcd performance abnormal"
    fi
}

check_cluster_health
check_performance
log_message "Health check complete"
```
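The health check writes its findings to a log file, which is only useful if something reads it. Since every failed check is logged with a `✗` marker, a scheduler or monitoring job can reduce the log to an exit code. The sketch below does that; `health_failures` and `report_health` are hypothetical helper names, and the sketch assumes the log holds the output of a single run.

```bash
#!/bin/bash
# Count the failure markers ("✗") in a health-check log.
#   $1: path to the log file; a missing file counts as zero failures
health_failures() {
    n=$(grep -c '✗' "$1" 2>/dev/null || true)
    echo "${n:-0}"
}

# Print a one-line verdict and return non-zero when any check failed.
report_health() {
    n=$(health_failures "$1")
    if [ "$n" -gt 0 ]; then
        echo "FAIL ($n)"
        return 1
    fi
    echo "OK"
}

# Example:
#   report_health /var/log/kubernetes-health-check.log || echo "send an alert here"
```

Because the script appends to its log, rotating or truncating the file between runs keeps old failures from being counted again.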
Automatic repair script

```bash
#!/bin/bash
# Kubernetes automatic repair script
# Caution: deleting a Pod only helps when a controller recreates it and the
# fault is transient. Review what will be deleted before running unattended.

# Restart failed Pods by deleting them
restart_failed_pods() {
    kubectl get pods -A --no-headers | grep -E "(Error|CrashLoopBackOff)" |
    while read -r namespace pod _; do
        echo "Restarting failed Pod: $namespace/$pod"
        kubectl delete pod "$pod" -n "$namespace"
    done
}

# Clean up evicted Pods, one delete per Pod so that each gets its own namespace
cleanup_evicted_pods() {
    echo "Cleaning up evicted Pods..."
    kubectl get pods -A --no-headers | grep Evicted |
    while read -r namespace pod _; do
        kubectl delete pod "$pod" -n "$namespace"
    done
}

# Compact etcd history (add --endpoints and TLS flags as your cluster requires)
compact_etcd() {
    echo "Compacting etcd data..."
    rev=$(ETCDCTL_API=3 etcdctl endpoint status --write-out="json" | \
        jq '.[0].Status.header.revision')

    # Skip when the revision could not be read or is still small
    if [[ $rev =~ ^[0-9]+$ ]] && [ "$rev" -gt 100000 ]; then
        ETCDCTL_API=3 etcdctl compact "$rev"
        echo "etcd compaction complete, compacted to revision: $rev"
    fi
}

# Run the repairs
restart_failed_pods
cleanup_evicted_pods
compact_etcd

echo "Automatic repair complete"
```
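A script that deletes Pods and compacts etcd is risky to run blind, so a dry-run switch is a common safeguard. A minimal sketch follows: `run` is a hypothetical wrapper and `DRY_RUN` is an environment variable introduced for this example, neither of which is part of the script above.

```bash
#!/bin/bash
# Execute a command, or only print it when DRY_RUN=1.
# Hypothetical wrapper; prefix each destructive command with `run`.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "[dry-run] $*"
    else
        "$@"
    fi
}

# Example: preview a deletion without performing it.
#   DRY_RUN=1
#   run kubectl delete pod web-0 -n default
```

Passing the command as separate arguments and executing it with `"$@"` preserves quoting, which a wrapper built on `eval` would not.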
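This guide covers Kubernetes troubleshooting end to end, from diagnosis through repair. Keep these scripts in your operations toolkit and run the health checks on a regular schedule.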