Ubuntu Kubernetes Troubleshooting Tips - Q&A

Kubernetes Troubleshooting Tips on Ubuntu

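1. General troubleshooting basics

Before troubleshooting Kubernetes on an Ubuntu node, get familiar with the following basic commands and the reasoning behind them: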

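Check node status: kubectl get nodes confirms whether each node is Ready (if a node is NotReady, investigate the services on that node);
Check Pod status: kubectl get pods -A shows whether Pods in all namespaces are healthy (Running/Pending/Failed);
Get details: kubectl describe pod <pod-name> shows the Pod's events (such as image pull failures or insufficient resources), container states, and restart counts (add -n <namespace> for Pods outside the current namespace);
View container logs: kubectl logs <pod-name> -c <container-name> (-c is needed only when the Pod has more than one container) helps locate application errors;
View system logs: the kubelet log is the key log on an Ubuntu node; journalctl -u kubelet -f follows it in real time (container runtime errors, node resource problems, and so on);
Check disk space: df -h confirms the node's disk is not full (the kubelet or the container runtime may stop working when disk space runs out);
Check network connectivity: ping, traceroute, or kubectl exec -it <pod-name> -- curl <service-ip> verifies communication between the node and other cluster components (such as the API server and Pods). A Service ClusterIP is virtual and often does not answer ping, so test it with curl against the Service port.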
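
As a small illustration of the first check, the sketch below filters the output of kubectl get nodes down to the nodes that need attention. The function name not_ready_nodes is made up for this example, and it assumes the default column layout (NAME STATUS ROLES AGE VERSION) printed with --no-headers.

```shell
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Input: the output of `kubectl get nodes --no-headers` on stdin.
# STATUS is the second column. Cordoned nodes appear as
# "Ready,SchedulingDisabled" and are therefore listed as well.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Typical use on a machine with cluster access:
#   kubectl get nodes --no-headers | not_ready_nodes
```

Each name it prints is a candidate for journalctl -u kubelet and df -h on that node.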

2. Ubuntu node-specific problems

Ubuntu is the node operating system, and problems in its own configuration can cause cluster failures:

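Kubelet configuration: on kubeadm-based clusters the kubelet's main configuration is usually /var/lib/kubelet/config.yaml, while /etc/kubernetes/kubelet.conf is the kubeconfig the kubelet uses to authenticate to the API server. In config.yaml, confirm clusterDNS (for example 10.96.0.10; it must match the ClusterIP of the kube-dns Service that fronts CoreDNS), clusterDomain (for example cluster.local), and containerRuntimeEndpoint (for containerd, unix:///run/containerd/containerd.sock; older releases set this through the kubelet flag --container-runtime-endpoint instead);
Disk cleanup: if df -h shows / or /var running low (for example above 80% used), remove old logs (/var/log), unused images (docker system prune on Docker hosts, crictl rmi --prune on containerd hosts), or temporary files;
Kubelet service repair: if the kubelet is not running, check it with systemctl status kubelet and restart it with systemctl restart kubelet. If it keeps crashing, look in the logs for errors such as OOMKilled (out of memory) or failed to start container runtime (a container runtime problem);
Container runtime configuration: Ubuntu nodes commonly use containerd. In /etc/containerd/config.toml, confirm sandbox_image (for example registry.aliyuncs.com/google_containers/pause:3.9, a mirror that avoids pull failures where overseas registries are unreachable) and SystemdCgroup (set to true so that containerd uses the systemd cgroup driver, matching Ubuntu's systemd). If the file does not exist yet, generate it first with containerd config default > /etc/containerd/config.toml and then edit it; running that command after editing overwrites your changes. Restart containerd afterwards (systemctl restart containerd).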
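
The order of the containerd steps matters, so here is a minimal sketch of the edit. The helper name enable_systemd_cgroup is an assumption for illustration, and it relies on GNU sed (as shipped on Ubuntu) and on the file containing the line SystemdCgroup = false, which is what the generated default config contains in the containerd versions I have seen.

```shell
# Switch containerd's runc runtime to the systemd cgroup driver.
# $1: path to a containerd config.toml (defaults to /etc/containerd/config.toml).
enable_systemd_cgroup() {
  config="${1:-/etc/containerd/config.toml}"
  # Rewrite "SystemdCgroup = false" to "SystemdCgroup = true"; indentation is kept
  # because only the matched text is replaced.
  sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' "$config"
}

# Order of operations on a node (run as root):
#   containerd config default > /etc/containerd/config.toml   # 1. generate defaults
#   enable_systemd_cgroup                                     # 2. edit the file
#   systemctl restart containerd                              # 3. restart to apply
```

Passing a path as the first argument lets you try the edit on a copy of the file before touching the live config.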

3. Common Pod failures and fixes

The Pod is the smallest schedulable unit in Kubernetes, so Pod failures directly affect the application:

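Pod status analysis:
- Pending: usually the scheduler cannot place the Pod, for example because its CPU/memory requests exceed what any node has available. A Pod whose image cannot be pulled (private registry without credentials, image does not exist) also stays in the Pending phase, and kubectl get pods then shows ErrImagePull or ImagePullBackOff;
- CrashLoopBackOff: the container crashes shortly after starting and is restarted repeatedly. Check its logs (kubectl logs <pod-name>, adding --previous to read the crashed instance) to find the application error (a code bug, a missing config file, and so on);
- Not ready (Running, but READY shows 0/1): the container is failing its readinessProbe, so the Pod is removed from Service endpoints; a failing livenessProbe instead restarts the container. Check the probe configuration (is the httpGet path correct, is initialDelaySeconds long enough).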
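Image problems:
- If kubectl describe pod shows ImagePullBackOff, confirm that the image name and tag are correct (for example nginx:1.25 rather than a bare nginx, which resolves to latest) and that the tag exists, then test a pull on the node (crictl pull <image> on containerd nodes, docker pull <image> on Docker nodes);
- For a private registry, configure imagePullSecrets on the Pod or its ServiceAccount (see the kubectl create secret docker-registry command).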
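Resource limits:
- If kubectl describe pod shows OOMKilled (the container exceeded its memory limit), raise the Pod's resources.limits.memory (for example to 512Mi). If the container is CPU-bound, adjust resources.limits.cpu (for example 500m); exceeding a CPU limit throttles the container rather than killing it;
- Set requests to realistic values to avoid fragmenting node resources, since the scheduler places Pods by their requests (for example requests.cpu: 100m, requests.memory: 128Mi).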
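Port configuration:
- If kubectl describe pod reports a port error such as Ports are not available, confirm that containerPort matches the port the application listens on (a Spring Boot application listening on 8080 should declare containerPort: 8080). containerPort is mainly informational, so the port the process actually binds is what matters;
- The Service's targetPort must match the container's port (targetPort: 8080 for a container listening on 8080).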
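
The status-to-action reasoning in this section can be condensed into a small lookup. This is only a sketch: next_step is a hypothetical helper, the mapping is my summary of the advice above rather than an official rule, and it assumes the Pod lives in the current namespace.

```shell
# Map a value from the STATUS column of `kubectl get pods` to a first command to try.
# $1: status string, $2: pod name (only used to fill in the suggested command).
next_step() {
  status="$1"
  pod="$2"
  case "$status" in
    Pending|ImagePullBackOff|ErrImagePull)
      # Scheduling and image-pull problems show up in the Pod's events.
      echo "kubectl describe pod $pod" ;;
    CrashLoopBackOff|Error)
      # The previous container instance holds the output from the crash.
      echo "kubectl logs $pod --previous" ;;
    OOMKilled)
      # Print the configured requests/limits to compare against real usage.
      echo "kubectl get pod $pod -o jsonpath='{.spec.containers[*].resources}'" ;;
    *)
      # Anything else (including Running but not ready): start from the events.
      echo "kubectl describe pod $pod" ;;
  esac
}
```

For example, next_step CrashLoopBackOff web-0 prints kubectl logs web-0 --previous, which you can then run by hand.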

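4. Troubleshooting an unreachable Service

Services are central to communication inside and outside the cluster. Common causes of an unreachable Service, and the steps to resolve them: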

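Check the Service: kubectl get svc confirms the CLUSTER-IP (None means a headless Service) and PORT(S) (for example 80:30007/TCP, where 80 is the Service port and 30007 is the NodePort);
Check Endpoints: kubectl get endpoints <service-name> should list the IPs and ports of the target Pods (an empty list means the Service's selector matches no ready Pod);
Selector matching: kubectl get pods --show-labels confirms that Pod labels match the Service's selector (for a Service with selector app: my-app, the Pods must carry the label app: my-app);
Network policies: kubectl get networkpolicy -A shows whether a NetworkPolicy restricts access (for example, cross-namespace access must be allowed with namespaceSelector: {});
CoreDNS resolution: kubectl get pods -n kube-system -l k8s-app=kube-dns confirms CoreDNS is running (no CrashLoopBackOff); test resolution with kubectl exec -it <pod-name> -- nslookup <service-name>.default.svc.cluster.local (replace default with the Service's namespace; the Pod's image must include nslookup);
kube-proxy and the network plugin: kubectl get pods -n kube-system -l k8s-app=kube-proxy confirms kube-proxy is running; check its logs (kubectl logs -n kube-system <kube-proxy-pod-name>) for errors such as failed to sync iptables. Check the network plugin as well (for Calico, kubectl get pods -n calico-system on operator-based installs, or -n kube-system on manifest-based installs) and make sure it is healthy (calico-node is Running).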
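
An empty Endpoints list most often comes down to a selector that does not match the Pod labels. The sketch below shows that comparison for equality-based selectors only; selector_matches is a name invented for this example, and it expects labels in the comma-separated form printed in the LABELS column of kubectl get pods --show-labels.

```shell
# Return 0 if every key=value pair in the selector appears among the Pod's labels.
# $1: equality-based selector, comma-separated (e.g. "app=my-app,tier=web").
# $2: Pod labels as printed in the LABELS column (e.g. "app=my-app,pod-template-hash=5d4f").
selector_matches() {
  selector="$1"
  labels=",$2,"              # pad with commas so each pair is matched as a whole
  old_ifs="$IFS"
  IFS=','
  for pair in $selector; do  # split the selector on commas
    case "$labels" in
      *",$pair,"*) ;;        # this pair is present, keep checking
      *) IFS="$old_ifs"; return 1 ;;
    esac
  done
  IFS="$old_ifs"
  return 0
}

# Example:
#   selector_matches "app=my-app" "app=my-app,pod-template-hash=5d4f" && echo match
```

Padding with commas keeps app=my from matching a Pod labeled app=my-app, which is the kind of near miss that leaves Endpoints empty.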

5. Other common failures

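Expired token: if a new node cannot join the cluster (kubeadm join reports invalid token), the bootstrap token has probably expired (the default lifetime is 24 hours). On a control-plane node, run kubeadm token create to generate a new token, compute the discovery-token-ca-cert-hash (openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'), and rerun kubeadm join with the new token and hash. Alternatively, kubeadm token create --print-join-command prints the complete join command in one step;
Volume mount failure: if the Pod reports a MountVolume failure, confirm that the PersistentVolume (PV) and PersistentVolumeClaim (PVC) are bound (kubectl get pvc shows Bound), that the StorageClass is correct, and that the storage path on the Ubuntu node (for example /mnt/data) exists and is readable and writable;
Failing health checks: if a Pod keeps restarting because of failed health checks, tune the livenessProbe/readinessProbe parameters (for example initialDelaySeconds: 30 to wait for the application to finish starting, periodSeconds: 10 to check less often).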
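
For the expired-token case, the pieces of the join command fit together as sketched below. The helper names valid_token and join_command are assumptions for illustration; the token shape (6 characters, a dot, 16 characters, lowercase letters and digits) is the kubeadm bootstrap token format, and the endpoint and hash in the example are placeholders.

```shell
# Check that a string has the kubeadm bootstrap token shape:
# 6 chars, a dot, 16 chars, all lowercase letters or digits.
valid_token() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9]{6}\.[a-z0-9]{16}$'
}

# Print (not run) the join command for a worker node.
# $1: API server endpoint (host:port)
# $2: bootstrap token from `kubeadm token create`
# $3: hex SHA-256 hash of the cluster CA public key (output of the openssl pipeline)
join_command() {
  valid_token "$2" || { echo "token does not look like a kubeadm token" >&2; return 1; }
  echo "kubeadm join $1 --token $2 --discovery-token-ca-cert-hash sha256:$3"
}

# Example with placeholder values:
#   join_command 192.0.2.10:6443 abcdef.0123456789abcdef <hash>
```

Printing the command instead of executing it makes it easy to review before running it on the new node.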
