How to monitor cluster status after installing k8s on CentOS - Q&A

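kubectl is the native Kubernetes command-line tool. It talks directly to the cluster API and returns basic cluster status quickly, which makes it well suited to routine checks and fast troubleshooting.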

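Check node status: run kubectl get nodes. The output lists each node's name, status (Ready means healthy; NotReady needs investigation), roles, age, and kubelet version, and is the first step in judging node health. CPU and memory capacity are not part of this output; use kubectl describe node <node-name> to see them.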
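Check Pod status: run kubectl get pods --all-namespaces to list the Pods in every namespace with their name, ready containers, status (Running means running, Pending means not yet scheduled or still starting, Error means a container exited abnormally), restart count, and age. Add -o wide to also show the node each Pod runs on. If a Pod is unhealthy, kubectl describe pod <pod-name> -n <namespace> shows the detailed cause (for example, a failed image pull or insufficient resources).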
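Check Deployment status: run kubectl get deployments --all-namespaces to see each Deployment's ready replicas (READY), updated replicas (UP-TO-DATE), and available replicas (AVAILABLE), and confirm that applications are deployed as expected.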
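Check cluster events: run kubectl get events --all-namespaces to see recent cluster events (such as nodes joining, Pod scheduling failures, and resource shortage warnings), which helps locate the root cause quickly.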
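Check resource usage: to see the current CPU and memory usage of nodes and Pods, first install Metrics Server (a lightweight metrics aggregator), then run kubectl top nodes or kubectl top pods. These commands return a point-in-time snapshot rather than a trend; historical data needs a system such as Prometheus.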
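
The checks above can be wrapped into a few small shell helpers for routine inspection. This is only a sketch: the function names and the 80% default threshold are made up for illustration, and it assumes kubectl is already configured for the cluster (and, for busy_nodes, that Metrics Server is installed).

```shell
#!/usr/bin/env bash
# Illustrative helpers; the function names are hypothetical, the kubectl
# subcommands and flags are the standard ones discussed above.

# Print the names of nodes whose STATUS column is not exactly "Ready".
# A cordoned node ("Ready,SchedulingDisabled") is listed as well.
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 != "Ready" {print $1}'
}

# Print "namespace/pod status" for Pods that are neither Running nor Completed.
# Columns with --all-namespaces: NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  kubectl get pods --all-namespaces --no-headers |
    awk '$4 != "Running" && $4 != "Completed" {print $1 "/" $2, $4}'
}

# Print nodes whose CPU% exceeds a threshold (default 80, an arbitrary choice).
# Columns: NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%.
busy_nodes() {
  kubectl top nodes --no-headers |
    awk -v max="${1:-80}" '{ cpu = $3; sub(/%/, "", cpu); if (cpu + 0 > max + 0) print $1, $3 }'
}
```

After sourcing the file, call not_ready_nodes, unhealthy_pods, or busy_nodes 90 from the shell. Note that a Pod in Running status can still have containers that are not ready, so the READY column is worth a look too.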

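Kubernetes Dashboard is the official web UI. It offers visual resource management and status monitoring, and suits newcomers or users who are less comfortable with the command line.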

Install Dashboard: run kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.7.0/aio/deploy/recommended.yaml to deploy the Deployment, Service, and other resources that Dashboard needs.
Access Dashboard:
- Option 1 (through the API server proxy): run kubectl proxy on your workstation, then open http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/ in a browser.
- Option 2 (port forwarding): run kubectl -n kubernetes-dashboard port-forward svc/kubernetes-dashboard 8443:443 to forward the Dashboard port to your machine, then open https://localhost:8443.
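Log in and use: create a dedicated Service Account and log in with its bearer token rather than relying on default credentials. Binding that account to cluster-admin is convenient on a test cluster, but it grants full control of the cluster; in production, bind a narrower role that covers only what the user needs to see. After logging in, use the left navigation bar to view the status of nodes, Pods, Deployments, Services, and other resources, with filtering, sorting, and detail views.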
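
As a rough sketch of the login step, the helper below creates a read-only account and prints a token to paste into the Dashboard login page. The account name dashboard-viewer and the function name are hypothetical; the built-in view ClusterRole is used as one possible least-privilege choice, and kubectl create token requires kubectl 1.24 or newer.

```shell
#!/usr/bin/env bash
# Create a read-only Dashboard login and print its bearer token on stdout.
# "dashboard-viewer" is a made-up default name; pass another as $1 if you like.
dashboard_viewer_token() {
  local ns="kubernetes-dashboard"
  local sa="${1:-dashboard-viewer}"

  # Status messages go to stderr so that stdout carries only the token.
  kubectl -n "$ns" create serviceaccount "$sa" >&2
  kubectl create clusterrolebinding "$sa" \
    --clusterrole=view --serviceaccount="$ns:$sa" >&2

  # Short-lived token for the Service Account.
  kubectl -n "$ns" create token "$sa"
}
```

The view role is read-only over namespaced resources and does not cover Secrets or cluster-scoped objects such as nodes, so some Dashboard pages will appear empty with it. Running the helper twice fails on the create steps because the objects already exist.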

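Prometheus + Grafana is one of the most widely used monitoring stacks in the Kubernetes ecosystem. It covers metric collection, storage, and visualization, supports alerting, and is suitable for production environments.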

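Install Prometheus: Helm is recommended because it simplifies configuration. Run helm repo add prometheus-community https://prometheus-community.github.io/helm-charts to add the repository, then helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace to install kube-prometheus-stack, which includes Prometheus, Alertmanager, Grafana, and related components. Installing into the monitoring namespace matches the Prometheus service address used in the data source step.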
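Access Grafana: kube-prometheus-stack already bundles Grafana, so a separate install is not needed (the prometheus-community repository does not ship a standalone grafana chart; that chart lives in the grafana repository at https://grafana.github.io/helm-charts). With the release named prometheus, run kubectl -n monitoring port-forward svc/prometheus-grafana 3000:80 (adjust -n to the namespace you installed into) and open http://localhost:3000 in a browser. The user name is admin; the password is not admin/admin for a Helm install but is stored in the Secret created by the chart, which you can read with kubectl -n monitoring get secret prometheus-grafana -o jsonpath='{.data.admin-password}' | base64 -d.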
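Configure the data source and dashboards:
- Add a Prometheus data source in Grafana (URL: http://prometheus-operated.monitoring.svc.cluster.local:9090). The Grafana bundled with kube-prometheus-stack normally has this data source provisioned already, so check the existing data sources first.
- Import a community Kubernetes monitoring dashboard (for example ID 3119, "Kubernetes cluster monitoring (via Prometheus)"). Depending on the dashboard you pick, panels cover node resource utilization, Pod status distribution, Deployment rollout state, and similar metrics.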
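Alerting: alert rules (such as node down, Pods stuck in Error, or resource usage above a threshold) are defined in Prometheus; Alertmanager then groups and routes the firing alerts and notifies operators by email, Slack, or other channels.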
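
For the node-down case, a rule might look like the sketch below. It assumes the kube-prometheus-stack install described above (release prometheus, namespace monitoring), where rules are supplied as PrometheusRule objects and the metric kube_node_status_condition comes from the bundled kube-state-metrics. The function name, rule name, 5m hold time, and severity label are illustrative choices, not values from the original answer.

```shell
#!/usr/bin/env bash
# Print a PrometheusRule manifest that fires when a node stops reporting Ready.
# Names, the 5m duration, and the severity value are assumptions for illustration.
node_not_ready_rule() {
  cat <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-not-ready
  namespace: monitoring
  labels:
    release: prometheus   # assumed to match the stack's default rule selector
spec:
  groups:
    - name: node-health
      rules:
        - alert: NodeNotReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.node }} has not been Ready for 5 minutes"
EOF
}
```

Apply it with node_not_ready_rule | kubectl apply -f -. Where the notification goes (email, Slack, and so on) is decided separately by the receivers and routes in the Alertmanager configuration.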
