Kubernetes Resource Scheduling Strategies on Debian
1. Scheduling Overview and Basics
2. Common Scheduling Strategies and Typical Use Cases
| Strategy | Effect | Key fields | Typical scenario |
|---|---|---|---|
| nodeSelector | Hard node selection by label | spec.nodeSelector | Schedule workloads onto nodes carrying a specific label (e.g. disktype=ssd); see the sketch after this table |
| nodeAffinity | Node-level affinity/preference with hard and soft constraints | spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution / preferredDuringSchedulingIgnoredDuringExecution | Prefer GPU / high-performance nodes for compute-heavy workloads |
| podAffinity / podAntiAffinity | Affinity/anti-affinity based on the labels of Pods already running | spec.affinity.podAffinity / podAntiAffinity | Co-locate frontend and backend to cut latency, or spread replicas of the same app for higher availability |
| Taints / Tolerations | Nodes repel Pods unless tolerated, enabling dedicated or isolated nodes | spec.taints (on the Node) / spec.tolerations (on the Pod) | Taint GPU nodes so that only Pods with a matching toleration are scheduled there; evict Pods during node maintenance |
| Topology spread and balancing | Spread or pack workloads across nodes / topology domains | topologyKey (e.g. kubernetes.io/hostname), the NodeResourcesBalancedAllocation score plugin, etc. | Improve fault tolerance and keep resource utilization balanced; see the sketch after this table |
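The nodeSelector and topology-spread rows are not covered by the examples in section 3, so here is a minimal sketch combining both. The Deployment name `cache`, the label `disktype=ssd` and the image are illustrative assumptions, not taken from the examples below.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cache               # illustrative name
spec:
  replicas: 3
  selector: { matchLabels: { app: cache } }
  template:
    metadata:
      labels: { app: cache }
    spec:
      # Hard selection: only nodes labelled disktype=ssd are eligible.
      nodeSelector:
        disktype: ssd
      # Spread replicas evenly across nodes (at most 1 replica of skew per node).
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels: { app: cache }
      containers:
        - name: cache
          image: redis:7
          resources:
            requests: { cpu: "250m", memory: "256Mi" }
```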
3. Key Configuration Examples
Example 1: pin an ML Deployment to GPU-labelled nodes (hard nodeAffinity) and spread its replicas across hosts (soft podAntiAffinity):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-job
spec:
  replicas: 3
  selector: { matchLabels: { app: ml-job } }
  template:
    metadata:
      labels: { app: ml-job }
    spec:
      affinity:
        # Hard constraint: only schedule onto nodes carrying the GPU label.
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/gpu
                    operator: In
                    values: ["nvidia-gpu"]
        # Soft constraint: prefer not to place two ml-job Pods on the same node.
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values: ["ml-job"]
                topologyKey: kubernetes.io/hostname
      containers:
        - name: ml-container
          image: my-ml-image:latest
          resources:
            requests: { cpu: "2", memory: "8Gi" }
            limits: { cpu: "4", memory: "16Gi" }
```
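In practice a GPU workload like this would usually also request the extended resource published by the NVIDIA device plugin (for example `nvidia.com/gpu: 1` under `resources.limits`); the label key `kubernetes.io/gpu` matched above is not a built-in label and has to be set on the nodes by the administrator, as shown after Example 2.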
Example 2: a Pod that tolerates the gpu=true:NoSchedule taint used to reserve dedicated GPU nodes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-worker
spec:
  # Only Pods carrying this toleration may land on nodes tainted gpu=true:NoSchedule.
  tolerations:
    - key: "gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  containers:
    - name: gpu-worker
      image: nvidia/cuda:12.2-base
      command: ["sleep", "infinity"]   # keep the demo Pod running
```
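The node-side setup that Examples 1 and 2 assume can be created with kubectl; the node name `gpu-node-1` is a placeholder:

```bash
# Reserve the node: only Pods with a matching toleration may schedule here.
kubectl taint nodes gpu-node-1 gpu=true:NoSchedule
# Add the label that the nodeAffinity rule in Example 1 matches.
kubectl label nodes gpu-node-1 kubernetes.io/gpu=nvidia-gpu
```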
Example 3: a web Deployment with explicit requests/limits, scaled by a CPU-utilization HorizontalPodAutoscaler:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector: { matchLabels: { app: web } }
  template:
    metadata:
      labels: { app: web }
    spec:
      containers:
        - name: app
          image: nginx:1.25
          resources:
            # Requests drive scheduling decisions; limits cap runtime usage.
            requests: { cpu: "500m", memory: "512Mi" }
            limits: { cpu: "1", memory: "1Gi" }
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    # Target ~70% average CPU utilization relative to the containers' CPU requests.
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
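Note that a resource-metric HPA like this needs the metrics-server add-on; a Debian cluster bootstrapped with kubeadm does not ship it by default.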
Example 4: a kube-scheduler configuration that raises the weight of the NodeResourcesBalancedAllocation score plugin so that placement favours nodes with balanced CPU/memory utilization:

```yaml
# kubescheduler.config.k8s.io/v1 is the stable API since Kubernetes 1.25;
# older clusters use v1beta3.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        enabled:
          - name: NodeResourcesBalancedAllocation
            weight: 2
```
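To apply this on a kubeadm-installed Debian control plane (one common layout, not the only option), save the file on the control-plane node, mount it into the scheduler's static Pod (e.g. via a hostPath volume), and pass it with --config in /etc/kubernetes/manifests/kube-scheduler.yaml; the kubelet restarts kube-scheduler automatically when that manifest changes.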
4. Implementation and Tuning Recommendations on Debian