
Resource Management

Overview

Resource management in Kubernetes is a key mechanism for keeping a cluster running stably. Through requests and limits, ResourceQuota, LimitRange, and related mechanisms, cluster resources can be allocated, capped, and monitored effectively, preventing resource contention and waste.

Core Concepts

Resource Types

  • Compute resources: CPU, memory, ephemeral-storage
  • Extended resources: special hardware such as GPUs and FPGAs
  • Storage resources: persistent storage, ephemeral storage

Resource Management Mechanisms

  • Requests: the minimum resources a container needs; used for scheduling decisions
  • Limits: the maximum resources a container may use; enforced at runtime
  • ResourceQuota: namespace-level caps on total resource consumption
  • LimitRange: per-Pod or per-Container default resource limits and bounds
  • QoS (Quality of Service): a service class assigned automatically based on the resource configuration

Requests and Limits

Basic Concepts

yaml
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: nginx:1.20
    resources:
      requests:        # basis for scheduling
        cpu: "250m"    # 0.25 core
        memory: "64Mi" # 64 MiB
      limits:          # enforced at runtime
        cpu: "500m"    # 0.5 core
        memory: "128Mi" # 128 MiB

CPU Resource Units

  • 1 CPU = 1 AWS vCPU / 1 GCP core / 1 Azure vCore
  • 0.5 CPU = 500m (millicores)
  • 100m = 0.1 core = 10% of one CPU's time

Memory Resource Units

  • Bytes: no unit suffix
  • Ki/Mi/Gi/Ti: binary units (base 1024)
  • K/M/G/T: decimal units (base 1000)
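
These unit rules can be captured in a small helper. The sketch below is illustrative and covers only the suffixes listed above, not the full Kubernetes quantity grammar:

```python
# Illustrative parser for the CPU and memory quantity formats above.
# Covers only the suffixes listed in this section.

def parse_cpu(q: str) -> float:
    """Return a CPU quantity in cores ("500m" -> 0.5, "2" -> 2.0)."""
    if q.endswith("m"):
        return int(q[:-1]) / 1000  # millicores
    return float(q)

_BINARY = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}
_DECIMAL = {"K": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4}

def parse_memory(q: str) -> int:
    """Return a memory quantity in bytes ("128Mi" -> 134217728)."""
    for suffix, factor in _BINARY.items():   # check two-char binary suffixes first
        if q.endswith(suffix):
            return int(q[:-2]) * factor
    for suffix, factor in _DECIMAL.items():
        if q.endswith(suffix):
            return int(q[:-1]) * factor
    return int(q)  # plain bytes

print(parse_cpu("250m"))     # 0.25
print(parse_memory("64Mi"))  # 67108864
```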

Requests and Limits in Detail

yaml
apiVersion: v1
kind: Pod
metadata:
  name: resource-example
spec:
  containers:
  - name: app
    image: myapp:v1
    resources:
      requests:
        cpu: "100m"      # minimum 0.1 core
        memory: "128Mi"  # minimum 128 MiB
        ephemeral-storage: "1Gi"  # minimum 1 GiB of ephemeral storage
      limits:
        cpu: "200m"      # maximum 0.2 core
        memory: "256Mi"  # maximum 256 MiB
        ephemeral-storage: "2Gi"  # maximum 2 GiB of ephemeral storage

Resource Allocation Behavior

yaml
# Scenario 1: only requests set
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
# Result: the scheduler reserves the requested minimum; at runtime the container may use more (no upper bound)

# Scenario 2: only limits set
resources:
  limits:
    cpu: "200m"
    memory: "256Mi"
# Result: requests default to the limits (Guaranteed QoS)

# Scenario 3: both requests and limits set
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "200m"
    memory: "256Mi"
# Result: the scheduler reserves the minimum; the runtime caps usage at the maximum (Burstable QoS)

# Scenario 4: neither set
resources: {}
# Result: no resources reserved at scheduling time, no caps at runtime (BestEffort QoS)
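
The four scenarios map directly onto the QoS classes covered later. A simplified single-container classifier (an illustration; the real kubelet logic evaluates every container in the pod together):

```python
# Simplified QoS classification for a single-container pod,
# mirroring the four scenarios above. The real kubelet logic
# considers all containers in the pod.

def qos_class(requests: dict, limits: dict) -> str:
    if not requests and not limits:
        return "BestEffort"              # scenario 4: nothing set
    # Scenario 2: unset requests default to the limits.
    requests = requests or dict(limits)
    guaranteed = (
        "cpu" in limits and "memory" in limits
        and requests.get("cpu") == limits["cpu"]
        and requests.get("memory") == limits["memory"]
    )
    return "Guaranteed" if guaranteed else "Burstable"

print(qos_class({}, {}))                                  # BestEffort
print(qos_class({}, {"cpu": "200m", "memory": "256Mi"}))  # Guaranteed
print(qos_class({"cpu": "100m", "memory": "128Mi"},
                {"cpu": "200m", "memory": "256Mi"}))      # Burstable
```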

Multi-Container Pod Resource Management

yaml
apiVersion: v1
kind: Pod
metadata:
  name: multi-container-pod
spec:
  containers:
  - name: app
    image: myapp:v1
    resources:
      requests:
        cpu: "200m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"
  
  - name: sidecar
    image: log-collector:v1
    resources:
      requests:
        cpu: "50m"
        memory: "64Mi"
      limits:
        cpu: "100m"
        memory: "128Mi"
  
  # Total pod resources = the sum across all containers
  # Requests: 250m CPU, 320Mi memory
  # Limits: 600m CPU, 640Mi memory
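
The totals above can be checked with a short script; the unit helpers are ad hoc and handle only the formats used in this example:

```python
# Sum the example pod's per-container requests and limits the way the
# scheduler does. The helpers cover only "m" millicores and "Mi" mebibytes.

containers = [
    {"requests": {"cpu": "200m", "memory": "256Mi"},
     "limits":   {"cpu": "500m", "memory": "512Mi"}},   # app
    {"requests": {"cpu": "50m",  "memory": "64Mi"},
     "limits":   {"cpu": "100m", "memory": "128Mi"}},   # sidecar
]

def millicores(q: str) -> int:
    return int(q.rstrip("m"))

def mebibytes(q: str) -> int:
    return int(q[:-2])

def pod_total(kind: str) -> tuple:
    cpu = sum(millicores(c[kind]["cpu"]) for c in containers)
    mem = sum(mebibytes(c[kind]["memory"]) for c in containers)
    return cpu, mem

print("requests: %dm CPU, %dMi memory" % pod_total("requests"))  # 250m, 320Mi
print("limits:   %dm CPU, %dMi memory" % pod_total("limits"))    # 600m, 640Mi
```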

Resource Overcommit

yaml
# Example node allocation
# Node capacity: 4 CPU, 8Gi memory

# Pod 1
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "1"
    memory: "2Gi"

# Pod 2
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "1"
    memory: "2Gi"

# Pod 3
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "1"
    memory: "2Gi"

# Result:
# - Total requests: 1.5 CPU, 3Gi memory (schedulable)
# - Total limits: 3 CPU, 6Gi memory (overcommit allowed)
# - Overcommit ratio: CPU 75%, memory 75% (sum of limits vs. node capacity)
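
The arithmetic behind those figures:

```python
# Recompute the overcommit figures for the example node (4 CPU, 8Gi)
# running the three identical pods above.
node_cpu, node_mem_gi = 4.0, 8.0
pods = 3

req_cpu, req_mem = pods * 0.5, pods * 1.0   # each pod requests 500m / 1Gi
lim_cpu, lim_mem = pods * 1.0, pods * 2.0   # each pod is limited to 1 CPU / 2Gi

print(f"total requests: {req_cpu} CPU, {req_mem}Gi")   # 1.5 CPU, 3.0Gi
print(f"total limits:   {lim_cpu} CPU, {lim_mem}Gi")   # 3.0 CPU, 6.0Gi
print(f"overcommit: CPU {lim_cpu / node_cpu:.0%}, "
      f"memory {lim_mem / node_mem_gi:.0%}")           # 75%, 75%
```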

ResourceQuota

Basic Configuration

yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: development
spec:
  hard:
    requests.cpu: "4"        # total CPU requests must not exceed 4 cores
    requests.memory: 8Gi     # total memory requests must not exceed 8 GiB
    limits.cpu: "8"          # total CPU limits must not exceed 8 cores
    limits.memory: 16Gi      # total memory limits must not exceed 16 GiB
    pods: "10"               # at most 10 pods
    persistentvolumeclaims: "5"  # at most 5 PVCs
    requests.storage: "50Gi" # total storage requests must not exceed 50 GiB
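
A ResourceQuota is enforced at admission time: a create request is rejected with Forbidden if used plus requested would exceed any hard value. A simplified sketch of that check (illustrative names and units, not the apiserver's code):

```python
# Simplified sketch of ResourceQuota admission: a new pod is rejected
# if used + requested would exceed any hard cap. Units here are
# millicores and MiB for illustration.

hard = {"requests.cpu": 4000, "requests.memory": 8192, "pods": 10}
used = {"requests.cpu": 3800, "requests.memory": 4096, "pods": 7}

def admit(pod_request: dict) -> bool:
    for key, amount in pod_request.items():
        if used.get(key, 0) + amount > hard.get(key, float("inf")):
            return False  # would exceed the hard cap -> Forbidden
    return True

# A pod requesting 500m CPU: 3800 + 500 > 4000 -> rejected.
print(admit({"requests.cpu": 500, "requests.memory": 128, "pods": 1}))  # False
# A pod requesting 100m CPU fits.
print(admit({"requests.cpu": 100, "requests.memory": 128, "pods": 1}))  # True
```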

Complete ResourceQuota Example

yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: full-quota
  namespace: production
spec:
  hard:
    # Compute resources
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    
    # Storage resources
    persistentvolumeclaims: "10"
    requests.storage: "100Gi"
    
    # Object counts
    pods: "50"
    services: "10"
    secrets: "20"
    configmaps: "20"
    replicationcontrollers: "5"
    
    # Counts of specific resource types
    count/deployments.apps: "10"
    count/statefulsets.apps: "5"
    count/jobs.batch: "20"
    count/cronjobs.batch: "10"

Per-PriorityClass Quotas

yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pods-high
  namespace: development
spec:
  hard:
    cpu: "10"
    memory: "20Gi"
    pods: "10"
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - high
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pods-medium
  namespace: development
spec:
  hard:
    cpu: "5"
    memory: "10Gi"
    pods: "10"
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - medium
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pods-low
  namespace: development
spec:
  hard:
    cpu: "2"
    memory: "4Gi"
    pods: "5"
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - low

Quota Scopes

yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: scoped-quota
  namespace: development
spec:
  hard:
    pods: "10"
  scopes:
  - BestEffort       # applies only to BestEffort-QoS pods; this scope may only constrain the pods count
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: not-terminating-quota
  namespace: development
spec:
  hard:
    pods: "5"
    cpu: "3"
    memory: "6Gi"
  scopes:
  - NotTerminating   # applies to pods that do not set activeDeadlineSeconds (long-running pods)
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: terminating-quota
  namespace: development
spec:
  hard:
    pods: "5"
    cpu: "2"
    memory: "4Gi"
  scopes:
  - Terminating      # applies to pods that set activeDeadlineSeconds

LimitRange Defaults

Basic Configuration

yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-mem-limit-range
  namespace: development
spec:
  limits:
  - type: Container
    default:          # default limits
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:   # default requests
      cpu: "250m"
      memory: "256Mi"
    max:              # maximum allowed
      cpu: "2"
      memory: "2Gi"
    min:              # minimum allowed
      cpu: "50m"
      memory: "64Mi"
    maxLimitRequestRatio:  # maximum limits/requests ratio
      cpu: 4
      memory: 2
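
How these fields interact at admission time can be sketched as follows. This is a simplification (CPU only, in millicores); the actual defaulting rules have more cases:

```python
# Illustrative sketch of LimitRange admission for a container's CPU,
# in millicores: apply defaults, then enforce min/max and the
# maxLimitRequestRatio. Not the apiserver's actual logic.

DEFAULT_LIMIT, DEFAULT_REQUEST = 500, 250   # default / defaultRequest
MIN_CPU, MAX_CPU = 50, 2000                 # min / max
MAX_RATIO = 4                               # maxLimitRequestRatio

def admit_cpu(request=None, limit=None):
    # Unset values take the LimitRange defaults.
    if limit is None:
        limit = DEFAULT_LIMIT
    if request is None:
        request = DEFAULT_REQUEST
    # Enforce the min/max bounds.
    if request < MIN_CPU or limit > MAX_CPU:
        raise ValueError("outside the LimitRange min/max bounds")
    # Enforce the limit/request ratio.
    if limit / request > MAX_RATIO:
        raise ValueError("limit/request ratio exceeds maxLimitRequestRatio")
    return request, limit

print(admit_cpu())           # (250, 500): defaults applied
print(admit_cpu(100, 400))   # (100, 400): ratio exactly 4, allowed
```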

Complete LimitRange Example

yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: comprehensive-limit-range
  namespace: production
spec:
  limits:
  # Container limits
  - type: Container
    default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "200m"
      memory: "256Mi"
    max:
      cpu: "4"
      memory: "8Gi"
    min:
      cpu: "50m"
      memory: "64Mi"
  
  # Pod limits
  - type: Pod
    max:
      cpu: "8"
      memory: "16Gi"
  
  # PVC limits
  - type: PersistentVolumeClaim
    max:
      storage: "50Gi"
    min:
      storage: "1Gi"
  
  # ImageStream limits (OpenShift only)
  - type: ImageStream
    max:
      openshift.io/images: "10"

LimitRange in Action

yaml
# LimitRange configuration
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: default
spec:
  limits:
  - type: Container
    default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "250m"
      memory: "256Mi"
---
# Pod with no resources specified
apiVersion: v1
kind: Pod
metadata:
  name: auto-resource-pod
spec:
  containers:
  - name: app
    image: nginx:1.20
    # resources omitted; the LimitRange defaults are applied automatically
---
# Effective configuration:
# Requests: cpu=250m, memory=256Mi
# Limits: cpu=500m, memory=512Mi

QoS (Quality of Service)

QoS Classes

yaml
# Guaranteed QoS (highest priority)
# Condition: every container sets requests == limits for both CPU and memory
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
  - name: app
    image: nginx:1.20
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"
# Characteristics:
# - Resources are guaranteed; the pod is not preempted
# - Last to be OOM-killed under memory pressure
# - Suited to critical applications

---
# Burstable QoS (medium priority)
# Condition: not Guaranteed, and at least one container sets a CPU or memory request or limit
apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod
spec:
  containers:
  - name: app
    image: nginx:1.20
    resources:
      requests:
        cpu: "200m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"
# Characteristics:
# - A minimum amount of resources is guaranteed
# - May be OOM-killed under memory pressure
# - Suited to typical applications

---
# BestEffort QoS (lowest priority)
# Condition: no container sets any requests or limits
apiVersion: v1
kind: Pod
metadata:
  name: besteffort-pod
spec:
  containers:
  - name: app
    image: nginx:1.20
# Characteristics:
# - No resource guarantees
# - First to be OOM-killed under memory pressure
# - Suited to batch and test workloads

QoS and OOM Scores

bash
# oom_score_adj ranges from -1000 to 1000
# The higher the score, the sooner the process is OOM-killed

# Guaranteed pods
# oom_score_adj = -997 (killed last)

# Burstable pods
# oom_score_adj = min(max(2, 1000 - (1000 * memoryRequest) / memoryCapacity), 999)

# BestEffort pods
# oom_score_adj = 1000 (killed first)

# Inspect a container's OOM score adjustment from inside the pod
kubectl exec <pod-name> -- cat /proc/1/oom_score_adj
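
Plugging numbers into the Burstable formula shows how the score scales with the request:

```python
# Evaluate the Burstable oom_score_adj formula above for a pod
# requesting 1 GiB of memory on an 8 GiB node.
memory_request = 1 * 1024**3    # 1 GiB requested
memory_capacity = 8 * 1024**3   # 8 GiB node

oom_score_adj = min(max(2, 1000 - (1000 * memory_request) // memory_capacity), 999)
print(oom_score_adj)  # 875: the larger the request relative to the node, the lower the score
```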

QoS Configuration Best Practices

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: critical
  template:
    metadata:
      labels:
        app: critical
    spec:
      containers:
      - name: app
        image: critical-app:v1
        resources:
          requests:
            cpu: "1"
            memory: "1Gi"
          limits:
            cpu: "1"
            memory: "1Gi"
      priorityClassName: high-priority
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Critical applications"

Useful kubectl Commands

Inspecting Resources

bash
# View node resources
kubectl top nodes
kubectl describe node <node-name>

# View pod resource usage
kubectl top pods
kubectl top pods --all-namespaces
kubectl top pod <pod-name> --containers

# View resource quotas
kubectl get resourcequota -n <namespace>
kubectl describe resourcequota <quota-name> -n <namespace>

# View LimitRanges
kubectl get limitrange -n <namespace>
kubectl describe limitrange <limitrange-name> -n <namespace>

# View a pod's QoS class
kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'

# View a pod's resource configuration
kubectl get pod <pod-name> -o yaml | grep -A 10 resources

Managing Resources

bash
# Create a resource quota
kubectl create quota compute-quota --hard=cpu=4,memory=8Gi,pods=10 -n development

# Create a LimitRange
kubectl apply -f limitrange.yaml

# Edit a resource quota
kubectl edit resourcequota compute-quota -n development

# Delete a resource quota
kubectl delete resourcequota compute-quota -n development

# View a namespace's resource usage
kubectl describe namespace development

Monitoring Resources

bash
# Monitor node resources
kubectl top nodes

# Monitor pod resources
kubectl top pods --sort-by=memory
kubectl top pods --sort-by=cpu

# Continuous monitoring
watch kubectl top nodes
watch kubectl top pods -n production

# View resource-related events
kubectl get events --field-selector reason=FailedScheduling
kubectl get events --sort-by='.lastTimestamp'

# View resource requests and limits (first container of each pod)
kubectl get pods -o custom-columns=\
NAME:.metadata.name,\
MEM_REQ:.spec.containers[0].resources.requests.memory,\
MEM_LIM:.spec.containers[0].resources.limits.memory,\
CPU_REQ:.spec.containers[0].resources.requests.cpu,\
CPU_LIM:.spec.containers[0].resources.limits.cpu

Worked Examples

Example 1: Resource Quotas for a Development Environment

yaml
# Development namespace
apiVersion: v1
kind: Namespace
metadata:
  name: development
  labels:
    environment: development
---
# Resource quota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-quota
  namespace: development
spec:
  hard:
    requests.cpu: "4"
    requests.memory: "8Gi"
    limits.cpu: "8"
    limits.memory: "16Gi"
    pods: "20"
    persistentvolumeclaims: "10"
    services: "10"
    secrets: "20"
    configmaps: "20"
---
# Default resource limits
apiVersion: v1
kind: LimitRange
metadata:
  name: dev-limits
  namespace: development
spec:
  limits:
  - type: Container
    default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"
    max:
      cpu: "2"
      memory: "2Gi"
    min:
      cpu: "50m"
      memory: "64Mi"
---
# Sample application
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dev-app
  namespace: development
spec:
  replicas: 3
  selector:
    matchLabels:
      app: dev-app
  template:
    metadata:
      labels:
        app: dev-app
    spec:
      containers:
      - name: app
        image: nginx:1.20
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"

Use case: cap resource usage in the development environment so it cannot consume the cluster.

Example 2: Multi-Tenant Resource Isolation in Production

yaml
# Tenant A namespace
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-a
  labels:
    tenant: a
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-quota
  namespace: tenant-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    pods: "50"
    services: "20"
    persistentvolumeclaims: "20"
    requests.storage: "100Gi"
---
# Tenant B namespace
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-b
  labels:
    tenant: b
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-b-quota
  namespace: tenant-b
spec:
  hard:
    requests.cpu: "15"
    requests.memory: "30Gi"
    limits.cpu: "30"
    limits.memory: "60Gi"
    pods: "100"
    services: "30"
    persistentvolumeclaims: "30"
    requests.storage: "200Gi"
---
# Tenant A's application
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tenant-a-app
  namespace: tenant-a
spec:
  replicas: 5
  selector:
    matchLabels:
      app: tenant-a-app
  template:
    metadata:
      labels:
        app: tenant-a-app
    spec:
      containers:
      - name: app
        image: myapp:v1
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "1"
            memory: "2Gi"
---
# Tenant B's application
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tenant-b-app
  namespace: tenant-b
spec:
  replicas: 10
  selector:
    matchLabels:
      app: tenant-b-app
  template:
    metadata:
      labels:
        app: tenant-b-app
    spec:
      containers:
      - name: app
        image: myapp:v1
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "1"
            memory: "2Gi"

Use case: in a multi-tenant environment, give each tenant its own resource quota to isolate resource consumption.

Example 3: GPU Resource Management

yaml
# GPU resource quota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-team
spec:
  hard:
    requests.nvidia.com/gpu: "4"  # extended resources support only the requests. prefix in quotas
    pods: "10"
---
# GPU workload
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
  namespace: ml-team
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:11.0-base
    resources:
      limits:
        nvidia.com/gpu: 1
    command: ["python", "train.py"]
---
# GPU workload (using an NVIDIA MIG slice)
apiVersion: v1
kind: Pod
metadata:
  name: mig-pod
  namespace: ml-team
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:11.0-base
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1
    command: ["python", "train.py"]

Use case: manage GPUs and other special hardware, ensuring they are allocated sensibly.

Troubleshooting Guide

Diagnosing Common Problems

1. Pod Stuck in Pending (Insufficient Resources)

bash
# View pod events
kubectl describe pod <pod-name>

# Typical error message
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  10s   default-scheduler  0/3 nodes are available: 3 Insufficient cpu.

# Diagnosis steps
# 1. Check node resources
kubectl top nodes
kubectl describe nodes | grep -A 5 "Allocated resources"

# 2. Check resource quotas
kubectl describe resourcequota -n <namespace>

# 3. Check the pod's resource requests
kubectl get pod <pod-name> -o yaml | grep -A 10 resources

# Remedies
# - Add node capacity
# - Lower the pod's requests
# - Remove unneeded pods
# - Raise the resource quota

2. OOMKilled Errors

bash
# Check the pod's status
kubectl get pod <pod-name> -o wide

# View pod events
kubectl describe pod <pod-name>

# Typical error message
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137

# Diagnosis steps
# 1. Check the memory limit
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].resources.limits.memory}'

# 2. Check container memory usage
kubectl top pod <pod-name> --containers

# 3. Check the logs of the killed container
kubectl logs <pod-name> --previous

# Remedies
# - Raise the memory limit
# - Reduce the application's memory usage
# - Check for memory leaks
# - Tune JVM flags (for Java applications)

3. Resource Quota Exceeded

bash
# Check quota status
kubectl describe resourcequota -n <namespace>

# Typical error message
Error from server (Forbidden): error when creating "deployment.yaml": deployments.apps is forbidden: exceeded quota: compute-quota, requested: requests.cpu=500m, used: requests.cpu=4, limited: requests.cpu=4

# Diagnosis steps
# 1. Check current usage against the quota
kubectl get resourcequota -n <namespace> -o yaml

# 2. List the resources in the namespace
kubectl get all -n <namespace>

# 3. Check per-pod usage
kubectl top pods -n <namespace>

# Remedies
# - Delete unneeded resources
# - Raise the quota
# - Right-size resource requests

4. CPU Throttling

bash
# Check container CPU usage
kubectl top pod <pod-name> --containers

# Check the container's CPU limit
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].resources.limits.cpu}'

# Check for CPU throttling (cgroup v1 path; on cgroup v2 use /sys/fs/cgroup/cpu.stat)
kubectl exec -it <pod-name> -- cat /sys/fs/cgroup/cpu/cpu.stat

# Typical output
nr_periods 12345
nr_throttled 6789
throttled_time 1234567890

# Remedies
# - Raise the CPU limit
# - Improve application performance
# - Rebalance the requests-to-limits ratio

Resource Monitoring Script

bash
#!/bin/bash
# Cluster resource monitoring script

echo "=== Cluster resource overview ==="
kubectl top nodes

echo -e "\n=== Pod resource usage ==="
kubectl top pods --all-namespaces | head -20

echo -e "\n=== Resource quota status ==="
for ns in $(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}'); do
  quota=$(kubectl get resourcequota -n $ns -o name 2>/dev/null)
  if [ -n "$quota" ]; then
    echo "Namespace: $ns"
    kubectl describe resourcequota -n $ns | grep -A 20 "Used\|Hard"
  fi
done

echo -e "\n=== Pending pods ==="
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

echo -e "\n=== OOMKilled pods ==="
kubectl get pods --all-namespaces -o json | jq -r '.items[] | select(.status.containerStatuses[]?.lastState.terminated.reason=="OOMKilled") | .metadata.name'

Resource Optimization Tips

yaml
# Before
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "4"
    memory: "8Gi"
# Problem: the gap between requests and limits is far too wide, wasting resources

# After
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "1"
    memory: "1Gi"
# Guideline: keep limits within 2-4x of requests
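
That guideline is easy to check mechanically; a tiny illustrative helper:

```python
# Illustrative check of the "limits within 2-4x of requests" guideline,
# for CPU in millicores.

def ratio_ok(request_mc: int, limit_mc: int, max_ratio: float = 4.0) -> bool:
    return limit_mc / request_mc <= max_ratio

print(ratio_ok(100, 4000))  # False: the "before" example, ratio 40x
print(ratio_ok(500, 1000))  # True: the "after" example, ratio 2x
```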

Best Practices

1. Resource Configuration Strategy

yaml
# Recommended production configuration
resources:
  requests:
    cpu: "500m"      # based on ~80% of observed usage
    memory: "512Mi"  # based on ~80% of observed usage
  limits:
    cpu: "1"         # 2x the request
    memory: "1Gi"    # 2x the request

# Recommended development configuration
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

# Recommended test configuration
resources:
  requests:
    cpu: "50m"
    memory: "64Mi"
  limits:
    cpu: "200m"
    memory: "256Mi"

2. Resource Quota Planning

yaml
# Namespace quota planning
# Total cluster capacity: 100 CPU, 200Gi memory

# Production (50% of capacity)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "50"
    requests.memory: "100Gi"
    limits.cpu: "100"
    limits.memory: "200Gi"

---
# Staging (20% of capacity)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: staging-quota
  namespace: staging
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"

---
# Development (30% of capacity)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: development-quota
  namespace: development
spec:
  hard:
    requests.cpu: "30"
    requests.memory: "60Gi"
    limits.cpu: "60"
    limits.memory: "120Gi"

3. QoS Best Practices

yaml
# Critical applications: Guaranteed QoS
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app
spec:
  template:
    spec:
      containers:
      - name: app
        resources:
          requests:
            cpu: "1"
            memory: "1Gi"
          limits:
            cpu: "1"      # equal to the request
            memory: "1Gi"  # equal to the request
      priorityClassName: high-priority

---
# Typical applications: Burstable QoS
apiVersion: apps/v1
kind: Deployment
metadata:
  name: normal-app
spec:
  template:
    spec:
      containers:
      - name: app
        resources:
          requests:
            cpu: "200m"
            memory: "256Mi"
          limits:
            cpu: "500m"   # greater than the request
            memory: "512Mi" # greater than the request

---
# Batch jobs: BestEffort QoS
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-job
spec:
  template:
    spec:
      containers:
      - name: batch
        # resources deliberately left unset

4. Resource Monitoring and Alerting

yaml
# Example Prometheus alerting rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: resource-alerts
spec:
  groups:
  - name: resource-alerts
    rules:
    - alert: NodeMemoryUsageHigh
      expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Node memory usage is high"
        description: "Node {{ $labels.instance }} memory usage is {{ $value }}%"
    
    - alert: PodCPUThrottlingHigh
      expr: rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod CPU throttling is high"
        description: "Pod {{ $labels.pod }} CPU throttling is {{ $value }}"
    
    - alert: NamespaceQuotaExceeded
      expr: kube_resourcequota{type="used"} / ignoring(type) kube_resourcequota{type="hard"} > 0.9
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Namespace quota is almost exceeded"
        description: "Namespace {{ $labels.namespace }} quota usage is {{ $value }}"

5. Resource Optimization Checklist

markdown
## Resource optimization checklist

### Pod level
- [ ] Set requests and limits on every container
- [ ] Keep the requests-to-limits ratio reasonable (2-4x)
- [ ] Adjust resource settings based on observed usage
- [ ] Choose an appropriate QoS class
- [ ] Configure resource monitoring and alerting

### Namespace level
- [ ] Cap total resources with a ResourceQuota
- [ ] Provide defaults with a LimitRange
- [ ] Review resource usage regularly
- [ ] Clean up unused resources

### Cluster level
- [ ] Monitor node utilization
- [ ] Enable cluster autoscaling
- [ ] Plan the overcommit ratio
- [ ] Enforce resource isolation

Summary

Key Points

  1. Requests and Limits

    • Requests drive scheduling and guarantee a minimum amount of resources
    • Limits are enforced at runtime and prevent resource abuse
    • Keep the ratio between them sensible to avoid waste
  2. Quota Management

    • ResourceQuota caps a namespace's total resource consumption
    • LimitRange supplies default resource limits
    • Allocate resources per environment and per tenant
  3. QoS Classes

    • Guaranteed: critical applications, resources guaranteed
    • Burstable: typical applications, elastic resources
    • BestEffort: batch jobs, no guarantees
  4. Monitoring and Optimization

    • Monitor resource usage continuously
    • Tune resource settings regularly
    • Put resource alerting in place

Command Quick Reference

bash
# Inspecting resources
kubectl top nodes
kubectl top pods --all-namespaces
kubectl describe resourcequota -n <namespace>

# Managing quotas
kubectl create quota <name> --hard=cpu=4,memory=8Gi -n <namespace>
kubectl apply -f limitrange.yaml

# Troubleshooting
kubectl describe pod <pod-name>
kubectl logs <pod-name> --previous
kubectl get events --field-selector reason=FailedScheduling

# Monitoring
watch kubectl top nodes
kubectl top pods --sort-by=memory
