
Resource Management

Overview

Resource management in Kubernetes is a key mechanism for keeping a cluster running stably. Through requests and limits, ResourceQuota, LimitRange, and related mechanisms, cluster resources can be allocated, capped, and monitored effectively, preventing resource contention and waste.

Core Concepts

Resource Types

  • Compute resources: CPU, memory, ephemeral-storage
  • Extended resources: special hardware such as GPUs and FPGAs
  • Storage resources: persistent storage, ephemeral storage

Resource Management Mechanisms

  • Requests: the minimum resources a container needs; used for scheduling decisions
  • Limits: the maximum resources a container may use; enforced at runtime
  • ResourceQuota: namespace-level caps on total resource consumption
  • LimitRange: per-Pod or per-Container default resource limits and bounds
  • QoS (Quality of Service): a service class assigned automatically based on the resource configuration

Requests and Limits

Basic Concepts

yaml
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: nginx:1.20
    resources:
      requests:        # basis for scheduling
        cpu: "250m"    # 0.25 core
        memory: "64Mi" # 64 MiB
      limits:          # enforced at runtime
        cpu: "500m"    # 0.5 core
        memory: "128Mi" # 128 MiB

CPU Resource Units

  • 1 CPU = 1 AWS vCPU / 1 GCP core / 1 Azure vCore
  • 0.5 CPU = 500m (millicores)
  • 100m = 0.1 core = 10% of one CPU's time

Memory Resource Units

  • Bytes: no unit suffix
  • Ki/Mi/Gi/Ti: binary units (base 1024)
  • K/M/G/T: decimal units (base 1000)
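
These unit rules can be captured in a small helper. The sketch below is illustrative and covers only the suffixes listed above, not the full Kubernetes quantity grammar:

```python
# Illustrative parser for the CPU and memory quantity formats above.
# Covers only the suffixes listed in this section.

def parse_cpu(q: str) -> float:
    """Return a CPU quantity in cores ("500m" -> 0.5, "2" -> 2.0)."""
    if q.endswith("m"):
        return int(q[:-1]) / 1000  # millicores
    return float(q)

_BINARY = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}
_DECIMAL = {"K": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4}

def parse_memory(q: str) -> int:
    """Return a memory quantity in bytes ("128Mi" -> 134217728)."""
    for suffix, factor in _BINARY.items():   # check two-char binary suffixes first
        if q.endswith(suffix):
            return int(q[:-2]) * factor
    for suffix, factor in _DECIMAL.items():
        if q.endswith(suffix):
            return int(q[:-1]) * factor
    return int(q)  # plain bytes

print(parse_cpu("250m"))     # 0.25
print(parse_memory("64Mi"))  # 67108864
```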

Requests and Limits in Detail

yaml
apiVersion: v1
kind: Pod
metadata:
  name: resource-example
spec:
  containers:
  - name: app
    image: myapp:v1
    resources:
      requests:
        cpu: "100m"      # minimum 0.1 core
        memory: "128Mi"  # minimum 128 MiB
        ephemeral-storage: "1Gi"  # minimum 1 GiB of ephemeral storage
      limits:
        cpu: "200m"      # maximum 0.2 core
        memory: "256Mi"  # maximum 256 MiB
        ephemeral-storage: "2Gi"  # maximum 2 GiB of ephemeral storage

Resource Allocation Behavior

yaml
# Scenario 1: only requests set
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
# Result: the scheduler reserves the requested minimum; at runtime the container may use more (no upper bound)

# Scenario 2: only limits set
resources:
  limits:
    cpu: "200m"
    memory: "256Mi"
# Result: requests default to the limits (Guaranteed QoS)

# Scenario 3: both requests and limits set
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "200m"
    memory: "256Mi"
# Result: the scheduler reserves the minimum; the runtime caps usage at the maximum (Burstable QoS)

# Scenario 4: neither set
resources: {}
# Result: no resources reserved at scheduling time, no caps at runtime (BestEffort QoS)
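
The four scenarios map directly onto the QoS classes covered later. A simplified single-container classifier (an illustration; the real kubelet logic evaluates every container in the pod together):

```python
# Simplified QoS classification for a single-container pod,
# mirroring the four scenarios above. The real kubelet logic
# considers all containers in the pod.

def qos_class(requests: dict, limits: dict) -> str:
    if not requests and not limits:
        return "BestEffort"              # scenario 4: nothing set
    # Scenario 2: unset requests default to the limits.
    requests = requests or dict(limits)
    guaranteed = (
        "cpu" in limits and "memory" in limits
        and requests.get("cpu") == limits["cpu"]
        and requests.get("memory") == limits["memory"]
    )
    return "Guaranteed" if guaranteed else "Burstable"

print(qos_class({}, {}))                                  # BestEffort
print(qos_class({}, {"cpu": "200m", "memory": "256Mi"}))  # Guaranteed
print(qos_class({"cpu": "100m", "memory": "128Mi"},
                {"cpu": "200m", "memory": "256Mi"}))      # Burstable
```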

Multi-Container Pod Resource Management

yaml
apiVersion: v1
kind: Pod
metadata:
  name: multi-container-pod
spec:
  containers:
  - name: app
    image: myapp:v1
    resources:
      requests:
        cpu: "200m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"
  
  - name: sidecar
    image: log-collector:v1
    resources:
      requests:
        cpu: "50m"
        memory: "64Mi"
      limits:
        cpu: "100m"
        memory: "128Mi"
  
  # Total pod resources = the sum across all containers
  # Requests: 250m CPU, 320Mi memory
  # Limits: 600m CPU, 640Mi memory
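
The totals above can be checked with a short script; the unit helpers are ad hoc and handle only the formats used in this example:

```python
# Sum the example pod's per-container requests and limits the way the
# scheduler does. The helpers cover only "m" millicores and "Mi" mebibytes.

containers = [
    {"requests": {"cpu": "200m", "memory": "256Mi"},
     "limits":   {"cpu": "500m", "memory": "512Mi"}},   # app
    {"requests": {"cpu": "50m",  "memory": "64Mi"},
     "limits":   {"cpu": "100m", "memory": "128Mi"}},   # sidecar
]

def millicores(q: str) -> int:
    return int(q.rstrip("m"))

def mebibytes(q: str) -> int:
    return int(q[:-2])

def pod_total(kind: str) -> tuple:
    cpu = sum(millicores(c[kind]["cpu"]) for c in containers)
    mem = sum(mebibytes(c[kind]["memory"]) for c in containers)
    return cpu, mem

print("requests: %dm CPU, %dMi memory" % pod_total("requests"))  # 250m, 320Mi
print("limits:   %dm CPU, %dMi memory" % pod_total("limits"))    # 600m, 640Mi
```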

Resource Overcommit

yaml
# Example node allocation
# Node capacity: 4 CPU, 8Gi memory

# Pod 1
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "1"
    memory: "2Gi"

# Pod 2
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "1"
    memory: "2Gi"

# Pod 3
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "1"
    memory: "2Gi"

# Result:
# - Total requests: 1.5 CPU, 3Gi memory (schedulable)
# - Total limits: 3 CPU, 6Gi memory (overcommit allowed)
# - Overcommit ratio: CPU 75%, memory 75% (sum of limits vs. node capacity)
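
The arithmetic behind those figures:

```python
# Recompute the overcommit figures for the example node (4 CPU, 8Gi)
# running the three identical pods above.
node_cpu, node_mem_gi = 4.0, 8.0
pods = 3

req_cpu, req_mem = pods * 0.5, pods * 1.0   # each pod requests 500m / 1Gi
lim_cpu, lim_mem = pods * 1.0, pods * 2.0   # each pod is limited to 1 CPU / 2Gi

print(f"total requests: {req_cpu} CPU, {req_mem}Gi")   # 1.5 CPU, 3.0Gi
print(f"total limits:   {lim_cpu} CPU, {lim_mem}Gi")   # 3.0 CPU, 6.0Gi
print(f"overcommit: CPU {lim_cpu / node_cpu:.0%}, "
      f"memory {lim_mem / node_mem_gi:.0%}")           # 75%, 75%
```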

ResourceQuota

Basic Configuration

yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: development
spec:
  hard:
    requests.cpu: "4"        # total CPU requests must not exceed 4 cores
    requests.memory: 8Gi     # total memory requests must not exceed 8 GiB
    limits.cpu: "8"          # total CPU limits must not exceed 8 cores
    limits.memory: 16Gi      # total memory limits must not exceed 16 GiB
    pods: "10"               # at most 10 pods
    persistentvolumeclaims: "5"  # at most 5 PVCs
    requests.storage: "50Gi" # total storage requests must not exceed 50 GiB
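
A ResourceQuota is enforced at admission time: a create request is rejected with Forbidden if used plus requested would exceed any hard value. A simplified sketch of that check (illustrative names and units, not the apiserver's code):

```python
# Simplified sketch of ResourceQuota admission: a new pod is rejected
# if used + requested would exceed any hard cap. Units here are
# millicores and MiB for illustration.

hard = {"requests.cpu": 4000, "requests.memory": 8192, "pods": 10}
used = {"requests.cpu": 3800, "requests.memory": 4096, "pods": 7}

def admit(pod_request: dict) -> bool:
    for key, amount in pod_request.items():
        if used.get(key, 0) + amount > hard.get(key, float("inf")):
            return False  # would exceed the hard cap -> Forbidden
    return True

# A pod requesting 500m CPU: 3800 + 500 > 4000 -> rejected.
print(admit({"requests.cpu": 500, "requests.memory": 128, "pods": 1}))  # False
# A pod requesting 100m CPU fits.
print(admit({"requests.cpu": 100, "requests.memory": 128, "pods": 1}))  # True
```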

Complete ResourceQuota Example

yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: full-quota
  namespace: production
spec:
  hard:
    # Compute resources
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    
    # Storage resources
    persistentvolumeclaims: "10"
    requests.storage: "100Gi"
    
    # Object counts
    pods: "50"
    services: "10"
    secrets: "20"
    configmaps: "20"
    replicationcontrollers: "5"
    
    # Counts of specific resource types
    count/deployments.apps: "10"
    count/statefulsets.apps: "5"
    count/jobs.batch: "20"
    count/cronjobs.batch: "10"

Per-PriorityClass Quotas

yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pods-high
  namespace: development
spec:
  hard:
    cpu: "10"
    memory: "20Gi"
    pods: "10"
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - high
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pods-medium
  namespace: development
spec:
  hard:
    cpu: "5"
    memory: "10Gi"
    pods: "10"
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - medium
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pods-low
  namespace: development
spec:
  hard:
    cpu: "2"
    memory: "4Gi"
    pods: "5"
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - low

Quota Scopes

yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: scoped-quota
  namespace: development
spec:
  hard:
    pods: "10"
  scopes:
  - BestEffort       # applies only to BestEffort-QoS pods; this scope may only constrain the pods count
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: not-terminating-quota
  namespace: development
spec:
  hard:
    pods: "5"
    cpu: "3"
    memory: "6Gi"
  scopes:
  - NotTerminating   # applies to pods that do not set activeDeadlineSeconds (long-running pods)
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: terminating-quota
  namespace: development
spec:
  hard:
    pods: "5"
    cpu: "2"
    memory: "4Gi"
  scopes:
  - Terminating      # applies to pods that set activeDeadlineSeconds

LimitRange Defaults

Basic Configuration

yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-mem-limit-range
  namespace: development
spec:
  limits:
  - type: Container
    default:          # default limits
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:   # default requests
      cpu: "250m"
      memory: "256Mi"
    max:              # maximum allowed
      cpu: "2"
      memory: "2Gi"
    min:              # minimum allowed
      cpu: "50m"
      memory: "64Mi"
    maxLimitRequestRatio:  # maximum limits/requests ratio
      cpu: 4
      memory: 2
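
How these fields interact at admission time can be sketched as follows. This is a simplification (CPU only, in millicores); the actual defaulting rules have more cases:

```python
# Illustrative sketch of LimitRange admission for a container's CPU,
# in millicores: apply defaults, then enforce min/max and the
# maxLimitRequestRatio. Not the apiserver's actual logic.

DEFAULT_LIMIT, DEFAULT_REQUEST = 500, 250   # default / defaultRequest
MIN_CPU, MAX_CPU = 50, 2000                 # min / max
MAX_RATIO = 4                               # maxLimitRequestRatio

def admit_cpu(request=None, limit=None):
    # Unset values take the LimitRange defaults.
    if limit is None:
        limit = DEFAULT_LIMIT
    if request is None:
        request = DEFAULT_REQUEST
    # Enforce the min/max bounds.
    if request < MIN_CPU or limit > MAX_CPU:
        raise ValueError("outside the LimitRange min/max bounds")
    # Enforce the limit/request ratio.
    if limit / request > MAX_RATIO:
        raise ValueError("limit/request ratio exceeds maxLimitRequestRatio")
    return request, limit

print(admit_cpu())           # (250, 500): defaults applied
print(admit_cpu(100, 400))   # (100, 400): ratio exactly 4, allowed
```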

Complete LimitRange Example

yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: comprehensive-limit-range
  namespace: production
spec:
  limits:
  # Container limits
  - type: Container
    default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "200m"
      memory: "256Mi"
    max:
      cpu: "4"
      memory: "8Gi"
    min:
      cpu: "50m"
      memory: "64Mi"
  
  # Pod limits
  - type: Pod
    max:
      cpu: "8"
      memory: "16Gi"
  
  # PVC limits
  - type: PersistentVolumeClaim
    max:
      storage: "50Gi"
    min:
      storage: "1Gi"
  
  # ImageStream limits (OpenShift only)
  - type: ImageStream
    max:
      openshift.io/images: "10"

LimitRange in Action

yaml
# LimitRange configuration
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: default
spec:
  limits:
  - type: Container
    default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "250m"
      memory: "256Mi"
---
# Pod with no resources specified
apiVersion: v1
kind: Pod
metadata:
  name: auto-resource-pod
spec:
  containers:
  - name: app
    image: nginx:1.20
    # resources omitted; the LimitRange defaults are applied automatically
---
# Effective configuration:
# Requests: cpu=250m, memory=256Mi
# Limits: cpu=500m, memory=512Mi

QoS (Quality of Service)

QoS Classes

yaml
# Guaranteed QoS (highest priority)
# Condition: every container sets requests == limits for both CPU and memory
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
  - name: app
    image: nginx:1.20
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"
# Characteristics:
# - Resources are guaranteed; the pod is not preempted
# - Last to be OOM-killed under memory pressure
# - Suited to critical applications

---
# Burstable QoS (medium priority)
# Condition: not Guaranteed, and at least one container sets a CPU or memory request or limit
apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod
spec:
  containers:
  - name: app
    image: nginx:1.20
    resources:
      requests:
        cpu: "200m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"
# Characteristics:
# - A minimum amount of resources is guaranteed
# - May be OOM-killed under memory pressure
# - Suited to typical applications

---
# BestEffort QoS (lowest priority)
# Condition: no container sets any requests or limits
apiVersion: v1
kind: Pod
metadata:
  name: besteffort-pod
spec:
  containers:
  - name: app
    image: nginx:1.20
# Characteristics:
# - No resource guarantees
# - First to be OOM-killed under memory pressure
# - Suited to batch and test workloads

QoS and OOM Scores

bash
# oom_score_adj ranges from -1000 to 1000
# The higher the score, the sooner the process is OOM-killed

# Guaranteed pods
# oom_score_adj = -997 (killed last)

# Burstable pods
# oom_score_adj = min(max(2, 1000 - (1000 * memoryRequest) / memoryCapacity), 999)

# BestEffort pods
# oom_score_adj = 1000 (killed first)

# Inspect a container's OOM score adjustment from inside the pod
kubectl exec <pod-name> -- cat /proc/1/oom_score_adj
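
Plugging numbers into the Burstable formula shows how the score scales with the request:

```python
# Evaluate the Burstable oom_score_adj formula above for a pod
# requesting 1 GiB of memory on an 8 GiB node.
memory_request = 1 * 1024**3    # 1 GiB requested
memory_capacity = 8 * 1024**3   # 8 GiB node

oom_score_adj = min(max(2, 1000 - (1000 * memory_request) // memory_capacity), 999)
print(oom_score_adj)  # 875: the larger the request relative to the node, the lower the score
```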

QoS Configuration Best Practices

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: critical
  template:
    metadata:
      labels:
        app: critical
    spec:
      containers:
      - name: app
        image: critical-app:v1
        resources:
          requests:
            cpu: "1"
            memory: "1Gi"
          limits:
            cpu: "1"
            memory: "1Gi"
      priorityClassName: high-priority
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Critical applications"

Useful kubectl Commands

Inspecting Resources

bash
# View node resources
kubectl top nodes
kubectl describe node <node-name>

# View pod resource usage
kubectl top pods
kubectl top pods --all-namespaces
kubectl top pod <pod-name> --containers

# View resource quotas
kubectl get resourcequota -n <namespace>
kubectl describe resourcequota <quota-name> -n <namespace>

# View LimitRanges
kubectl get limitrange -n <namespace>
kubectl describe limitrange <limitrange-name> -n <namespace>

# View a pod's QoS class
kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'

# View a pod's resource configuration
kubectl get pod <pod-name> -o yaml | grep -A 10 resources

Managing Resources

bash
# Create a resource quota
kubectl create quota compute-quota --hard=cpu=4,memory=8Gi,pods=10 -n development

# Create a LimitRange
kubectl apply -f limitrange.yaml

# Edit a resource quota
kubectl edit resourcequota compute-quota -n development

# Delete a resource quota
kubectl delete resourcequota compute-quota -n development

# View a namespace's resource usage
kubectl describe namespace development

Monitoring Resources

bash
# Monitor node resources
kubectl top nodes

# Monitor pod resources
kubectl top pods --sort-by=memory
kubectl top pods --sort-by=cpu

# Continuous monitoring
watch kubectl top nodes
watch kubectl top pods -n production

# View resource-related events
kubectl get events --field-selector reason=FailedScheduling
kubectl get events --sort-by='.lastTimestamp'

# View resource requests and limits (first container of each pod)
kubectl get pods -o custom-columns=\
NAME:.metadata.name,\
MEM_REQ:.spec.containers[0].resources.requests.memory,\
MEM_LIM:.spec.containers[0].resources.limits.memory,\
CPU_REQ:.spec.containers[0].resources.requests.cpu,\
CPU_LIM:.spec.containers[0].resources.limits.cpu

Worked Examples

Example 1: Resource Quotas for a Development Environment

yaml
# Development namespace
apiVersion: v1
kind: Namespace
metadata:
  name: development
  labels:
    environment: development
---
# Resource quota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-quota
  namespace: development
spec:
  hard:
    requests.cpu: "4"
    requests.memory: "8Gi"
    limits.cpu: "8"
    limits.memory: "16Gi"
    pods: "20"
    persistentvolumeclaims: "10"
    services: "10"
    secrets: "20"
    configmaps: "20"
---
# Default resource limits
apiVersion: v1
kind: LimitRange
metadata:
  name: dev-limits
  namespace: development
spec:
  limits:
  - type: Container
    default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"
    max:
      cpu: "2"
      memory: "2Gi"
    min:
      cpu: "50m"
      memory: "64Mi"
---
# Sample application
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dev-app
  namespace: development
spec:
  replicas: 3
  selector:
    matchLabels:
      app: dev-app
  template:
    metadata:
      labels:
        app: dev-app
    spec:
      containers:
      - name: app
        image: nginx:1.20
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"

Use case: cap resource usage in the development environment so it cannot consume the cluster.

Example 2: Multi-Tenant Resource Isolation in Production

yaml
# Tenant A namespace
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-a
  labels:
    tenant: a
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-quota
  namespace: tenant-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    pods: "50"
    services: "20"
    persistentvolumeclaims: "20"
    requests.storage: "100Gi"
---
# Tenant B namespace
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-b
  labels:
    tenant: b
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-b-quota
  namespace: tenant-b
spec:
  hard:
    requests.cpu: "15"
    requests.memory: "30Gi"
    limits.cpu: "30"
    limits.memory: "60Gi"
    pods: "100"
    services: "30"
    persistentvolumeclaims: "30"
    requests.storage: "200Gi"
---
# Tenant A's application
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tenant-a-app
  namespace: tenant-a
spec:
  replicas: 5
  selector:
    matchLabels:
      app: tenant-a-app
  template:
    metadata:
      labels:
        app: tenant-a-app
    spec:
      containers:
      - name: app
        image: myapp:v1
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "1"
            memory: "2Gi"
---
# Tenant B's application
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tenant-b-app
  namespace: tenant-b
spec:
  replicas: 10
  selector:
    matchLabels:
      app: tenant-b-app
  template:
    metadata:
      labels:
        app: tenant-b-app
    spec:
      containers:
      - name: app
        image: myapp:v1
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "1"
            memory: "2Gi"

Use case: in a multi-tenant environment, give each tenant its own resource quota to isolate resource consumption.

Example 3: GPU Resource Management

yaml
# GPU resource quota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-team
spec:
  hard:
    requests.nvidia.com/gpu: "4"  # extended resources support only the requests. prefix in quotas
    pods: "10"
---
# GPU workload
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
  namespace: ml-team
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:11.0-base
    resources:
      limits:
        nvidia.com/gpu: 1
    command: ["python", "train.py"]
---
# GPU workload (using an NVIDIA MIG slice)
apiVersion: v1
kind: Pod
metadata:
  name: mig-pod
  namespace: ml-team
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:11.0-base
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1
    command: ["python", "train.py"]

Use case: manage GPUs and other special hardware, ensuring they are allocated sensibly.

Troubleshooting Guide

Diagnosing Common Problems

1. Pod Stuck in Pending (Insufficient Resources)

bash
# View pod events
kubectl describe pod <pod-name>

# Typical error message
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  10s   default-scheduler  0/3 nodes are available: 3 Insufficient cpu.

# Diagnosis steps
# 1. Check node resources
kubectl top nodes
kubectl describe nodes | grep -A 5 "Allocated resources"

# 2. Check resource quotas
kubectl describe resourcequota -n <namespace>

# 3. Check the pod's resource requests
kubectl get pod <pod-name> -o yaml | grep -A 10 resources

# Remedies
# - Add node capacity
# - Lower the pod's requests
# - Remove unneeded pods
# - Raise the resource quota

2. OOMKilled Errors

bash
# Check the pod's status
kubectl get pod <pod-name> -o wide

# View pod events
kubectl describe pod <pod-name>

# Typical error message
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137

# Diagnosis steps
# 1. Check the memory limit
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].resources.limits.memory}'

# 2. Check container memory usage
kubectl top pod <pod-name> --containers

# 3. Check the logs of the killed container
kubectl logs <pod-name> --previous

# Remedies
# - Raise the memory limit
# - Reduce the application's memory usage
# - Check for memory leaks
# - Tune JVM flags (for Java applications)

3. Resource Quota Exceeded

bash
# Check quota status
kubectl describe resourcequota -n <namespace>

# Typical error message
Error from server (Forbidden): error when creating "deployment.yaml": deployments.apps is forbidden: exceeded quota: compute-quota, requested: requests.cpu=500m, used: requests.cpu=4, limited: requests.cpu=4

# Diagnosis steps
# 1. Check current usage against the quota
kubectl get resourcequota -n <namespace> -o yaml

# 2. List the resources in the namespace
kubectl get all -n <namespace>

# 3. Check per-pod usage
kubectl top pods -n <namespace>

# Remedies
# - Delete unneeded resources
# - Raise the quota
# - Right-size resource requests

4. CPU Throttling

bash
# Check container CPU usage
kubectl top pod <pod-name> --containers

# Check the container's CPU limit
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].resources.limits.cpu}'

# Check for CPU throttling (cgroup v1 path; on cgroup v2 use /sys/fs/cgroup/cpu.stat)
kubectl exec -it <pod-name> -- cat /sys/fs/cgroup/cpu/cpu.stat

# Typical output
nr_periods 12345
nr_throttled 6789
throttled_time 1234567890

# Remedies
# - Raise the CPU limit
# - Improve application performance
# - Rebalance the requests-to-limits ratio

Resource Monitoring Script

bash
#!/bin/bash
# Cluster resource monitoring script

echo "=== Cluster resource overview ==="
kubectl top nodes

echo -e "\n=== Pod resource usage ==="
kubectl top pods --all-namespaces | head -20

echo -e "\n=== Resource quota status ==="
for ns in $(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}'); do
  quota=$(kubectl get resourcequota -n $ns -o name 2>/dev/null)
  if [ -n "$quota" ]; then
    echo "Namespace: $ns"
    kubectl describe resourcequota -n $ns | grep -A 20 "Used\|Hard"
  fi
done

echo -e "\n=== Pending pods ==="
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

echo -e "\n=== OOMKilled pods ==="
kubectl get pods --all-namespaces -o json | jq -r '.items[] | select(.status.containerStatuses[]?.lastState.terminated.reason=="OOMKilled") | .metadata.name'

Resource Optimization Tips

yaml
# Before
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "4"
    memory: "8Gi"
# Problem: the gap between requests and limits is far too wide, wasting resources

# After
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "1"
    memory: "1Gi"
# Guideline: keep limits within 2-4x of requests
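
That guideline is easy to check mechanically; a tiny illustrative helper:

```python
# Illustrative check of the "limits within 2-4x of requests" guideline,
# for CPU in millicores.

def ratio_ok(request_mc: int, limit_mc: int, max_ratio: float = 4.0) -> bool:
    return limit_mc / request_mc <= max_ratio

print(ratio_ok(100, 4000))  # False: the "before" example, ratio 40x
print(ratio_ok(500, 1000))  # True: the "after" example, ratio 2x
```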

Best Practices

1. Resource Configuration Strategy

yaml
# Recommended production configuration
resources:
  requests:
    cpu: "500m"      # based on ~80% of observed usage
    memory: "512Mi"  # based on ~80% of observed usage
  limits:
    cpu: "1"         # 2x the request
    memory: "1Gi"    # 2x the request

# Recommended development configuration
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

# Recommended test configuration
resources:
  requests:
    cpu: "50m"
    memory: "64Mi"
  limits:
    cpu: "200m"
    memory: "256Mi"

2. Resource Quota Planning

yaml
# Namespace quota planning
# Total cluster capacity: 100 CPU, 200Gi memory

# Production (50% of capacity)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "50"
    requests.memory: "100Gi"
    limits.cpu: "100"
    limits.memory: "200Gi"

---
# Staging (20% of capacity)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: staging-quota
  namespace: staging
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"

---
# Development (30% of capacity)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: development-quota
  namespace: development
spec:
  hard:
    requests.cpu: "30"
    requests.memory: "60Gi"
    limits.cpu: "60"
    limits.memory: "120Gi"

3. QoS Best Practices

yaml
# Critical applications: Guaranteed QoS
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app
spec:
  template:
    spec:
      containers:
      - name: app
        resources:
          requests:
            cpu: "1"
            memory: "1Gi"
          limits:
            cpu: "1"      # equal to the request
            memory: "1Gi"  # equal to the request
      priorityClassName: high-priority

---
# Typical applications: Burstable QoS
apiVersion: apps/v1
kind: Deployment
metadata:
  name: normal-app
spec:
  template:
    spec:
      containers:
      - name: app
        resources:
          requests:
            cpu: "200m"
            memory: "256Mi"
          limits:
            cpu: "500m"   # greater than the request
            memory: "512Mi" # greater than the request

---
# Batch jobs: BestEffort QoS
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-job
spec:
  template:
    spec:
      containers:
      - name: batch
        # resources deliberately left unset

4. Resource Monitoring and Alerting

yaml
# Example Prometheus alerting rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: resource-alerts
spec:
  groups:
  - name: resource-alerts
    rules:
    - alert: NodeMemoryUsageHigh
      expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Node memory usage is high"
        description: "Node {{ $labels.instance }} memory usage is {{ $value }}%"
    
    - alert: PodCPUThrottlingHigh
      expr: rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod CPU throttling is high"
        description: "Pod {{ $labels.pod }} CPU throttling is {{ $value }}"
    
    - alert: NamespaceQuotaExceeded
      expr: kube_resourcequota{type="used"} / ignoring(type) kube_resourcequota{type="hard"} > 0.9
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Namespace quota is almost exceeded"
        description: "Namespace {{ $labels.namespace }} quota usage is {{ $value }}"

5. Resource Optimization Checklist

markdown
## Resource optimization checklist

### Pod level
- [ ] Set requests and limits on every container
- [ ] Keep the requests-to-limits ratio reasonable (2-4x)
- [ ] Adjust resource settings based on observed usage
- [ ] Choose an appropriate QoS class
- [ ] Configure resource monitoring and alerting

### Namespace level
- [ ] Cap total resources with a ResourceQuota
- [ ] Provide defaults with a LimitRange
- [ ] Review resource usage regularly
- [ ] Clean up unused resources

### Cluster level
- [ ] Monitor node utilization
- [ ] Enable cluster autoscaling
- [ ] Plan the overcommit ratio
- [ ] Enforce resource isolation

Summary

Key Points

  1. Requests and Limits

    • Requests drive scheduling and guarantee a minimum amount of resources
    • Limits are enforced at runtime and prevent resource abuse
    • Keep the ratio between them sensible to avoid waste
  2. Quota Management

    • ResourceQuota caps a namespace's total resource consumption
    • LimitRange supplies default resource limits
    • Allocate resources per environment and per tenant
  3. QoS Classes

    • Guaranteed: critical applications, resources guaranteed
    • Burstable: typical applications, elastic resources
    • BestEffort: batch jobs, no guarantees
  4. Monitoring and Optimization

    • Monitor resource usage continuously
    • Tune resource settings regularly
    • Put resource alerting in place

Command Quick Reference

bash
# Inspecting resources
kubectl top nodes
kubectl top pods --all-namespaces
kubectl describe resourcequota -n <namespace>

# Managing quotas
kubectl create quota <name> --hard=cpu=4,memory=8Gi -n <namespace>
kubectl apply -f limitrange.yaml

# Troubleshooting
kubectl describe pod <pod-name>
kubectl logs <pod-name> --previous
kubectl get events --field-selector reason=FailedScheduling

# Monitoring
watch kubectl top nodes
kubectl top pods --sort-by=memory
