# Resource Management

## Overview

Kubernetes resource management is a key mechanism for keeping a cluster stable. Requests and Limits, ResourceQuota, LimitRange, and related mechanisms let you allocate, cap, and monitor cluster resources effectively, preventing both contention and waste.

## Core Concepts

### Resource types

- Compute resources: CPU, memory, ephemeral-storage
- Extended resources: special hardware such as GPUs and FPGAs
- Storage resources: persistent and ephemeral storage

### Resource management mechanisms

- Requests: the minimum resources a container needs; used for scheduling decisions
- Limits: the maximum resources a container may use; enforced at runtime
- ResourceQuota: a namespace-level cap on total resources
- LimitRange: per-Pod or per-Container defaults and bounds
- QoS (Quality of Service): a service class assigned automatically from the resource configuration
## Requests and Limits

### Basic concepts

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: nginx:1.20
    resources:
      requests:           # basis for scheduling
        cpu: "250m"       # 0.25 core
        memory: "64Mi"    # 64 MiB
      limits:             # runtime ceiling
        cpu: "500m"       # 0.5 core
        memory: "128Mi"   # 128 MiB
```

### CPU units

- 1 CPU = 1 AWS vCPU / 1 GCP core / 1 Azure vCore
- 0.5 CPU = 500m (millicores)
- 100m = 0.1 core = 10% of one CPU's time
### Memory units

- Bytes: no suffix
- Ki/Mi/Gi/Ti: binary suffixes (powers of 1024)
- k/M/G/T: decimal suffixes (powers of 1000; note that kilo is a lowercase k)
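As a quick check of these rules, here is a minimal Python sketch of the conversion. It is an illustration only, not the official Kubernetes quantity parser (apimachinery's `resource.Quantity`), which also supports exponent notation and the larger P/E suffixes:

```python
def parse_cpu(quantity: str) -> float:
    """CPU cores: '250m' -> 0.25, '1' -> 1.0."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)

def parse_memory(quantity: str) -> int:
    """Memory in bytes: binary suffixes use 1024, decimal suffixes use 1000."""
    binary = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}
    decimal = {"k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4}
    # check two-character binary suffixes before one-character decimal ones
    for suffix, factor in {**binary, **decimal}.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    return int(quantity)  # bare number: plain bytes
```

The difference matters in practice: `128Mi` is 134,217,728 bytes, while `128M` is only 128,000,000 bytes.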
### Requests and Limits in detail

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resource-example
spec:
  containers:
  - name: app
    image: myapp:v1
    resources:
      requests:
        cpu: "100m"                # at least 0.1 core
        memory: "128Mi"            # at least 128 MiB
        ephemeral-storage: "1Gi"   # at least 1 GiB of scratch space
      limits:
        cpu: "200m"                # at most 0.2 core
        memory: "256Mi"            # at most 256 MiB
        ephemeral-storage: "2Gi"   # at most 2 GiB of scratch space
```

### Resource allocation scenarios
```yaml
# Scenario 1: requests only
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
# Result: the scheduler reserves the minimum; at runtime the container may use more (no ceiling)

# Scenario 2: limits only
resources:
  limits:
    cpu: "200m"
    memory: "256Mi"
# Result: requests default to the limits (Guaranteed QoS)

# Scenario 3: both requests and limits
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "200m"
    memory: "256Mi"
# Result: minimum reserved at scheduling, maximum enforced at runtime (Burstable QoS)

# Scenario 4: neither
resources: {}
# Result: nothing reserved at scheduling, nothing enforced at runtime (BestEffort QoS)
```

### Multi-container Pod resources
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: multi-container-pod
spec:
  containers:
  - name: app
    image: myapp:v1
    resources:
      requests:
        cpu: "200m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"
  - name: sidecar
    image: log-collector:v1
    resources:
      requests:
        cpu: "50m"
        memory: "64Mi"
      limits:
        cpu: "100m"
        memory: "128Mi"
# Pod totals = sum over all containers
# Requests: 250m CPU, 320Mi memory
# Limits: 600m CPU, 640Mi memory
```

### Resource overcommit
```yaml
# Node allocation example
# Node capacity: 4 CPU, 8Gi memory

# Pod 1
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "1"
    memory: "2Gi"

# Pod 2 (identical)
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "1"
    memory: "2Gi"

# Pod 3 (identical)
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "1"
    memory: "2Gi"

# Result:
# - Total requests: 1.5 CPU, 3Gi memory (schedulable)
# - Total limits: 3 CPU, 6Gi memory (overcommit is allowed)
# - Limits as a share of capacity: CPU 75%, memory 75%
```

## ResourceQuota
### Basic configuration

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: development
spec:
  hard:
    requests.cpu: "4"             # total CPU requests <= 4 cores
    requests.memory: 8Gi          # total memory requests <= 8Gi
    limits.cpu: "8"               # total CPU limits <= 8 cores
    limits.memory: 16Gi           # total memory limits <= 16Gi
    pods: "10"                    # at most 10 Pods
    persistentvolumeclaims: "5"   # at most 5 PVCs
    requests.storage: "50Gi"      # total storage requests <= 50Gi
```

### Full quota example
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: full-quota
  namespace: production
spec:
  hard:
    # compute
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    # storage
    persistentvolumeclaims: "10"
    requests.storage: "100Gi"
    # object counts
    pods: "50"
    services: "10"
    secrets: "20"
    configmaps: "20"
    replicationcontrollers: "5"
    # counts for specific resource types
    count/deployments.apps: "10"
    count/statefulsets.apps: "5"
    count/jobs.batch: "20"
    count/cronjobs.batch: "10"
```

### Quotas by priority class
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pods-high
  namespace: development
spec:
  hard:
    cpu: "10"
    memory: "20Gi"
    pods: "10"
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - high
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pods-medium
  namespace: development
spec:
  hard:
    cpu: "5"
    memory: "10Gi"
    pods: "10"
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - medium
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pods-low
  namespace: development
spec:
  hard:
    cpu: "2"
    memory: "4Gi"
    pods: "5"
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - low
```

### Quota scopes
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: scoped-quota
  namespace: development
spec:
  hard:
    pods: "10"        # a BestEffort-scoped quota may only track Pod count
  scopes:
  - BestEffort        # applies to Pods in the BestEffort QoS class
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: not-terminating-quota
  namespace: development
spec:
  hard:
    pods: "5"
    cpu: "3"
    memory: "6Gi"
  scopes:
  - NotTerminating    # Pods without activeDeadlineSeconds (long-running services)
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: terminating-quota
  namespace: development
spec:
  hard:
    pods: "5"
    cpu: "2"
    memory: "4Gi"
  scopes:
  - Terminating       # Pods with activeDeadlineSeconds set (jobs that will finish)
```

## LimitRange Defaults
### Basic configuration

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-mem-limit-range
  namespace: development
spec:
  limits:
  - type: Container
    default:               # default limits
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:        # default requests
      cpu: "250m"
      memory: "256Mi"
    max:                   # upper bound
      cpu: "2"
      memory: "2Gi"
    min:                   # lower bound
      cpu: "50m"
      memory: "64Mi"
    maxLimitRequestRatio:  # maximum limits/requests ratio
      cpu: 4
      memory: 2
```

### Full LimitRange example
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: comprehensive-limit-range
  namespace: production
spec:
  limits:
  # container-level bounds
  - type: Container
    default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "200m"
      memory: "256Mi"
    max:
      cpu: "4"
      memory: "8Gi"
    min:
      cpu: "50m"
      memory: "64Mi"
  # pod-level bounds
  - type: Pod
    max:
      cpu: "8"
      memory: "16Gi"
  # PVC bounds
  - type: PersistentVolumeClaim
    max:
      storage: "50Gi"
    min:
      storage: "1Gi"
  # ImageStream bounds (OpenShift-specific, not vanilla Kubernetes)
  - type: ImageStream
    max:
      openshift.io/images: "10"
```

### LimitRange in action
```yaml
# LimitRange configuration
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: default
spec:
  limits:
  - type: Container
    default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "250m"
      memory: "256Mi"
---
# Pod with no resources specified
apiVersion: v1
kind: Pod
metadata:
  name: auto-resource-pod
spec:
  containers:
  - name: app
    image: nginx:1.20
    # no resources block: the LimitRange defaults are applied at admission

# Effective configuration after admission:
# Requests: cpu=250m, memory=256Mi
# Limits: cpu=500m, memory=512Mi
```

## QoS (Quality of Service)
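Before walking through each class, note that the QoS class is derived purely from the containers' requests and limits. The assignment rules detailed below can be summarized in a simplified Python sketch (an approximation: it considers only cpu/memory on regular containers and ignores init containers):

```python
def qos_class(containers: list[dict]) -> str:
    """Simplified QoS derivation. Each container is a dict like
    {'requests': {...}, 'limits': {...}} with cpu/memory quantity strings."""
    requests = [c.get("requests", {}) for c in containers]
    limits = [c.get("limits", {}) for c in containers]
    if not any(requests) and not any(limits):
        return "BestEffort"
    # Guaranteed: every container sets cpu and memory limits, and every
    # request (defaulting to the limit when unset) equals the limit.
    guaranteed = all(
        {"cpu", "memory"} <= set(lim)
        and all(req.get(k, lim[k]) == lim[k] for k in lim)
        for req, lim in zip(requests, limits)
    )
    return "Guaranteed" if guaranteed else "Burstable"
```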
### QoS classes

```yaml
# Guaranteed QoS (highest priority)
# Condition: requests == limits, with both CPU and memory set on every container
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
  - name: app
    image: nginx:1.20
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"
# Characteristics:
# - resources are fully reserved; not preempted first under pressure
# - last to be OOM-killed when the node runs out of memory
# - suited to critical applications
---
# Burstable QoS (medium priority)
# Condition: does not qualify as Guaranteed, but at least one container sets a request or limit
apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod
spec:
  containers:
  - name: app
    image: nginx:1.20
    resources:
      requests:
        cpu: "200m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"
# Characteristics:
# - a minimum is guaranteed
# - may be OOM-killed under memory pressure
# - suited to ordinary applications
---
# BestEffort QoS (lowest priority)
# Condition: no requests or limits set on any container
apiVersion: v1
kind: Pod
metadata:
  name: besteffort-pod
spec:
  containers:
  - name: app
    image: nginx:1.20
# Characteristics:
# - no resource guarantees
# - first to be OOM-killed under memory pressure
# - suited to batch and test workloads
```

### QoS and OOM scores
```bash
# oom_score_adj ranges from -1000 to 1000; the higher the value,
# the sooner the process is OOM-killed.

# Guaranteed Pod
#   oom_score_adj = -997 (killed last)
# Burstable Pod
#   oom_score_adj = min(max(2, 1000 - (1000 * memoryRequest) / memoryCapacity), 999)
# BestEffort Pod
#   oom_score_adj = 1000 (killed first)

# Inspect the score on the node: find the container's PID, then
cat /proc/<pid>/oom_score_adj
```

### QoS configuration best practices
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: critical
  template:
    metadata:
      labels:
        app: critical
      # the legacy scheduler.alpha.kubernetes.io/critical-pod annotation is
      # deprecated; use priorityClassName instead
    spec:
      containers:
      - name: app
        image: critical-app:v1
        resources:
          requests:
            cpu: "1"
            memory: "1Gi"
          limits:           # equal to requests => Guaranteed QoS
            cpu: "1"
            memory: "1Gi"
      priorityClassName: high-priority
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Critical applications"
```

## Useful kubectl Commands
### Inspecting resources

```bash
# Node resources
kubectl top nodes
kubectl describe node <node-name>

# Pod resource usage
kubectl top pods
kubectl top pods --all-namespaces
kubectl top pod <pod-name> --containers

# Resource quotas
kubectl get resourcequota -n <namespace>
kubectl describe resourcequota <quota-name> -n <namespace>

# LimitRanges
kubectl get limitrange -n <namespace>
kubectl describe limitrange <limitrange-name> -n <namespace>

# A Pod's QoS class
kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'

# A Pod's resource configuration
kubectl get pod <pod-name> -o yaml | grep -A 10 resources
```

### Managing quotas
```bash
# Create a quota
kubectl create quota compute-quota --hard=cpu=4,memory=8Gi,pods=10 -n development

# Create a LimitRange from a manifest
kubectl apply -f limitrange.yaml

# Edit a quota
kubectl edit resourcequota compute-quota -n development

# Delete a quota
kubectl delete resourcequota compute-quota -n development

# Namespace-wide summary (includes quotas and limit ranges)
kubectl describe namespace development
```

### Monitoring resources
```bash
# Node usage
kubectl top nodes

# Pod usage, sorted
kubectl top pods --sort-by=memory
kubectl top pods --sort-by=cpu

# Continuous monitoring
watch kubectl top nodes
watch kubectl top pods -n production

# Resource-related events
kubectl get events --field-selector reason=FailedScheduling
kubectl get events --sort-by='.lastTimestamp'

# Requests/limits of each Pod's first container
kubectl get pods -o custom-columns=\
'NAME:.metadata.name,MEM_REQ:.spec.containers[0].resources.requests.memory,MEM_LIM:.spec.containers[0].resources.limits.memory,CPU_REQ:.spec.containers[0].resources.requests.cpu,CPU_LIM:.spec.containers[0].resources.limits.cpu'
```

## Worked Examples
### Example 1: development-environment quotas

```yaml
# Development namespace
apiVersion: v1
kind: Namespace
metadata:
  name: development
  labels:
    environment: development
---
# Quota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-quota
  namespace: development
spec:
  hard:
    requests.cpu: "4"
    requests.memory: "8Gi"
    limits.cpu: "8"
    limits.memory: "16Gi"
    pods: "20"
    persistentvolumeclaims: "10"
    services: "10"
    secrets: "20"
    configmaps: "20"
---
# Default limits
apiVersion: v1
kind: LimitRange
metadata:
  name: dev-limits
  namespace: development
spec:
  limits:
  - type: Container
    default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"
    max:
      cpu: "2"
      memory: "2Gi"
    min:
      cpu: "50m"
      memory: "64Mi"
---
# Sample application
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dev-app
  namespace: development
spec:
  replicas: 3
  selector:
    matchLabels:
      app: dev-app
  template:
    metadata:
      labels:
        app: dev-app
    spec:
      containers:
      - name: app
        image: nginx:1.20
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"
```

Use case: cap resource usage in the development environment so it cannot starve the rest of the cluster.
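Whether a workload fits a quota is simple arithmetic: replicas × per-pod requests against the quota's hard values. A small sketch using the dev-quota and dev-app numbers above (a hypothetical helper for illustration, not a kubectl feature; it ignores quota already consumed by other workloads):

```python
def fits_quota(replicas: int, pod_cpu_m: int, pod_mem_mi: int,
               hard_cpu_m: int, hard_mem_mi: int) -> bool:
    """True if replicas x per-pod requests fit under the quota's hard
    values. CPU is in millicores, memory in MiB."""
    return (replicas * pod_cpu_m <= hard_cpu_m
            and replicas * pod_mem_mi <= hard_mem_mi)

# dev-app: 3 replicas x (100m, 128Mi) vs dev-quota (requests.cpu=4, requests.memory=8Gi)
fits_quota(3, 100, 128, 4000, 8 * 1024)    # fits: 300m <= 4000m, 384Mi <= 8192Mi
fits_quota(50, 100, 128, 4000, 8 * 1024)   # rejected: 5000m > 4000m
```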
### Example 2: multi-tenant isolation in production

```yaml
# Tenant A namespace
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-a
  labels:
    tenant: a
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-quota
  namespace: tenant-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    pods: "50"
    services: "20"
    persistentvolumeclaims: "20"
    requests.storage: "100Gi"
---
# Tenant B namespace
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-b
  labels:
    tenant: b
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-b-quota
  namespace: tenant-b
spec:
  hard:
    requests.cpu: "15"
    requests.memory: "30Gi"
    limits.cpu: "30"
    limits.memory: "60Gi"
    pods: "100"
    services: "30"
    persistentvolumeclaims: "30"
    requests.storage: "200Gi"
---
# Tenant A's application
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tenant-a-app
  namespace: tenant-a
spec:
  replicas: 5
  selector:
    matchLabels:
      app: tenant-a-app
  template:
    metadata:
      labels:
        app: tenant-a-app
    spec:
      containers:
      - name: app
        image: myapp:v1
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "1"
            memory: "2Gi"
---
# Tenant B's application
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tenant-b-app
  namespace: tenant-b
spec:
  replicas: 10
  selector:
    matchLabels:
      app: tenant-b-app
  template:
    metadata:
      labels:
        app: tenant-b-app
    spec:
      containers:
      - name: app
        image: myapp:v1
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "1"
            memory: "2Gi"
```

Use case: in a multi-tenant cluster, give each tenant an independent quota so no tenant can consume another's share.
### Example 3: GPU resource management

```yaml
# GPU quota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-team
spec:
  hard:
    requests.nvidia.com/gpu: "4"
    limits.nvidia.com/gpu: "4"
    pods: "10"
---
# GPU job
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
  namespace: ml-team
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:11.0-base
    command: ["python", "train.py"]
    resources:
      limits:
        nvidia.com/gpu: 1   # extended resources: requests are implied and must equal limits
---
# GPU job using a MIG slice
apiVersion: v1
kind: Pod
metadata:
  name: mig-pod
  namespace: ml-team
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:11.0-base
    command: ["python", "train.py"]
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1
```

Use case: manage GPUs and other special hardware so devices are allocated fairly.
## Troubleshooting Guide

### Diagnosing common problems

#### 1. Pod stuck in Pending (insufficient resources)

```bash
# Inspect Pod events
kubectl describe pod <pod-name>

# Typical event:
#   Type     Reason            Age   From               Message
#   ----     ------            ----  ----               -------
#   Warning  FailedScheduling  10s   default-scheduler  0/3 nodes are available: 3 Insufficient cpu.

# Steps
# 1. Check node resources
kubectl top nodes
kubectl describe nodes | grep -A 5 "Allocated resources"
# 2. Check quotas
kubectl describe resourcequota -n <namespace>
# 3. Check the Pod's requests
kubectl get pod <pod-name> -o yaml | grep -A 10 resources

# Fixes
# - add node capacity
# - lower the Pod's requests
# - delete unneeded Pods
# - raise the quota
```

#### 2. OOMKilled
```bash
# Pod status
kubectl get pod <pod-name> -o wide

# Pod events
kubectl describe pod <pod-name>

# Typical status:
#   Last State:  Terminated
#     Reason:    OOMKilled
#     Exit Code: 137

# Steps
# 1. Check the memory limit
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].resources.limits.memory}'
# 2. Check container memory usage
kubectl top pod <pod-name> --containers
# 3. Check logs of the previous instance
kubectl logs <pod-name> --previous

# Fixes
# - raise the memory limit
# - reduce the application's memory footprint
# - look for memory leaks
# - tune JVM flags (Java applications)
```

#### 3. Quota exceeded
```bash
# Quota status
kubectl describe resourcequota -n <namespace>

# Typical error:
#   Error from server (Forbidden): error when creating "deployment.yaml": deployments.apps is forbidden:
#   exceeded quota: compute-quota, requested: requests.cpu=500m, used: requests.cpu=4, limited: requests.cpu=4

# Steps
# 1. Current quota usage
kubectl get resourcequota -n <namespace> -o yaml
# 2. Resources in the namespace
kubectl get all -n <namespace>
# 3. Per-Pod usage
kubectl top pods -n <namespace>

# Fixes
# - delete unneeded resources
# - raise the quota
# - right-size the requests
```

#### 4. CPU throttling
```bash
# Container CPU usage
kubectl top pod <pod-name> --containers

# Container CPU limit
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].resources.limits.cpu}'

# Check throttling counters (cgroup v1 path; on cgroup v2 use /sys/fs/cgroup/cpu.stat)
kubectl exec -it <pod-name> -- cat /sys/fs/cgroup/cpu/cpu.stat

# Typical output:
#   nr_periods 12345
#   nr_throttled 6789
#   throttled_time 1234567890

# Fixes
# - raise the CPU limit
# - optimize the application
# - rebalance requests vs. limits
```

### Monitoring script
```bash
#!/bin/bash
# Cluster resource monitoring script

echo "=== Cluster overview ==="
kubectl top nodes

echo -e "\n=== Pod usage (top 20) ==="
kubectl top pods --all-namespaces | head -20

echo -e "\n=== Quota status ==="
for ns in $(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}'); do
  quota=$(kubectl get resourcequota -n "$ns" -o name 2>/dev/null)
  if [ -n "$quota" ]; then
    echo "Namespace: $ns"
    kubectl describe resourcequota -n "$ns" | grep -A 20 "Used\|Hard"
  fi
done

echo -e "\n=== Pending Pods ==="
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

echo -e "\n=== OOMKilled Pods ==="
# the '?' guards against Pods with no containerStatuses yet
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[] | select(.status.containerStatuses[]?.lastState.terminated.reason=="OOMKilled") | .metadata.name'
```

### Right-sizing requests and limits
```yaml
# Before
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "4"
    memory: "8Gi"
# Problem: limits are 40-64x the requests, which hides the real footprint and wastes capacity

# After
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "1"
    memory: "1Gi"
# Guideline: keep limits within 2-4x requests
```

## Best Practices
### 1. Resource configuration strategy

```yaml
# Recommended for production
resources:
  requests:
    cpu: "500m"      # ~80% of observed usage
    memory: "512Mi"  # ~80% of observed usage
  limits:
    cpu: "1"         # 2x requests
    memory: "1Gi"    # 2x requests

# Recommended for development
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

# Recommended for testing
resources:
  requests:
    cpu: "50m"
    memory: "64Mi"
  limits:
    cpu: "200m"
    memory: "256Mi"
```

### 2. Quota planning
```yaml
# Namespace quota plan
# Cluster total: 100 CPU, 200Gi memory

# Production (50% of requests)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "50"
    requests.memory: "100Gi"
    limits.cpu: "100"
    limits.memory: "200Gi"
---
# Staging (20% of requests)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: staging-quota
  namespace: staging
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
---
# Development (30% of requests)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: development-quota
  namespace: development
spec:
  hard:
    requests.cpu: "30"
    requests.memory: "60Gi"
    limits.cpu: "60"
    limits.memory: "120Gi"
```

### 3. QoS best practices
```yaml
# Critical applications: Guaranteed QoS
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app
spec:
  template:
    spec:
      containers:
      - name: app
        resources:
          requests:
            cpu: "1"
            memory: "1Gi"
          limits:
            cpu: "1"       # equal to requests
            memory: "1Gi"  # equal to requests
      priorityClassName: high-priority
---
# Ordinary applications: Burstable QoS
apiVersion: apps/v1
kind: Deployment
metadata:
  name: normal-app
spec:
  template:
    spec:
      containers:
      - name: app
        resources:
          requests:
            cpu: "200m"
            memory: "256Mi"
          limits:
            cpu: "500m"      # greater than requests
            memory: "512Mi"  # greater than requests
---
# Batch jobs: BestEffort QoS
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-job
spec:
  template:
    spec:
      containers:
      - name: batch
        # no resources block
```

### 4. Monitoring and alerting
```yaml
# Example Prometheus rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: resource-alerts
spec:
  groups:
  - name: resource-alerts
    rules:
    - alert: NodeMemoryUsageHigh
      expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Node memory usage is high"
        description: "Node {{ $labels.instance }} memory usage is {{ $value }}%"
    - alert: PodCPUThrottlingHigh
      expr: rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod CPU throttling is high"
        description: "Pod {{ $labels.pod }} CPU throttling is {{ $value }}"
    - alert: NamespaceQuotaExceeded
      # ignoring(type) is needed so the used/hard series match
      expr: kube_resourcequota{type="used"} / ignoring(type) kube_resourcequota{type="hard"} > 0.9
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Namespace quota is almost exhausted"
        description: "Namespace {{ $labels.namespace }} quota usage is {{ $value }}"
```

### 5. Optimization checklist
```markdown
## Resource optimization checklist

### Pod level
- [ ] Set requests and limits on every container
- [ ] Keep the limits/requests ratio reasonable (2-4x)
- [ ] Adjust configuration based on observed usage
- [ ] Choose an appropriate QoS class
- [ ] Configure resource monitoring and alerting

### Namespace level
- [ ] Cap total resources with a ResourceQuota
- [ ] Provide defaults with a LimitRange
- [ ] Review usage regularly
- [ ] Clean up unused resources

### Cluster level
- [ ] Monitor node utilization
- [ ] Set up cluster autoscaling
- [ ] Plan the overcommit ratio
- [ ] Enforce resource isolation
```

## Summary
### Key points

**Requests and Limits**
- Requests drive scheduling and guarantee a minimum
- Limits are enforced at runtime and prevent resource abuse
- Keep the ratio sensible to avoid waste

**Quota management**
- ResourceQuota caps a namespace's total resources
- LimitRange supplies defaults and bounds
- Allocate by environment and tenant

**QoS**
- Guaranteed: critical applications, fully reserved
- Burstable: ordinary applications, elastic headroom
- BestEffort: batch jobs, no guarantees

**Monitoring and optimization**
- Monitor usage continuously
- Re-tune configurations regularly
- Wire up resource alerting
### Command cheat sheet

```bash
# Inspect
kubectl top nodes
kubectl top pods --all-namespaces
kubectl describe resourcequota -n <namespace>

# Quotas
kubectl create quota <name> --hard=cpu=4,memory=8Gi -n <namespace>
kubectl apply -f limitrange.yaml

# Troubleshoot
kubectl describe pod <pod-name>
kubectl logs <pod-name> --previous
kubectl get events --field-selector reason=FailedScheduling

# Monitor
watch kubectl top nodes
kubectl top pods --sort-by=memory
```