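Metrics Basics
Overview
The Kubernetes monitoring stack is a core part of observability for cloud-native applications. This chapter covers basic Kubernetes monitoring concepts, the Metrics API, resource monitoring, and performance metrics.
Kubernetes Monitoring Architecture
Monitoring System Components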
┌─────────────────────────────────────────────────────────┐
│                   Monitoring data flow                  │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  App containers → cAdvisor → kubelet → Metrics Server   │
│                                                         │
│  Metrics Server → Metrics API → kubectl top             │
│                                                         │
│  kubelet / cAdvisor → Prometheus → Grafana              │
│                                                         │
└─────────────────────────────────────────────────────────┘
Core Components
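- cAdvisor: built into the kubelet; collects container resource usage data
- kubelet: the node agent; aggregates node and Pod metrics
- Metrics Server: aggregates resource usage data across the cluster
- Metrics API: exposes resource usage metrics through the Kubernetes API (metrics.k8s.io)
Deploying Metrics Server
Installing Metrics Server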
yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    k8s-app: metrics-server
    rbac.authorization.k8s.io/aggregate-to-admin: "true"
    rbac.authorization.k8s.io/aggregate-to-edit: "true"
    rbac.authorization.k8s.io/aggregate-to-view: "true"
  name: system:aggregated-metrics-reader
rules:
- apiGroups:
  - metrics.k8s.io
  resources:
  - pods
  - nodes
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    k8s-app: metrics-server
  name: system:metrics-server
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server-auth-reader
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: extension-apiserver-authentication-reader
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server:system:auth-delegator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:auth-delegator
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    k8s-app: metrics-server
  name: system:metrics-server
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:metrics-server
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
---
apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server
  namespace: kube-system
spec:
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: https
  selector:
    k8s-app: metrics-server
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: metrics-server
  strategy:
    rollingUpdate:
      maxUnavailable: 0
  template:
    metadata:
      labels:
        k8s-app: metrics-server
    spec:
      containers:
      - args:
        - --cert-dir=/tmp
        - --secure-port=4443
        - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
        - --kubelet-use-node-status-port
        - --metric-resolution=15s
        # Skips kubelet certificate verification; use only on test clusters
        - --kubelet-insecure-tls
        # k8s.gcr.io is frozen; registry.k8s.io is the current image registry
        image: registry.k8s.io/metrics-server/metrics-server:v0.6.2
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /livez
            port: https
            scheme: HTTPS
          periodSeconds: 10
        name: metrics-server
        ports:
        - containerPort: 4443
          name: https
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readyz
            port: https
            scheme: HTTPS
          initialDelaySeconds: 20
          periodSeconds: 10
        resources:
          requests:
            cpu: 100m
            memory: 200Mi
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 1000
        volumeMounts:
        - mountPath: /tmp
          name: tmp-dir
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-cluster-critical
      serviceAccountName: metrics-server
      volumes:
      - emptyDir: {}
        name: tmp-dir
---
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  labels:
    k8s-app: metrics-server
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: metrics-server
    namespace: kube-system
  version: v1beta1
  versionPriority: 100
Quick Install Commands
bash
# Install with kubectl apply
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Verify the installation
kubectl get pods -n kube-system -l k8s-app=metrics-server
# Check the Metrics Server Deployment status
kubectl get deployment metrics-server -n kube-system
Resource Monitoring
Node Resource Monitoring
bash
# Show resource usage per node
kubectl top nodes
# Show detailed information for one node
kubectl describe node <node-name>
# Show allocatable resources per node
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
CPU:.status.allocatable.cpu,\
MEMORY:.status.allocatable.memory,\
PODS:.status.allocatable.pods
Pod Resource Monitoring
bash
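# Show resource usage for Pods in the current namespace
kubectl top pods
# Show Pods in a specific namespace
kubectl top pods -n <namespace>
# Show a Pod together with its containers
kubectl top pod <pod-name> --containers
# Sort by resource usage
kubectl top pods --sort-by=memory
kubectl top pods --sort-by=cpu
Container Resource Monitoring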
bash
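# Show resource usage for each container in a Pod
kubectl top pod <pod-name> --containers
# Show resource usage for all containers in all namespaces
kubectl top pods --all-namespaces --containers
Performance Metrics
Core Metric Types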
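1. CPU metrics
- CPU usage: CPU actually consumed by the container, measured in cores
- CPU limit: the CPU limit configured for the container
- CPU request: the CPU request configured for the container
- CPU utilization: CPU usage / CPU limit
2. Memory metrics
- Memory usage: memory actually used by the container
- Memory limit: the memory limit configured for the container
- Memory request: the memory request configured for the container
- Memory utilization: memory usage / memory limit
3. Storage metrics (from the kubelet endpoints; the Metrics API reports only CPU and memory)
- Storage usage: space used on persistent volumes
- Storage limit: the storage quota limit
- inode usage: inodes in use
4. Network metrics (from cAdvisor, likewise outside the Metrics API)
- Bytes received: amount of data received
- Bytes sent: amount of data sent
- Packets received: number of packets received
- Packets sent: number of packets sent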
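The utilization ratios in the CPU and memory lists (usage divided by limit) can be sketched in a few lines of Python. This is only an illustration: the helper names `cpu_to_millicores`, `memory_to_bytes`, and `utilization` are hypothetical, the sample quantities are made up, and only the most common Kubernetes quantity suffixes are handled.

```python
# Hypothetical helpers for illustration; names and sample values are assumptions.

def cpu_to_millicores(quantity: str) -> float:
    """Convert a CPU quantity such as '250m', '1500000n' or '2' to millicores."""
    if quantity.endswith("n"):       # nanocores
        return int(quantity[:-1]) / 1_000_000
    if quantity.endswith("u"):       # microcores
        return int(quantity[:-1]) / 1_000
    if quantity.endswith("m"):       # millicores
        return float(quantity[:-1])
    return float(quantity) * 1000    # whole cores


_MEMORY_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '128Mi' or a plain byte count to bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[:-2]) * factor
    return int(quantity)


def utilization(usage: float, limit: float):
    """Return usage / limit as a percentage, or None when no limit is set."""
    if not limit:
        return None
    return 100.0 * usage / limit


if __name__ == "__main__":
    cpu_pct = utilization(cpu_to_millicores("250m"), cpu_to_millicores("1"))
    mem_pct = utilization(memory_to_bytes("128Mi"), memory_to_bytes("512Mi"))
    print(f"CPU {cpu_pct:.1f}%  memory {mem_pct:.1f}%")
```

Returning `None` for a missing limit is a deliberate choice in this sketch: a container without a limit has no meaningful utilization percentage, and a caller should treat that case separately rather than divide by zero.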
Metric Query Examples
Querying with kubectl
bash
# Query node metrics (CPU and memory usage)
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes | jq .
# Query metrics for Pods in the default namespace
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/default/pods | jq .
# Query metrics for a specific Pod
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/default/pods/<pod-name> | jq .
Querying via the API
bash
# Get node metrics
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes" | jq '.items[] | {name: .metadata.name, cpu: .usage.cpu, memory: .usage.memory}'
# Get Pod metrics, one entry per container (avoids duplicated rows for multi-container Pods)
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/kube-system/pods" | jq '.items[] | {name: .metadata.name, containers: [.containers[] | {name, cpu: .usage.cpu, memory: .usage.memory}]}'
Practical Examples
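Before the scripted examples, here is a small Python sketch of how the per-container values in a Metrics API Pod list might be rolled up into one total per Pod. The `SAMPLE` document is invented for illustration (the Pod and container names are hypothetical), the function names are assumptions, and the quantity parsing covers only common suffixes.

```python
import json

# Invented sample shaped like a PodMetricsList response; not real cluster data.
SAMPLE = """
{
  "kind": "PodMetricsList",
  "items": [
    {
      "metadata": {"name": "web-0", "namespace": "default"},
      "containers": [
        {"name": "app", "usage": {"cpu": "12345678n", "memory": "20480Ki"}},
        {"name": "sidecar", "usage": {"cpu": "5m", "memory": "8Mi"}}
      ]
    }
  ]
}
"""


def cpu_millicores(q: str) -> float:
    """CPU quantity to millicores (handles n, u, m suffixes and whole cores)."""
    scale = {"n": 1e-6, "u": 1e-3, "m": 1.0}
    if q[-1] in scale:
        return int(q[:-1]) * scale[q[-1]]
    return float(q) * 1000


def memory_bytes(q: str) -> int:
    """Memory quantity to bytes (handles Ki, Mi, Gi and plain byte counts)."""
    units = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}
    if q[-2:] in units:
        return int(q[:-2]) * units[q[-2:]]
    return int(q)


def pod_totals(pod_metrics_list: dict) -> dict:
    """Map 'namespace/name' to (total millicores, total bytes) across containers."""
    totals = {}
    for item in pod_metrics_list["items"]:
        key = f"{item['metadata']['namespace']}/{item['metadata']['name']}"
        cpu = sum(cpu_millicores(c["usage"]["cpu"]) for c in item["containers"])
        mem = sum(memory_bytes(c["usage"]["memory"]) for c in item["containers"])
        totals[key] = (cpu, mem)
    return totals


if __name__ == "__main__":
    for pod, (cpu, mem) in pod_totals(json.loads(SAMPLE)).items():
        print(f"{pod}: {cpu:.1f}m CPU, {mem / 1024**2:.1f}Mi memory")
```

In practice the input could come from `kubectl get --raw` piped into the script instead of the embedded sample.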
Example 1: Resource usage monitoring script
bash
#!/bin/bash
# monitor-resources.sh
echo "=== Node resource usage ==="
kubectl top nodes
echo ""
echo "=== Pod resource usage across namespaces (first 20 lines) ==="
kubectl top pods --all-namespaces | head -20
echo ""
echo "=== Highest CPU usage Pods (Top 10) ==="
kubectl top pods --all-namespaces --sort-by=cpu | head -11
echo ""
echo "=== Highest memory usage Pods (Top 10) ==="
kubectl top pods --all-namespaces --sort-by=memory | head -11
echo ""
echo "=== Resource quota usage ==="
kubectl get resourcequotas --all-namespaces
Example 2: Resource alert check
yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: resource-alert
  namespace: monitoring
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: alert
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              # Flag Pods using more than 800m CPU (an absolute value, not a percentage of the limit)
              kubectl top pods --all-namespaces | awk 'NR>1 {if ($3 ~ /[0-9]+m/ && $3+0 > 800) print $1"/"$2": CPU usage "$3}'
              # Flag Pods using more than 80Mi of memory (also an absolute value)
              kubectl top pods --all-namespaces | awk 'NR>1 {if ($4 ~ /[0-9]+Mi/ && $4+0 > 80) print $1"/"$2": Memory usage "$4}'
          restartPolicy: OnFailure
          serviceAccountName: monitor-sa
Example 3: Custom metrics export
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-exporter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: metrics-exporter
  template:
    metadata:
      labels:
        app: metrics-exporter
    spec:
      containers:
      - name: exporter
        image: python:3.9-slim
        command:
        - /bin/sh
        - -c
        - |
          pip install prometheus_client requests
          python <<'EOF'
          import os
          import time
          import prometheus_client as prom
          import requests

          API_URL = 'https://kubernetes.default.svc/apis/metrics.k8s.io/v1beta1/pods'
          # CA bundle mounted into the Pod with its service account; replaces verify=False
          CA_CERT = '/var/run/secrets/kubernetes.io/serviceaccount/ca.crt'

          # Define the gauges
          cpu_usage = prom.Gauge('pod_cpu_usage_millicores', 'Pod CPU usage in millicores', ['namespace', 'pod'])
          memory_usage = prom.Gauge('pod_memory_usage_bytes', 'Pod memory usage in bytes', ['namespace', 'pod'])

          def parse_cpu(cpu_str):
              # Convert a CPU quantity to millicores
              if cpu_str.endswith('n'):
                  return int(cpu_str[:-1]) / 1000000
              elif cpu_str.endswith('u'):
                  return int(cpu_str[:-1]) / 1000
              elif cpu_str.endswith('m'):
                  return int(cpu_str[:-1])
              else:
                  return int(cpu_str) * 1000

          def parse_memory(mem_str):
              # Convert a memory quantity to bytes
              if mem_str.endswith('Ki'):
                  return int(mem_str[:-2]) * 1024
              elif mem_str.endswith('Mi'):
                  return int(mem_str[:-2]) * 1024 * 1024
              elif mem_str.endswith('Gi'):
                  return int(mem_str[:-2]) * 1024 * 1024 * 1024
              else:
                  return int(mem_str)

          def collect_metrics():
              # Call the Metrics API
              response = requests.get(
                  API_URL,
                  headers={'Authorization': 'Bearer ' + os.environ['TOKEN']},
                  verify=CA_CERT,
                  timeout=10,
              )
              if response.status_code != 200:
                  return
              for item in response.json()['items']:
                  ns = item['metadata']['namespace']
                  pod = item['metadata']['name']
                  # Sum over containers so each Pod gets one value instead of
                  # the last container overwriting the others
                  cpu = sum(parse_cpu(c['usage']['cpu']) for c in item['containers'])
                  memory = sum(parse_memory(c['usage']['memory']) for c in item['containers'])
                  cpu_usage.labels(ns, pod).set(cpu)
                  memory_usage.labels(ns, pod).set(memory)

          if __name__ == '__main__':
              prom.start_http_server(8080)
              while True:
                  collect_metrics()
                  time.sleep(30)
          EOF
        ports:
        - containerPort: 8080
          name: metrics
        env:
        - name: TOKEN
          valueFrom:
            secretKeyRef:
              name: metrics-token
              key: token
---
apiVersion: v1
kind: Service
metadata:
  name: metrics-exporter
  namespace: monitoring
spec:
  selector:
    app: metrics-exporter
  ports:
  - port: 8080
    targetPort: 8080
Troubleshooting
Common Issues
1. Metrics Server fails to start
bash
# Check the Pod status
kubectl get pods -n kube-system -l k8s-app=metrics-server
# View the Pod logs
kubectl logs -n kube-system -l k8s-app=metrics-server
# Check the APIService status
kubectl get apiservice v1beta1.metrics.k8s.io -o yaml
# Look for metrics-related TLS certificate Secrets
kubectl get secret -n kube-system | grep metrics
2. kubectl top fails
bash
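# Check that Metrics Server is running
kubectl get deployment metrics-server -n kube-system
# Check that the Service exists
kubectl get svc metrics-server -n kube-system
# Test access to the API
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
# Check RBAC permissions (shown for the anonymous user; change or drop --as to test another identity)
kubectl auth can-i get nodes.metrics.k8s.io --as=system:anonymous
3. Metric data is inaccurate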
bash
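# Check the kubelet metrics endpoint (a 401 response means the kubelet requires a bearer token)
curl -k https://<node-ip>:10250/metrics
# Check the cAdvisor metrics
curl -k https://<node-ip>:10250/metrics/cadvisor
# Restart Metrics Server
kubectl rollout restart deployment/metrics-server -n kube-system
4. Resource usage shows as 0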
bash
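# Check the container resource limits
kubectl describe pod <pod-name> | grep -A 5 Limits
# Check the kubelet configuration (run on the node)
ps aux | grep kubelet
# Verify the cgroup CPU accounting that cAdvisor reads (cgroup v1 path; on cgroup v2 read /sys/fs/cgroup/cpu.stat)
kubectl exec <pod-name> -- cat /sys/fs/cgroup/cpu/cpuacct.usage
Best Practices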
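1. Monitoring strategy
- Set a reasonable monitoring interval (15-30 seconds)
- Configure resource alert thresholds
- Implement multi-level alerting
- Review monitoring data regularly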
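One way to combine alert thresholds with multi-level alerting is sketched below in Python. The 80% and 90% thresholds, the level names, and the three-sample window are all assumptions chosen for illustration, as are the function names; real values depend on the workload.

```python
from typing import List, Optional

# Assumed thresholds, highest first: (minimum utilization in percent, level name).
LEVELS = [(90.0, "critical"), (80.0, "warning")]


def alert_level(utilization_pct: float) -> Optional[str]:
    """Return the most severe level whose threshold the value reaches, if any."""
    for threshold, level in LEVELS:
        if utilization_pct >= threshold:
            return level
    return None


def sustained_level(samples: List[float], min_consecutive: int = 3) -> Optional[str]:
    """Alert only if the last `min_consecutive` samples all reach a level.

    Using the minimum of the recent window means a single spike does not
    raise an alert; the level reported is the one every recent sample reached.
    """
    if len(samples) < min_consecutive:
        return None
    return alert_level(min(samples[-min_consecutive:]))


if __name__ == "__main__":
    history = [50.0, 85.0, 92.0, 95.0]   # made-up utilization samples
    print(sustained_level(history))
```

With samples taken every 15 to 30 seconds, a three-sample window corresponds to roughly one minute of sustained load before an alert fires.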
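2. Resource planning
- Set resource requests and limits based on monitoring data
- Reserve enough resources for system components
- Track resource usage trends
- Implement autoscaling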
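As a rough illustration of setting requests from monitoring data, the sketch below derives a CPU request from observed usage samples by taking a high percentile and adding headroom. The nearest-rank percentile, the 95th percentile default, and the 20% headroom are assumptions for this example, not a recommendation from Kubernetes, and the sample values are made up.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[int], pct: int) -> int:
    """Nearest-rank percentile of integer samples (pct between 1 and 100)."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_request(samples: Sequence[int], pct: int = 95, headroom_pct: int = 20) -> int:
    """Suggest a request: the chosen percentile plus a headroom margin, rounded up."""
    return math.ceil(percentile(samples, pct) * (100 + headroom_pct) / 100)


if __name__ == "__main__":
    # Made-up CPU usage samples in millicores, e.g. collected from kubectl top
    usage = [100, 120, 110, 300, 130, 125, 115, 105, 140, 135]
    print(f"suggested CPU request: {suggest_request(usage)}m")
```

A lower percentile ignores rare spikes and yields a tighter request; whether that is acceptable depends on how the workload behaves when throttled.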
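3. Performance optimization
- Tune the Metrics Server configuration
- Adjust the data retention period
- Implement metric aggregation
- Use efficient query methods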
4. Security configuration
- Enable TLS encryption
- Configure RBAC permissions
- Restrict API access
- Audit access to monitoring data
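For the TLS item, a client running inside the cluster can verify the API server certificate against the CA bundle mounted with the Pod's service account instead of disabling verification. The sketch below builds the keyword arguments such a client might pass to `requests.get`; the function name `in_cluster_request_kwargs` is hypothetical, and the code assumes the default service account mount is present.

```python
from pathlib import Path

# Default location where Kubernetes mounts service account credentials in a Pod.
SA_DIR = Path("/var/run/secrets/kubernetes.io/serviceaccount")


def in_cluster_request_kwargs(sa_dir: Path = SA_DIR) -> dict:
    """Build keyword arguments for an authenticated, TLS-verified API request.

    Reads the bearer token and points `verify` at the cluster CA bundle,
    so the caller does not need verify=False.
    """
    token = (sa_dir / "token").read_text().strip()
    ca_path = sa_dir / "ca.crt"
    if not ca_path.is_file():
        raise FileNotFoundError(f"CA bundle not found: {ca_path}")
    return {
        "headers": {"Authorization": f"Bearer {token}"},
        "verify": str(ca_path),
        "timeout": 10,
    }

# Usage inside a Pod (requires the third-party requests package):
#   import requests
#   url = "https://kubernetes.default.svc/apis/metrics.k8s.io/v1beta1/nodes"
#   response = requests.get(url, **in_cluster_request_kwargs())
```

The service account still needs RBAC permission to read `metrics.k8s.io` resources; authentication alone does not grant access.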
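5. High-availability deployment
- Run multiple Metrics Server replicas
- Configure Pod anti-affinity
- Implement automatic failure recovery
- Back up the monitoring configuration regularly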
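Summary
The Kubernetes monitoring stack is the foundation for operating cloud-native applications. With Metrics Server and the Metrics API, you can monitor resource usage of the cluster and its applications in near real time, which provides the data needed for capacity planning, performance tuning, and troubleshooting.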