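Metrics Fundamentals

Overview

The Kubernetes monitoring stack is a core part of observability for cloud-native applications. This chapter covers the basic concepts of Kubernetes monitoring, the Metrics API, resource monitoring, and performance metrics.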

Kubernetes Monitoring Architecture

Monitoring System Composition

┌─────────────────────────────────────────────────────────┐
│                  Monitoring data flow                   │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Containers → cAdvisor → kubelet → Metrics Server       │
│                                                         │
│  Metrics Server → Metrics API → kubectl top             │
│                                                         │
│  kubelet / cAdvisor → Prometheus → Grafana              │
│                                                         │
└─────────────────────────────────────────────────────────┘

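Core Components

  1. cAdvisor: built into the kubelet; collects container resource usage data
  2. kubelet: the node agent; aggregates node and Pod metrics
  3. Metrics Server: aggregates resource usage data across the cluster
  4. Metrics API: exposes resource usage metrics through the Kubernetes API (the metrics.k8s.io group)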

Deploying Metrics Server

Installing Metrics Server

yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    k8s-app: metrics-server
    rbac.authorization.k8s.io/aggregate-to-admin: "true"
    rbac.authorization.k8s.io/aggregate-to-edit: "true"
    rbac.authorization.k8s.io/aggregate-to-view: "true"
  name: system:aggregated-metrics-reader
rules:
- apiGroups:
  - metrics.k8s.io
  resources:
  - pods
  - nodes
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    k8s-app: metrics-server
  name: system:metrics-server
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server-auth-reader
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: extension-apiserver-authentication-reader
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server:system:auth-delegator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:auth-delegator
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    k8s-app: metrics-server
  name: system:metrics-server
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:metrics-server
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
---
apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server
  namespace: kube-system
spec:
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: https
  selector:
    k8s-app: metrics-server
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: metrics-server
  strategy:
    rollingUpdate:
      maxUnavailable: 0
  template:
    metadata:
      labels:
        k8s-app: metrics-server
    spec:
      containers:
      - args:
        - --cert-dir=/tmp
        - --secure-port=4443
        - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
        - --kubelet-use-node-status-port
        - --metric-resolution=15s
        # Skips verification of kubelet serving certificates; suitable for test clusters only
        - --kubelet-insecure-tls
        image: registry.k8s.io/metrics-server/metrics-server:v0.6.2
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /livez
            port: https
            scheme: HTTPS
          periodSeconds: 10
        name: metrics-server
        ports:
        - containerPort: 4443
          name: https
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readyz
            port: https
            scheme: HTTPS
          initialDelaySeconds: 20
          periodSeconds: 10
        resources:
          requests:
            cpu: 100m
            memory: 200Mi
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 1000
        volumeMounts:
        - mountPath: /tmp
          name: tmp-dir
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-cluster-critical
      serviceAccountName: metrics-server
      volumes:
      - emptyDir: {}
        name: tmp-dir
---
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  labels:
    k8s-app: metrics-server
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: metrics-server
    namespace: kube-system
  version: v1beta1
  versionPriority: 100

Quick Install

bash
# Install with kubectl apply
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify the installation
kubectl get pods -n kube-system -l k8s-app=metrics-server

# Check the Metrics Server Deployment status
kubectl get deployment metrics-server -n kube-system

Resource Monitoring

Node Resource Monitoring

bash
# Show resource usage per node
kubectl top nodes

# Show detailed information for a node
kubectl describe node <node-name>

# Show allocatable resources per node
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
CPU:.status.allocatable.cpu,\
MEMORY:.status.allocatable.memory,\
PODS:.status.allocatable.pods

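Pod Resource Monitoring

bash
# Show resource usage for Pods in the current namespace
kubectl top pods

# Show Pods in a specific namespace
kubectl top pods -n <namespace>

# Show usage for a Pod and each of its containers
kubectl top pod <pod-name> --containers

# Sort by resource usage
kubectl top pods --sort-by=memory
kubectl top pods --sort-by=cpu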

Container Resource Monitoring

bash
# Show usage for each container in a Pod
kubectl top pod <pod-name> --containers

# Show usage for every container in all namespaces
kubectl top pods --all-namespaces --containers

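Performance Metrics

Core Metric Types

1. CPU metrics

  • CPU usage: CPU time actually consumed by the container, expressed in cores
  • CPU limit: the CPU limit configured for the container
  • CPU request: the CPU request configured for the container
  • CPU utilization: CPU usage / CPU limit

2. Memory metrics

  • Memory usage: memory actually used by the container (the Metrics API reports the working set)
  • Memory limit: the memory limit configured for the container
  • Memory request: the memory request configured for the container
  • Memory utilization: memory usage / memory limit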
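
The two utilization ratios above can be sketched in a few lines of Python. This is a minimal illustration, not part of any Kubernetes tooling: the function name `utilization_percent` is hypothetical, and it assumes usage and limit have already been converted to the same unit (millicores for CPU, bytes for memory). A container without a limit has no defined utilization, so the sketch returns `None` in that case.

```python
from typing import Optional


def utilization_percent(usage: float, limit: Optional[float]) -> Optional[float]:
    """Return usage as a percentage of the limit, or None when no limit is set."""
    if limit is None or limit <= 0:
        return None
    return 100.0 * usage / limit


# CPU values in millicores, memory values in bytes (hand-written sample numbers)
cpu_pct = utilization_percent(250, 500)
mem_pct = utilization_percent(192 * 1024**2, 256 * 1024**2)
no_limit = utilization_percent(250, None)
print(cpu_pct, mem_pct, no_limit)
```

Returning `None` rather than 0 keeps "no limit configured" distinguishable from "idle container" in whatever consumes the result.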

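3. Storage metrics

  • Storage usage: persistent volume usage
  • Storage limit: the storage quota limit
  • inode usage: number of inodes in use

4. Network metrics

  • Network bytes received: volume of data received
  • Network bytes transmitted: volume of data sent
  • Network packets received: number of packets received
  • Network packets transmitted: number of packets sent

Note that the Metrics API covers only CPU and memory. Storage and network metrics come from the kubelet endpoints (for example /metrics/cadvisor) and are usually collected by a system such as Prometheus.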

Metric Query Examples

Querying with kubectl

bash
# Query node metrics (CPU and memory usage)
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes | jq .

# Query Pod metrics in the default namespace
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/default/pods | jq .

# Query metrics for a specific Pod
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/default/pods/<pod-name> | jq .

Querying the API

bash
# Get node metrics
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes" | jq '.items[] | {name: .metadata.name, cpu: .usage.cpu, memory: .usage.memory}'

# Get Pod metrics, one entry per container (avoids mismatched CPU/memory pairs in multi-container Pods)
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/kube-system/pods" | jq '.items[] | {name: .metadata.name, containers: [.containers[] | {name, cpu: .usage.cpu, memory: .usage.memory}]}'
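
The same query can be made from a script. The following is a rough Python sketch that shells out to `kubectl get --raw` and flattens the PodMetrics response into one row per container. The function names `fetch_pod_metrics` and `container_rows` are hypothetical, and the sketch assumes `kubectl` is on the PATH with access to the cluster. The sample document at the bottom is hand-written for illustration, not captured output.

```python
import json
import subprocess


def fetch_pod_metrics(namespace: str) -> dict:
    """Call the Metrics API through kubectl and return the parsed JSON."""
    path = f"/apis/metrics.k8s.io/v1beta1/namespaces/{namespace}/pods"
    result = subprocess.run(
        ["kubectl", "get", "--raw", path],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def container_rows(pod_metrics: dict) -> list:
    """Flatten a PodMetrics list into one row per container."""
    rows = []
    for item in pod_metrics.get("items", []):
        for container in item.get("containers", []):
            rows.append({
                "namespace": item["metadata"]["namespace"],
                "pod": item["metadata"]["name"],
                "container": container["name"],
                "cpu": container["usage"]["cpu"],
                "memory": container["usage"]["memory"],
            })
    return rows


# Hand-written sample shaped like a Metrics API response
sample = {"items": [{
    "metadata": {"namespace": "kube-system", "name": "example-pod"},
    "containers": [
        {"name": "app", "usage": {"cpu": "1500000n", "memory": "20480Ki"}},
        {"name": "sidecar", "usage": {"cpu": "200000n", "memory": "4096Ki"}},
    ],
}]}
for row in container_rows(sample):
    print(row)
```

Against a live cluster, `container_rows(fetch_pod_metrics("kube-system"))` would produce the same row shape.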

Practical Examples

Example 1: Resource usage monitoring script

bash
#!/bin/bash
# monitor-resources.sh

echo "=== Node resource usage ==="
kubectl top nodes

echo ""
echo "=== Pod resource usage across namespaces (first 20 lines) ==="
kubectl top pods --all-namespaces | head -20

echo ""
echo "=== Pods with highest CPU usage (Top 10) ==="
kubectl top pods --all-namespaces --sort-by=cpu | head -11

echo ""
echo "=== Pods with highest memory usage (Top 10) ==="
kubectl top pods --all-namespaces --sort-by=memory | head -11

echo ""
echo "=== Resource quota usage ==="
kubectl get resourcequotas --all-namespaces
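
When the ranking needs further processing, the text output of `kubectl top pods --all-namespaces` can be parsed instead of piped through `head`. The sketch below is one possible approach under a stated assumption: it expects four whitespace-separated columns with CPU printed in millicores (`m`) and memory in `Mi`, and skips any line that does not match, including the header. The function names and the sample text are hypothetical.

```python
def parse_top_pods(text: str) -> list:
    """Parse `kubectl top pods --all-namespaces` output into dictionaries."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue
        namespace, pod, cpu, memory = parts
        # Skip the header and any value not in the assumed m / Mi format
        if not (cpu.endswith("m") and memory.endswith("Mi")):
            continue
        if not (cpu[:-1].isdigit() and memory[:-2].isdigit()):
            continue
        rows.append({
            "namespace": namespace,
            "pod": pod,
            "cpu_m": int(cpu[:-1]),
            "mem_mi": int(memory[:-2]),
        })
    return rows


def top_n(rows: list, key: str, n: int = 10) -> list:
    """Return the n rows with the largest value for the given key."""
    return sorted(rows, key=lambda row: row[key], reverse=True)[:n]


# Hand-written sample in the assumed output format
sample_output = """\
NAMESPACE     NAME        CPU(cores)   MEMORY(bytes)
default       web-1       120m         256Mi
kube-system   coredns-a   5m           18Mi
default       batch-2     900m         64Mi
"""
parsed = parse_top_pods(sample_output)
print([row["pod"] for row in top_n(parsed, "cpu_m", n=2)])
```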

Example 2: Resource alert check

yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: resource-alert
  namespace: monitoring
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: alert
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
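              # Flag Pods using more than 800m CPU (an absolute threshold, not a percentage of the limit)
              kubectl top pods --all-namespaces | awk 'NR>1 {if ($3 ~ /^[0-9]+m$/ && $3+0 > 800) print $1"/"$2": CPU usage "$3}'

              # Flag Pods using more than 80Mi of memory (an absolute threshold, not a percentage of the limit)
              kubectl top pods --all-namespaces | awk 'NR>1 {if ($4 ~ /^[0-9]+Mi$/ && $4+0 > 80) print $1"/"$2": Memory usage "$4}'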
          restartPolicy: OnFailure
          serviceAccountName: monitor-sa

Example 3: Exporting custom metrics

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-exporter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: metrics-exporter
  template:
    metadata:
      labels:
        app: metrics-exporter
    spec:
      containers:
      - name: exporter
        image: python:3.9-slim
        command:
        - /bin/sh
        - -c
        - |
          pip install prometheus_client requests
          python <<'EOF'
          import os
          import time

          import prometheus_client as prom
          import requests

          # CA bundle mounted into the Pod with its service account
          CA_CERT = '/var/run/secrets/kubernetes.io/serviceaccount/ca.crt'

          # Define the gauges
          cpu_usage = prom.Gauge('pod_cpu_usage_millicores', 'Pod CPU usage in millicores', ['namespace', 'pod'])
          memory_usage = prom.Gauge('pod_memory_usage_bytes', 'Pod memory usage in bytes', ['namespace', 'pod'])

          def parse_cpu(cpu_str):
              """Convert a CPU quantity (n, u, m, or whole cores) to millicores."""
              if cpu_str.endswith('n'):
                  return int(cpu_str[:-1]) / 1000000
              if cpu_str.endswith('u'):
                  return int(cpu_str[:-1]) / 1000
              if cpu_str.endswith('m'):
                  return int(cpu_str[:-1])
              return float(cpu_str) * 1000

          def parse_memory(mem_str):
              """Convert a memory quantity (Ki, Mi, Gi, Ti, or plain bytes) to bytes."""
              units = {'Ki': 1024, 'Mi': 1024 ** 2, 'Gi': 1024 ** 3, 'Ti': 1024 ** 4}
              if mem_str[-2:] in units:
                  return int(mem_str[:-2]) * units[mem_str[-2:]]
              return int(mem_str)

          def collect_metrics():
              # Call the Metrics API for Pods in all namespaces
              response = requests.get(
                  'https://kubernetes.default.svc/apis/metrics.k8s.io/v1beta1/pods',
                  headers={'Authorization': 'Bearer ' + os.environ['TOKEN']},
                  verify=CA_CERT,
                  timeout=10,
              )
              response.raise_for_status()
              for item in response.json()['items']:
                  ns = item['metadata']['namespace']
                  pod = item['metadata']['name']
                  # Sum over containers so that each Pod reports a single total
                  cpu = sum(parse_cpu(c['usage']['cpu']) for c in item['containers'])
                  memory = sum(parse_memory(c['usage']['memory']) for c in item['containers'])
                  cpu_usage.labels(ns, pod).set(cpu)
                  memory_usage.labels(ns, pod).set(memory)

          if __name__ == '__main__':
              prom.start_http_server(8080)
              while True:
                  try:
                      collect_metrics()
                  except requests.RequestException as exc:
                      print('metrics collection failed:', exc)
                  time.sleep(30)
          EOF
        ports:
        - containerPort: 8080
          name: metrics
        env:
        - name: TOKEN
          valueFrom:
            secretKeyRef:
              name: metrics-token
              key: token
---
apiVersion: v1
kind: Service
metadata:
  name: metrics-exporter
  namespace: monitoring
spec:
  selector:
    app: metrics-exporter
  ports:
  - port: 8080
    targetPort: 8080
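
The exporter above parses only the suffixes it expects to see. For reference, a more general parser for Kubernetes quantity strings might look like the sketch below. It is an illustrative helper, not the official parsing logic: the name `parse_quantity` is hypothetical, and it handles the binary suffixes (Ki through Ei), the decimal suffixes (n through E), and plain or exponent notation by deferring to `Decimal`.

```python
from decimal import Decimal

# Binary (power-of-two) suffixes
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50, "Ei": 2**60}
# Decimal (power-of-ten) suffixes
_DECIMAL = {
    "n": Decimal("1e-9"), "u": Decimal("1e-6"), "m": Decimal("1e-3"),
    "k": Decimal("1e3"), "M": Decimal("1e6"), "G": Decimal("1e9"),
    "T": Decimal("1e12"), "P": Decimal("1e15"), "E": Decimal("1e18"),
}


def parse_quantity(text: str) -> Decimal:
    """Convert a quantity string to base units (cores for CPU, bytes for memory)."""
    text = text.strip()
    if text[-2:] in _BINARY:
        return Decimal(text[:-2]) * _BINARY[text[-2:]]
    if text[-1:] in _DECIMAL:
        return Decimal(text[:-1]) * _DECIMAL[text[-1:]]
    # No suffix: plain number or exponent form such as 129e6
    return Decimal(text)


print(parse_quantity("250m"))             # CPU in cores
print(parse_quantity("12Mi"))             # memory in bytes
print(parse_quantity("156340281n") * 1000)  # nanocores converted to millicores
```

Using `Decimal` avoids floating-point rounding when nanocore values are converted; callers that need a float can wrap the result in `float()`.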

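Troubleshooting

Common Issues

1. Metrics Server fails to start

bash
# Check Pod status
kubectl get pods -n kube-system -l k8s-app=metrics-server

# View Pod logs
kubectl logs -n kube-system -l k8s-app=metrics-server

# Check the APIService status
kubectl get apiservice v1beta1.metrics.k8s.io -o yaml

# List metrics-related Secrets (relevant only with custom serving certificates; by default Metrics Server generates a self-signed certificate in --cert-dir)
kubectl get secret -n kube-system | grep metrics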
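
The APIService check is often the most informative step, because its `Available` condition says whether the aggregation layer can reach Metrics Server. A small sketch of reading that condition from the JSON form of the object (`kubectl get apiservice v1beta1.metrics.k8s.io -o json`) follows. The function name `apiservice_status` and the fallback reason `NoAvailableCondition` are hypothetical, and the sample document is hand-written rather than real cluster output.

```python
def apiservice_status(apiservice: dict) -> tuple:
    """Return (available, reason, message) from an APIService object."""
    conditions = apiservice.get("status", {}).get("conditions", [])
    for condition in conditions:
        if condition.get("type") == "Available":
            return (
                condition.get("status") == "True",
                condition.get("reason", ""),
                condition.get("message", ""),
            )
    # Fallback label invented for this sketch
    return (False, "NoAvailableCondition", "status.conditions has no Available entry")


# Hand-written sample of an unavailable APIService
sample_apiservice = {
    "metadata": {"name": "v1beta1.metrics.k8s.io"},
    "status": {"conditions": [{
        "type": "Available",
        "status": "False",
        "reason": "MissingEndpoints",
        "message": "sample message: the backing service has no ready endpoints",
    }]},
}
available, reason, message = apiservice_status(sample_apiservice)
print(available, reason)
```

A `False` result with a reason such as `MissingEndpoints` or `FailedDiscoveryCheck` points at the Pod or its network path rather than at RBAC.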

2. kubectl top fails

bash
# Check that Metrics Server is running
kubectl get deployment metrics-server -n kube-system

# Check that the Service exists
kubectl get svc metrics-server -n kube-system

# Test API access
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes

# Check RBAC permissions for the current user (add --as=<user> to check another identity)
kubectl auth can-i get nodes.metrics.k8s.io

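3. Metrics are inaccurate

bash
# Check the kubelet metrics endpoint (port 10250 normally requires authentication, so pass a bearer token)
curl -k -H "Authorization: Bearer <token>" https://<node-ip>:10250/metrics

# Check the cAdvisor metrics
curl -k -H "Authorization: Bearer <token>" https://<node-ip>:10250/metrics/cadvisor

# Restart Metrics Server
kubectl rollout restart deployment/metrics-server -n kube-system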

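4. Resource usage shows as 0

bash
# Check the container resource limits
kubectl describe pod <pod-name> | grep -A 5 Limits

# Check the kubelet configuration (run on the node itself)
ps aux | grep kubelet

# Verify that cgroup CPU accounting works (cgroup v1 path; on cgroup v2 read /sys/fs/cgroup/cpu.stat instead)
kubectl exec <pod-name> -- cat /sys/fs/cgroup/cpu/cpuacct.usage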
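
Because the accounting file differs between cgroup versions, a check that works on both can be useful. The sketch below is an assumption-laden illustration: it looks for `cpu.stat` (cgroup v2, `usage_usec` in microseconds) first and falls back to `cpu/cpuacct.usage` (cgroup v1, nanoseconds) under a given root directory. The function name `cpu_usage_seconds` is hypothetical, and the layout inside a particular container image or runtime may differ.

```python
from pathlib import Path
from typing import Optional


def cpu_usage_seconds(cgroup_root: str = "/sys/fs/cgroup") -> Optional[float]:
    """Return cumulative CPU time in seconds, or None if no known file exists."""
    root = Path(cgroup_root)
    v2_file = root / "cpu.stat"
    v1_file = root / "cpu" / "cpuacct.usage"
    if v2_file.exists():
        # cgroup v2: lines of "key value"; usage_usec is in microseconds
        for line in v2_file.read_text().splitlines():
            key, _, value = line.partition(" ")
            if key == "usage_usec":
                return int(value) / 1_000_000
    if v1_file.exists():
        # cgroup v1: a single integer in nanoseconds
        return int(v1_file.read_text().strip()) / 1_000_000_000
    return None


if __name__ == "__main__":
    print(cpu_usage_seconds())
```

A value that stays at 0 or `None` across two readings a few seconds apart suggests the problem lies in cgroup accounting rather than in Metrics Server.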

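Best Practices

1. Monitoring strategy

  • Set a reasonable collection interval (15 to 30 seconds)
  • Configure resource alert thresholds
  • Implement multi-level alerting
  • Review monitoring data regularly

2. Resource planning

  • Set resource requests and limits based on monitoring data
  • Reserve sufficient resources for system components
  • Track resource usage trends
  • Implement autoscaling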
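
For the autoscaling item, the Horizontal Pod Autoscaler derives its target replica count from the ratio of the current metric value to the desired one. The sketch below illustrates that documented formula, `ceil(currentReplicas * currentValue / targetValue)`, with the default 10% tolerance inside which no scaling happens. The function name `desired_replicas` is hypothetical, and the real controller applies further rules (min/max replicas, stabilization windows, unready Pods) that are omitted here.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Apply the basic HPA scaling formula to a single metric."""
    ratio = current_value / target_value
    # Within tolerance of the target: keep the current replica count
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Average CPU utilization of 90% against a 60% target, with 4 replicas
print(desired_replicas(4, 90, 60))
# 62% against a 60% target stays within tolerance
print(desired_replicas(4, 62, 60))
```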

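3. Performance optimization

  • Tune the Metrics Server configuration
  • Adjust data retention in the long-term metrics store (Metrics Server itself keeps only the most recent samples in memory)
  • Aggregate metrics where possible
  • Use efficient queries

4. Security configuration

  • Enable TLS encryption
  • Configure RBAC permissions
  • Restrict API access
  • Audit access to monitoring data

5. High-availability deployment

  • Run multiple Metrics Server replicas
  • Configure Pod anti-affinity
  • Implement automatic failure recovery
  • Back up the monitoring configuration regularly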

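Summary

The Kubernetes monitoring stack is the foundation of operating cloud-native applications. With Metrics Server and the Metrics API, you can monitor the resource usage of the cluster and its applications in near real time, which provides the data needed for capacity planning, performance optimization, and troubleshooting.

Next Steps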