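Metrics Basics

Overview

The Kubernetes monitoring stack is a core part of observability for cloud-native applications. This chapter covers the basic concepts of Kubernetes monitoring, the Metrics API, resource monitoring, and performance metrics.

Kubernetes Monitoring Architecture

Monitoring System Components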

┌──────────────────────────────────────────────────────────┐
│                   Monitoring data flow                   │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  App containers → cAdvisor → kubelet → Metrics Server    │
│                                                          │
│  Metrics Server → Metrics API → kubectl top              │
│                                                          │
│  kubelet / cAdvisor → Prometheus → Grafana               │
│                                                          │
└──────────────────────────────────────────────────────────┘

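Core Components

cAdvisor: built into the kubelet; collects container resource usage data
kubelet: the node agent; aggregates node and Pod metrics
Metrics Server: aggregates resource usage data across the cluster
Metrics API: exposes resource usage metrics through the Kubernetes API (metrics.k8s.io)

Metrics Server Deployment

Installing Metrics Server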

yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    k8s-app: metrics-server
    rbac.authorization.k8s.io/aggregate-to-admin: "true"
    rbac.authorization.k8s.io/aggregate-to-edit: "true"
    rbac.authorization.k8s.io/aggregate-to-view: "true"
  name: system:aggregated-metrics-reader
rules:
- apiGroups:
  - metrics.k8s.io
  resources:
  - pods
  - nodes
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    k8s-app: metrics-server
  name: system:metrics-server
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server-auth-reader
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: extension-apiserver-authentication-reader
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server:system:auth-delegator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:auth-delegator
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    k8s-app: metrics-server
  name: system:metrics-server
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:metrics-server
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
---
apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server
  namespace: kube-system
spec:
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: https
  selector:
    k8s-app: metrics-server
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: metrics-server
  strategy:
    rollingUpdate:
      maxUnavailable: 0
  template:
    metadata:
      labels:
        k8s-app: metrics-server
    spec:
      containers:
      - args:
        - --cert-dir=/tmp
        - --secure-port=4443
        - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
        - --kubelet-use-node-status-port
        - --metric-resolution=15s
        # Skips verification of kubelet serving certificates; use only in test clusters
        - --kubelet-insecure-tls
        image: registry.k8s.io/metrics-server/metrics-server:v0.6.2
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /livez
            port: https
            scheme: HTTPS
          periodSeconds: 10
        name: metrics-server
        ports:
        - containerPort: 4443
          name: https
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readyz
            port: https
            scheme: HTTPS
          initialDelaySeconds: 20
          periodSeconds: 10
        resources:
          requests:
            cpu: 100m
            memory: 200Mi
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 1000
        volumeMounts:
        - mountPath: /tmp
          name: tmp-dir
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-cluster-critical
      serviceAccountName: metrics-server
      volumes:
      - emptyDir: {}
        name: tmp-dir
---
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  labels:
    k8s-app: metrics-server
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: metrics-server
    namespace: kube-system
  version: v1beta1
  versionPriority: 100

Quick Install Command

bash

# Install with kubectl apply
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify the installation
kubectl get pods -n kube-system -l k8s-app=metrics-server

# Check the Metrics Server Deployment status
kubectl get deployment metrics-server -n kube-system

Resource Monitoring

Node Resource Monitoring

bash

# Show resource usage per node
kubectl top nodes

# Show detailed information for one node
kubectl describe node <node-name>

# Show allocatable resources per node
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
CPU:.status.allocatable.cpu,\
MEMORY:.status.allocatable.memory,\
PODS:.status.allocatable.pods

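Pod Resource Monitoring

bash

# Show resource usage for Pods in the current namespace
kubectl top pods

# Show Pods in a specific namespace
kubectl top pods -n <namespace>

# Show a Pod together with its containers
kubectl top pod <pod-name> --containers

# Sort by resource usage
kubectl top pods --sort-by=memory
kubectl top pods --sort-by=cpu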

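Container Resource Monitoring

bash

# Show resource usage for each container in a Pod
kubectl top pod <pod-name> --containers

# Show resource usage for all containers in all namespaces
kubectl top pods --all-namespaces --containers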

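Performance Metrics

Core Metric Types

1. CPU Metrics

CPU usage: CPU actually consumed by the container, in cores (or millicores)
CPU limit: the CPU limit configured for the container
CPU request: the CPU request configured for the container
CPU utilization: CPU usage / CPU limit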
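
The ratio above only makes sense once both values are in the same unit. The following is a minimal sketch, assuming usage and limit arrive as Kubernetes quantity strings such as `250m`, `1200000n`, or `2`; the function names `parse_cpu_millicores` and `cpu_utilization` are illustrative and not part of any Kubernetes library.

```python
def parse_cpu_millicores(quantity):
    """Convert a Kubernetes CPU quantity string to millicores."""
    # Divisors relative to one millicore: nanocores, microcores, millicores.
    divisors = {"n": 1_000_000, "u": 1_000, "m": 1}
    if quantity and quantity[-1] in divisors:
        return float(quantity[:-1]) / divisors[quantity[-1]]
    # A plain number means whole cores.
    return float(quantity) * 1000


def cpu_utilization(usage, limit):
    """Return usage / limit as a fraction (0.25 means 25% of the limit)."""
    limit_m = parse_cpu_millicores(limit)
    if limit_m <= 0:
        raise ValueError("CPU limit must be positive")
    return parse_cpu_millicores(usage) / limit_m


print(cpu_utilization("250m", "1"))  # 250 millicores against a 1-core limit
```

A container with no CPU limit has no meaningful utilization by this definition, which is why the sketch rejects a non-positive limit rather than returning a number.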

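2. Memory Metrics

Memory usage: memory actually consumed by the container
Memory limit: the memory limit configured for the container
Memory request: the memory request configured for the container
Memory utilization: memory usage / memory limit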
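
Memory quantities mix binary suffixes (`Ki`, `Mi`, `Gi`) and decimal ones (`k`, `M`, `G`), so they need to be normalized to bytes before dividing. This is a small sketch under that assumption; `parse_memory_bytes` and `memory_utilization` are hypothetical helper names, and only the common suffixes are handled.

```python
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
DECIMAL_SUFFIXES = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_memory_bytes(quantity):
    """Convert a Kubernetes memory quantity string to bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[:-len(suffix)]) * factor)
    for suffix, factor in DECIMAL_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[:-len(suffix)]) * factor)
    # No suffix: the value is already in bytes.
    return int(quantity)


def memory_utilization(usage, limit):
    """Return usage / limit as a fraction of the configured memory limit."""
    limit_bytes = parse_memory_bytes(limit)
    if limit_bytes <= 0:
        raise ValueError("memory limit must be positive")
    return parse_memory_bytes(usage) / limit_bytes


print(memory_utilization("128Mi", "512Mi"))
```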

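3. Storage Metrics

Storage usage: space used on persistent volumes
Storage limit: the storage quota limit
inode usage: inodes in use

4. Network Metrics

Bytes received: amount of data received
Bytes sent: amount of data sent
Packets received: number of packets received
Packets sent: number of packets sent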
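
Byte and packet counts are usually exposed as cumulative counters (cAdvisor does this, for example), so a throughput figure has to be derived from two samples. The sketch below assumes that counter model; `counter_rate` is an illustrative name, and treating a decrease as a counter reset is a common convention rather than something this chapter specifies.

```python
def counter_rate(prev_value, prev_ts, curr_value, curr_ts):
    """Per-second rate between two samples of a cumulative counter.

    Timestamps are in seconds. A counter that went backwards is treated as
    having been reset (for example after a container restart), so the new
    value is counted from zero.
    """
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_value - prev_value
    if delta < 0:
        delta = curr_value
    return delta / elapsed


# 3000 bytes received over 30 seconds
print(counter_rate(1000, 0.0, 4000, 30.0))
```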

Metric Query Examples

Querying with kubectl

bash

# Query node metrics (CPU and memory usage)
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes | jq .

# Query Pod metrics in the default namespace
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/default/pods | jq .

# Query metrics for a specific Pod
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/default/pods/<pod-name> | jq .

Querying the API

bash

# Get node metrics
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes" | jq '.items[] | {name: .metadata.name, cpu: .usage.cpu, memory: .usage.memory}'

# Get Pod metrics, one entry per container (avoids the cross product that two separate .containers[] iterations would produce)
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/kube-system/pods" | jq '.items[] | {name: .metadata.name, containers: [.containers[] | {name, cpu: .usage.cpu, memory: .usage.memory}]}'
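
The same flattening can be done in Python once the response body is in hand. This sketch parses a hand-written sample shaped like a `PodMetricsList`; the Pod and container names in `SAMPLE_RESPONSE` are made up, and `container_rows` is an illustrative helper, not an API of any client library.

```python
import json

# Hypothetical response body, shaped like a PodMetricsList from metrics.k8s.io/v1beta1.
SAMPLE_RESPONSE = """
{
  "kind": "PodMetricsList",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "items": [
    {
      "metadata": {"name": "web-0", "namespace": "default"},
      "timestamp": "2024-01-01T00:00:00Z",
      "window": "15s",
      "containers": [
        {"name": "app", "usage": {"cpu": "250m", "memory": "128Mi"}},
        {"name": "sidecar", "usage": {"cpu": "12m", "memory": "32Mi"}}
      ]
    }
  ]
}
"""


def container_rows(payload):
    """Flatten a PodMetricsList JSON document into one row per container."""
    rows = []
    for item in json.loads(payload).get("items", []):
        meta = item["metadata"]
        for container in item.get("containers", []):
            rows.append({
                "namespace": meta["namespace"],
                "pod": meta["name"],
                "container": container["name"],
                "cpu": container["usage"]["cpu"],
                "memory": container["usage"]["memory"],
            })
    return rows


for row in container_rows(SAMPLE_RESPONSE):
    print(row)
```

Emitting one row per container keeps the CPU and memory values of each container paired.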

Practical Examples

Example 1: Resource Usage Monitoring Script

bash

#!/bin/bash
# monitor-resources.sh

echo "=== Node resource usage ==="
kubectl top nodes

echo ""
echo "=== Pod resource usage across namespaces ==="
kubectl top pods --all-namespaces | head -20

echo ""
echo "=== Highest CPU usage Pods (Top 10) ==="
kubectl top pods --all-namespaces --sort-by=cpu | head -11

echo ""
echo "=== Highest memory usage Pods (Top 10) ==="
kubectl top pods --all-namespaces --sort-by=memory | head -11

echo ""
echo "=== Resource quota usage ==="
kubectl get resourcequotas --all-namespaces
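
When the ranking needs to feed another tool rather than a terminal, the text output can be parsed instead. The sketch below assumes output in the shape produced by `kubectl top pods --all-namespaces --no-headers` (namespace, name, CPU, memory); the sample rows in `SAMPLE_OUTPUT` are invented and `top_by_cpu` is an illustrative name.

```python
# Hypothetical output of: kubectl top pods --all-namespaces --no-headers
SAMPLE_OUTPUT = """\
kube-system   coredns-5d78c9869d-abcde   3m      14Mi
default       web-7c9f8b6d5-xk2lp        250m    180Mi
default       batch-worker-0             1200m   512Mi
"""


def cpu_to_millicores(value):
    """Parse the CPU column; a value without 'm' is read as whole cores."""
    if value.endswith("m"):
        return int(value[:-1])
    return int(float(value) * 1000)


def top_by_cpu(text, n=10):
    """Return the n rows with the highest CPU usage, highest first."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        namespace, pod, cpu, memory = fields[:4]
        rows.append((namespace, pod, cpu_to_millicores(cpu), memory))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:n]


for namespace, pod, cpu_m, memory in top_by_cpu(SAMPLE_OUTPUT, n=2):
    print(f"{namespace}/{pod}: {cpu_m}m CPU, {memory} memory")
```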

Example 2: Resource Alert Check

yaml

apiVersion: batch/v1
kind: CronJob
metadata:
  name: resource-alert
  namespace: monitoring
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: alert
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
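              # Flag Pods using more than 800m CPU. kubectl top reports absolute usage, not a percentage of the limit.
              kubectl top pods --all-namespaces | awk 'NR>1 {if ($3 ~ /^[0-9]+m$/ && $3+0 > 800) print $1"/"$2": CPU usage "$3}'
              
              # Flag Pods using more than 80Mi of memory (again an absolute value).
              kubectl top pods --all-namespaces | awk 'NR>1 {if ($4 ~ /^[0-9]+Mi$/ && $4+0 > 80) print $1"/"$2": Memory usage "$4}'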
          restartPolicy: OnFailure
          serviceAccountName: monitor-sa
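
A check against 80% of the configured limit needs the limit as well as the usage, which `kubectl top` does not print. The following sketch assumes both values have already been collected and converted to millicores; the `PodUsage` type, the `over_threshold` function, and the sample Pods are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PodUsage:
    namespace: str
    name: str
    cpu_millicores: float
    cpu_limit_millicores: float  # 0 means no CPU limit is set


def over_threshold(pods, threshold=0.8):
    """Return an alert line for every Pod above the given fraction of its CPU limit."""
    alerts = []
    for pod in pods:
        if pod.cpu_limit_millicores <= 0:
            continue  # utilization is undefined without a limit
        ratio = pod.cpu_millicores / pod.cpu_limit_millicores
        if ratio > threshold:
            alerts.append(f"{pod.namespace}/{pod.name}: CPU at {ratio:.0%} of limit")
    return alerts


pods = [
    PodUsage("default", "web", 900, 1000),
    PodUsage("default", "api", 100, 1000),
    PodUsage("default", "unbounded", 500, 0),
]
for line in over_threshold(pods):
    print(line)
```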

Example 3: Exporting Custom Metrics

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-exporter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: metrics-exporter
  template:
    metadata:
      labels:
        app: metrics-exporter
    spec:
      containers:
      - name: exporter
        image: python:3.9-slim
        command:
        - /bin/sh
        - -c
        - |
          pip install prometheus_client requests
          python <<'EOF'
          import os
          import time

          import prometheus_client as prom
          import requests

          # Cluster CA bundle, mounted into Pods by default with the service account
          CA_CERT = '/var/run/secrets/kubernetes.io/serviceaccount/ca.crt'
          API_URL = 'https://kubernetes.default.svc/apis/metrics.k8s.io/v1beta1/pods'

          # Define the exported gauges
          cpu_usage = prom.Gauge('pod_cpu_usage_millicores', 'Pod CPU usage in millicores', ['namespace', 'pod'])
          memory_usage = prom.Gauge('pod_memory_usage_bytes', 'Pod memory usage in bytes', ['namespace', 'pod'])

          def collect_metrics():
              # Call the Metrics API with the bearer token from the TOKEN environment variable
              response = requests.get(
                  API_URL,
                  headers={'Authorization': 'Bearer ' + os.environ['TOKEN']},
                  verify=CA_CERT,
                  timeout=10,
              )
              if response.status_code != 200:
                  return
              for item in response.json()['items']:
                  ns = item['metadata']['namespace']
                  pod = item['metadata']['name']
                  # Sum over containers so that each Pod reports one value
                  cpu = sum(parse_cpu(c['usage']['cpu']) for c in item['containers'])
                  memory = sum(parse_memory(c['usage']['memory']) for c in item['containers'])
                  cpu_usage.labels(ns, pod).set(cpu)
                  memory_usage.labels(ns, pod).set(memory)

          def parse_cpu(cpu_str):
              # Convert a CPU quantity (nanocores, microcores, millicores, or cores) to millicores
              if cpu_str.endswith('n'):
                  return int(cpu_str[:-1]) / 1000000
              elif cpu_str.endswith('u'):
                  return int(cpu_str[:-1]) / 1000
              elif cpu_str.endswith('m'):
                  return int(cpu_str[:-1])
              else:
                  return float(cpu_str) * 1000

          def parse_memory(mem_str):
              # Convert a memory quantity with a binary suffix to bytes
              if mem_str.endswith('Ki'):
                  return int(mem_str[:-2]) * 1024
              elif mem_str.endswith('Mi'):
                  return int(mem_str[:-2]) * 1024 * 1024
              elif mem_str.endswith('Gi'):
                  return int(mem_str[:-2]) * 1024 * 1024 * 1024
              else:
                  return int(mem_str)

          if __name__ == '__main__':
              prom.start_http_server(8080)
              while True:
                  collect_metrics()
                  time.sleep(30)
          EOF
        ports:
        - containerPort: 8080
          name: metrics
        env:
        - name: TOKEN
          valueFrom:
            secretKeyRef:
              name: metrics-token
              key: token
---
apiVersion: v1
kind: Service
metadata:
  name: metrics-exporter
  namespace: monitoring
spec:
  selector:
    app: metrics-exporter
  ports:
  - port: 8080
    targetPort: 8080

Troubleshooting

Common Issues

1. Metrics Server fails to start

bash

# Check the Pod status
kubectl get pods -n kube-system -l k8s-app=metrics-server

# View the Pod logs
kubectl logs -n kube-system -l k8s-app=metrics-server

# Check the APIService status
kubectl get apiservice v1beta1.metrics.k8s.io -o yaml

# Look for Metrics Server TLS Secrets, if any were configured
kubectl get secret -n kube-system | grep metrics

2. The kubectl top command fails

bash

# Check whether Metrics Server is running
kubectl get deployment metrics-server -n kube-system

# Check that the Service exists
kubectl get svc metrics-server -n kube-system

# Test access to the API
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes

# Check RBAC permissions (replace the --as value with the user or service account to test)
kubectl auth can-i get nodes.metrics.k8s.io --as=system:anonymous

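3. Inaccurate metric data

bash

# Check the kubelet metrics endpoint (the kubelet usually requires authentication, so an unauthenticated request may return 401)
curl -k https://<node-ip>:10250/metrics

# Check the cAdvisor metrics
curl -k https://<node-ip>:10250/metrics/cadvisor

# Restart Metrics Server
kubectl rollout restart deployment/metrics-server -n kube-system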

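4. Resource usage shows as 0

bash

# Check the container resource limits
kubectl describe pod <pod-name> | grep -A 5 Limits

# Check the kubelet configuration (run on the node)
ps aux | grep kubelet

# Verify that cgroup CPU accounting is readable (cgroup v1 path; on cgroup v2 read /sys/fs/cgroup/cpu.stat instead)
kubectl exec <pod-name> -- cat /sys/fs/cgroup/cpu/cpuacct.usage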

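Best Practices

1. Monitoring Strategy

Set a reasonable monitoring interval (15-30 seconds)
Configure resource alert thresholds
Implement multi-level alerting
Review monitoring data regularly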
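
One way to combine thresholds with multiple alert levels is to require that a level be sustained over several consecutive samples before it fires. This is only a sketch: the 70% and 90% thresholds, the three-sample window, and the `alert_level` name are assumptions chosen for illustration.

```python
def alert_level(samples, warning=0.7, critical=0.9, sustained=3):
    """Classify recent utilization samples as 'ok', 'warning', or 'critical'.

    samples holds utilization fractions, oldest first. A level fires only if
    every one of the last `sustained` samples reaches its threshold.
    """
    recent = samples[-sustained:]
    if len(recent) < sustained:
        return "ok"  # not enough data to call anything sustained
    lowest = min(recent)
    if lowest >= critical:
        return "critical"
    if lowest >= warning:
        return "warning"
    return "ok"


print(alert_level([0.50, 0.95, 0.92, 0.91]))
```

With a 15-30 second interval, a three-sample window means a spike has to last roughly a minute before it raises an alert, which filters out short bursts.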

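2. Resource Planning

Set resource requests and limits based on monitoring data
Reserve enough resources for the system
Monitor resource usage trends
Implement autoscaling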
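
As one possible way to turn monitoring data into a request value, the sketch below takes a high percentile of observed usage and adds headroom. The nearest-rank percentile, the 90th percentile default, the 20% headroom, and the names `percentile` and `suggest_request` are all assumptions for illustration, not a recommendation from Kubernetes.

```python
def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, to avoid floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def suggest_request(samples_millicores, pct=90, headroom_pct=20):
    """Suggest a CPU request in millicores: the chosen percentile plus headroom, rounded up."""
    base = percentile(samples_millicores, pct)
    return (base * (100 + headroom_pct) + 99) // 100


# Hypothetical CPU usage samples in millicores, including one spike.
samples = [100, 120, 110, 300, 130, 125, 115, 105, 140, 135]
print(suggest_request(samples))
```

Using a percentile instead of the maximum keeps a single spike, such as the 300m sample here, from inflating the request.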

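3. Performance Optimization

Tune the Metrics Server configuration
Adjust data retention time
Aggregate metrics
Use efficient queries

4. Security Configuration

Enable TLS encryption
Configure RBAC permissions
Restrict API access
Audit access to monitoring data

5. High-Availability Deployment

Run multiple Metrics Server replicas
Configure Pod anti-affinity
Implement automatic failure recovery
Back up monitoring configuration regularly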

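Summary

The Kubernetes monitoring stack is the foundation for operating cloud-native applications. With Metrics Server and the Metrics API, you can monitor the resource usage of the cluster and its applications in near real time, providing the data needed for capacity planning, performance optimization, and troubleshooting.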
