Grafana

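Overview

Grafana is an open-source data visualization and monitoring platform. It supports many data sources, offers a wide range of visualization options, and provides flexible dashboard design. It integrates with monitoring systems such as Prometheus and InfluxDB, and it is one of the most widely used visualization tools in Kubernetes monitoring stacks.

Grafana Architecture

Core Components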

┌─────────────────────────────────────────────────────────┐
│                    Grafana Server                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │ Web UI       │  │ Dashboard    │  │ Alerting     │  │
│  │              │  │ Engine       │  │ Engine       │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
└─────────────────────────────────────────────────────────┘
         ↓                    ↓                    ↓
┌─────────────┐    ┌──────────────┐    ┌──────────────────┐
│ Data Source │    │ Plugin       │    │ Notification     │
│ (Prometheus)│    │ System       │    │ Channels         │
└─────────────┘    └──────────────┘    └──────────────────┘
         ↓                    ↓                    ↓
┌─────────────┐    ┌──────────────┐    ┌──────────────────┐
│ Database    │    │ User         │    │ API              │
│ (SQLite/PG) │    │ Management   │    │ Endpoints        │
└─────────────┘    └──────────────┘    └──────────────────┘

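Core Concepts

1. Dashboard

A dashboard is Grafana's central concept: a collection of panels that together present a visual view of monitoring data.

2. Panel

A panel is the basic building block of a dashboard. Each panel displays one or more metrics and can use any of the supported visualization types.

3. Data Source

A data source is where Grafana reads data from. Supported sources include Prometheus, InfluxDB, MySQL, and many others.

4. Organization

An organization is Grafana's unit of multi-tenant isolation. Each organization has its own dashboards and data source configuration.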
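
The relationship between these four concepts can be sketched as a small object model. This is only an illustration: the class and field names below are hypothetical and do not mirror Grafana's internal types. It shows that panels refer to data sources by name and that both live inside an organization, so a panel pointing at a data source the organization does not define is a configuration error.

```python
from dataclasses import dataclass, field


@dataclass
class DataSource:
    name: str
    type: str  # e.g. "prometheus", "loki", "mysql"
    url: str


@dataclass
class Panel:
    title: str
    type: str        # visualization type, e.g. "stat" or "table"
    datasource: str  # name of the data source this panel queries
    queries: list[str] = field(default_factory=list)


@dataclass
class Dashboard:
    title: str
    panels: list[Panel] = field(default_factory=list)


@dataclass
class Organization:
    name: str
    datasources: dict[str, DataSource] = field(default_factory=dict)
    dashboards: list[Dashboard] = field(default_factory=list)

    def dangling_panels(self) -> list[str]:
        """Return titles of panels whose data source is not defined in this org."""
        return [
            panel.title
            for dashboard in self.dashboards
            for panel in dashboard.panels
            if panel.datasource not in self.datasources
        ]


org = Organization(name="Main Org.")
org.datasources["Prometheus"] = DataSource("Prometheus", "prometheus", "http://prometheus:9090")
org.dashboards.append(
    Dashboard(
        title="Kubernetes Cluster Monitoring",
        panels=[
            Panel("Total Pods", "stat", "Prometheus", ["count(kube_pod_info)"]),
            Panel("Logs", "table", "Loki", ['{namespace="monitoring"}']),
        ],
    )
)
print(org.dangling_panels())  # the "Logs" panel references an undefined data source
```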

Deploying Grafana

Option 1: Deploy with a Deployment

1. Create the ConfigMap

yaml
# grafana-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-config
  namespace: monitoring
data:
  grafana.ini: |
    [server]
    root_url = http://localhost:3000
    serve_from_sub_path = false
    
    [database]
    type = sqlite3
    
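    [security]
    # Placeholder values. In production, supply these through the
    # GF_SECURITY_* environment variables backed by a Secret.
    admin_user = admin
    admin_password = admin123
    secret_key = SW2YcwTIb9zpOOhoPsMm
    
    [auth.anonymous]
    # Anonymous read-only access. Disable it on anything reachable
    # from outside a trusted network.
    enabled = true
    org_name = Main Org.
    org_role = Viewer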
    
    [auth.basic]
    enabled = true
    
    [dashboards]
    default_home_dashboard_path = /var/lib/grafana/dashboards/home.json
    
    [users]
    default_theme = dark
    allow_sign_up = false
    allow_org_create = false
    auto_assign_org = true
    auto_assign_org_role = Viewer
    
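    # Grafana 10 uses unified alerting (Grafana Alerting). Explicitly enabling
    # the legacy [alerting] section would switch unified alerting off.
    [unified_alerting]
    enabled = true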
    
    [plugins]
    allow_loading_unsigned_plugins = prometheus

2. Create the data source configuration

yaml
# grafana-datasources.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
  labels:
    grafana_datasource: "1"
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
    - name: Prometheus
      type: prometheus
      access: proxy
      url: http://prometheus:9090
      isDefault: true
      editable: true
      jsonData:
        timeInterval: "15s"
        httpMethod: POST
        manageAlerts: true
        prometheusType: Prometheus
    - name: Loki
      type: loki
      access: proxy
      url: http://loki:3100
      editable: true
    - name: Alertmanager
      type: alertmanager
      access: proxy
      url: http://alertmanager:9093
      jsonData:
        implementation: prometheus

3. Create the dashboard configuration

yaml
# grafana-dashboards.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
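data:
  # Provider definition: tells Grafana to load dashboard JSON files
  # from this directory. The provider name is arbitrary.
  dashboards.yaml: |
    apiVersion: 1
    providers:
    - name: default
      type: file
      options:
        path: /etc/grafana/provisioning/dashboards
  # File provisioning expects the bare dashboard model, not the
  # {"dashboard": {...}} envelope used by the HTTP API.
  kubernetes-cluster.json: |
    {
      "id": null,
      "title": "Kubernetes Cluster Monitoring",
      "tags": ["kubernetes", "cluster"],
      "timezone": "browser",
      "panels": [
        {
          "id": 1,
          "title": "Cluster CPU Usage",
          "type": "graph",
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 0
          },
          "targets": [
            {
              "expr": "sum(rate(container_cpu_usage_seconds_total{container!=\"\"}[5m])) by (node)",
              "legendFormat": "{{node}}",
              "refId": "A"
            }
          ]
        },
        {
          "id": 2,
          "title": "Cluster Memory Usage",
          "type": "graph",
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 0
          },
          "targets": [
            {
              "expr": "sum(container_memory_working_set_bytes{container!=\"\"}) by (node)",
              "legendFormat": "{{node}}",
              "refId": "A"
            }
          ]
        }
      ]
    }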

4. Deploy Grafana

yaml
# grafana-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:10.0.0
        ports:
        - containerPort: 3000
          name: web
        env:
        - name: GF_SECURITY_ADMIN_USER
          valueFrom:
            secretKeyRef:
              name: grafana-credentials
              key: admin-user
        - name: GF_SECURITY_ADMIN_PASSWORD
          valueFrom:
            secretKeyRef:
              name: grafana-credentials
              key: admin-password
        - name: GF_INSTALL_PLUGINS
          value: "grafana-clock-panel,grafana-piechart-panel"
        volumeMounts:
        - name: config
          mountPath: /etc/grafana
        - name: storage
          mountPath: /var/lib/grafana
        - name: datasources
          mountPath: /etc/grafana/provisioning/datasources
        - name: dashboards
          mountPath: /etc/grafana/provisioning/dashboards
        resources:
          requests:
            cpu: 100m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
        livenessProbe:
          httpGet:
            path: /api/health
            port: web
          initialDelaySeconds: 30
          timeoutSeconds: 10
        readinessProbe:
          httpGet:
            path: /api/health
            port: web
          initialDelaySeconds: 5
          timeoutSeconds: 10
      volumes:
      - name: config
        configMap:
          name: grafana-config
      - name: storage
        persistentVolumeClaim:
          claimName: grafana-pvc
      - name: datasources
        configMap:
          name: grafana-datasources
      - name: dashboards
        configMap:
          name: grafana-dashboards
---
apiVersion: v1
kind: Secret
metadata:
  name: grafana-credentials
  namespace: monitoring
type: Opaque
stringData:
  admin-user: admin
  admin-password: admin123
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-pvc
  namespace: monitoring
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  storageClassName: standard
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitoring
spec:
  type: NodePort
  ports:
  - port: 3000
    targetPort: web
    nodePort: 30300
    name: web
  selector:
    app: grafana
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
  # No rewrite-target annotation: Grafana is served from "/", and rewriting
  # every matched path to "/" would break asset and API requests.
spec:
  rules:
  - host: grafana.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: grafana
            port:
              number: 3000

Deployment commands

bash
kubectl apply -f grafana-config.yaml
kubectl apply -f grafana-datasources.yaml
kubectl apply -f grafana-dashboards.yaml
kubectl apply -f grafana-deployment.yaml

kubectl get pods -n monitoring -l app=grafana
kubectl get svc -n monitoring grafana

kubectl port-forward -n monitoring svc/grafana 3000:3000
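
Once the port-forward is running, the `/api/health` endpoint (the same path the probes in the Deployment use) returns a small JSON document. The sketch below only interprets such a response body. The sample payload is illustrative rather than captured from a real instance, and `grafana_is_healthy` is a hypothetical helper name.

```python
import json


def grafana_is_healthy(body: str) -> bool:
    """Return True if an /api/health response body reports a working database."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # A proxy error page or an empty reply is not a healthy response.
        return False
    return isinstance(payload, dict) and payload.get("database") == "ok"


# Illustrative payload shape; field values are made up for the example.
sample = '{"commit": "abc123", "database": "ok", "version": "10.0.0"}'
print(grafana_is_healthy(sample))
print(grafana_is_healthy("upstream connect error"))
```

In practice the body would come from an HTTP GET against `http://localhost:3000/api/health` while the port-forward is active.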

Option 2: Deploy with Helm

bash
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

helm install grafana grafana/grafana \
  --namespace monitoring \
  --set persistence.enabled=true \
  --set persistence.size=5Gi \
  --set adminPassword=admin123 \
  --set service.type=NodePort \
  --set service.nodePort=30300

Data Source Configuration

Prometheus data source

Manual configuration

yaml
apiVersion: 1
datasources:
- name: Prometheus
  type: prometheus
  access: proxy
  url: http://prometheus:9090
  isDefault: true
  editable: true
  jsonData:
    timeInterval: "15s"
    httpMethod: POST
    manageAlerts: true
    prometheusType: Prometheus
    prometheusVersion: "2.45.0"
    cacheLevel: 'High'
    incrementalQuerying: true
    incrementalQueryOverlapWindow: 10m
    disableRecordingRules: false

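Configuration notes

  • access: access mode; proxy means the Grafana server makes the requests on the browser's behalf
  • url: address of the Prometheus service
  • isDefault: whether this is the default data source
  • editable: whether the data source can be edited in the UI
  • timeInterval: the Prometheus scrape interval, which Grafana uses as the minimum query step
  • httpMethod: HTTP method for queries; POST avoids URL length limits on large queries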
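
To make the role of `timeInterval` concrete, here is a simplified sketch of how a query step could be derived from the dashboard time range. It assumes the step is the time range divided by the number of points a panel can draw, floored at the scrape interval. Grafana additionally rounds the result to a "nice" duration, which this sketch leaves out, and the function name is hypothetical.

```python
def query_step_seconds(range_seconds: float, max_data_points: int, min_interval_seconds: float) -> float:
    """Spread the range over the available points, but never go below the scrape interval."""
    if max_data_points <= 0:
        raise ValueError("max_data_points must be positive")
    return max(range_seconds / max_data_points, min_interval_seconds)


# Last hour on a panel that can draw 1000 points: the 15s floor wins.
print(query_step_seconds(3600, 1000, 15))
# Last 7 days on the same panel: the range dominates.
print(query_step_seconds(7 * 24 * 3600, 1000, 15))
```

Setting `timeInterval` lower than the real scrape interval would let short ranges request steps that contain no new samples.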

Loki data source

yaml
apiVersion: 1
datasources:
- name: Loki
  type: loki
  access: proxy
  url: http://loki:3100
  editable: true
  jsonData:
    maxLines: 1000
    derivedFields:
    - name: TraceID
      matcherRegex: '"traceId":"(\w+)"'
      url: '$${__value.raw}'
      datasourceUid: tempo

Elasticsearch data source

yaml
apiVersion: 1
datasources:
- name: Elasticsearch
  type: elasticsearch
  access: proxy
  url: http://elasticsearch:9200
  database: "[logstash-]YYYY.MM.DD"  # daily index pattern, required by interval: Daily
  jsonData:
    esVersion: "7.10.0"
    timeField: "@timestamp"
    interval: Daily
    logMessageField: message
    logLevelField: log.level

MySQL data source

yaml
apiVersion: 1
datasources:
- name: MySQL
  type: mysql
  access: proxy
  url: mysql:3306
  database: monitoring
  user: grafana
  jsonData:
    maxOpenConns: 10
    maxIdleConns: 5
    connMaxLifetime: 14400
  secureJsonData:
    password: password123

Dashboard Design

Dashboard structure

json
{
  "dashboard": {
    "id": null,
    "uid": "kubernetes-cluster",
    "title": "Kubernetes Cluster Monitoring",
    "description": "Monitor Kubernetes cluster resources",
    "tags": ["kubernetes", "cluster"],
    "timezone": "browser",
    "refresh": "30s",
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "panels": [],
    "templating": {
      "list": []
    }
  }
}
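
The structure above uses the envelope accepted by the HTTP API, where the model sits under a `dashboard` key. Files loaded through file provisioning are expected to contain the bare model instead. A minimal sketch of converting between the two, assuming the input is already parsed JSON (`to_provisioning_model` is a hypothetical helper, not a Grafana function):

```python
import json


def to_provisioning_model(document: dict) -> dict:
    """Return the bare dashboard model, unwrapping the API envelope if present."""
    model = document.get("dashboard", document)
    if not isinstance(model, dict) or not model.get("title"):
        raise ValueError("dashboard model must be an object with a non-empty title")
    # Clear the numeric id so the target instance assigns its own; the uid stays stable.
    return {**model, "id": None}


api_payload = {
    "dashboard": {
        "id": 42,
        "uid": "kubernetes-cluster",
        "title": "Kubernetes Cluster Monitoring",
        "panels": [],
    }
}
print(json.dumps(to_provisioning_model(api_payload), indent=2))
```

Passing an already bare model through the function leaves it unchanged apart from the cleared `id`.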

Panel types

1. Graph panel (line chart)

json
{
  "id": 1,
  "title": "CPU Usage",
  "type": "graph",
  "gridPos": {
    "h": 8,
    "w": 12,
    "x": 0,
    "y": 0
  },
  "targets": [
    {
      "expr": "sum(rate(container_cpu_usage_seconds_total{container!=\"\"}[5m])) by (pod)",
      "legendFormat": "{{pod}}",
      "refId": "A"
    }
  ],
  "yaxes": [
    {
      "format": "short",
      "label": "CPU",
      "logBase": 1,
      "show": true
    }
  ],
  "legend": {
    "alignAsTable": true,
    "avg": true,
    "current": true,
    "max": true,
    "show": true,
    "total": false,
    "values": true
  }
}

2. Stat panel (single value)

json
{
  "id": 2,
  "title": "Total Pods",
  "type": "stat",
  "gridPos": {
    "h": 4,
    "w": 4,
    "x": 0,
    "y": 0
  },
  "targets": [
    {
      "expr": "count(kube_pod_info)",
      "refId": "A"
    }
  ],
  "options": {
    "colorMode": "value",
    "graphMode": "area",
    "reduceOptions": {
      "calcs": ["lastNotNull"],
      "fields": "",
      "values": false
    }
  }
}

3. Table panel

json
{
  "id": 3,
  "title": "Pod List",
  "type": "table",
  "gridPos": {
    "h": 8,
    "w": 12,
    "x": 0,
    "y": 0
  },
  "targets": [
    {
      "expr": "kube_pod_info",
      "format": "table",
      "instant": true,
      "refId": "A"
    }
  ],
  "transformations": [
    {
      "id": "organize",
      "options": {
        "excludeByName": {
          "Time": true,
          "__name__": true
        },
        "indexByName": {},
        "renameByName": {
          "pod": "Pod Name",
          "namespace": "Namespace",
          "node": "Node"
        }
      }
    }
  ]
}

4. Pie chart panel

json
{
  "id": 4,
  "title": "Resource Distribution",
  "type": "piechart",
  "gridPos": {
    "h": 8,
    "w": 6,
    "x": 0,
    "y": 0
  },
  "targets": [
    {
      "expr": "sum(kube_pod_container_resource_requests{resource=\"cpu\"}) by (namespace)",
      "legendFormat": "{{namespace}}",
      "refId": "A"
    }
  ],
  "options": {
    "legend": {
      "displayMode": "table",
      "placement": "right",
      "values": ["value", "percentage"]
    },
    "pieType": "pie",
    "displayLabels": ["percent"]
  }
}
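
The panel snippets above all start at `x: 0, y: 0`, so they need distinct positions before being combined into one dashboard. Grafana lays panels out on a grid that is 24 columns wide. The sketch below checks a list of panels for positions that overflow the grid or overlap each other. `layout_problems` is a hypothetical helper, and the sample panels reuse titles from this section with positions chosen for the example.

```python
from itertools import combinations

GRID_COLUMNS = 24  # width of Grafana's dashboard grid


def overlaps(a: dict, b: dict) -> bool:
    """True if two gridPos rectangles share any area."""
    return (
        a["x"] < b["x"] + b["w"] and b["x"] < a["x"] + a["w"]
        and a["y"] < b["y"] + b["h"] and b["y"] < a["y"] + a["h"]
    )


def layout_problems(panels: list[dict]) -> list[str]:
    problems = []
    for panel in panels:
        pos = panel["gridPos"]
        if pos["x"] + pos["w"] > GRID_COLUMNS:
            problems.append(f"{panel['title']}: exceeds {GRID_COLUMNS} columns")
    for first, second in combinations(panels, 2):
        if overlaps(first["gridPos"], second["gridPos"]):
            problems.append(f"{first['title']} overlaps {second['title']}")
    return problems


panels = [
    {"title": "CPU Usage", "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}},
    {"title": "Total Pods", "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0}},
    {"title": "Pod List", "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}},
]
print(layout_problems(panels))
```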

Variable configuration

1. Namespace variable

json
{
  "name": "namespace",
  "type": "query",
  "datasource": "Prometheus",
  "definition": "label_values(kube_pod_info, namespace)",
  "query": "label_values(kube_pod_info, namespace)",
  "refresh": 1,
  "sort": 1,
  "multi": true,
  "includeAll": true,
  "allValue": ".*"
}

2. Pod variable

The namespace variable allows multiple values, so the pod query matches it with the regex operator `=~` rather than `=`.

json
{
  "name": "pod",
  "type": "query",
  "datasource": "Prometheus",
  "definition": "label_values(kube_pod_info{namespace=~\"$namespace\"}, pod)",
  "query": "label_values(kube_pod_info{namespace=~\"$namespace\"}, pod)",
  "refresh": 1,
  "sort": 1,
  "multi": true,
  "includeAll": true
}

3. Node variable

json
{
  "name": "node",
  "type": "query",
  "datasource": "Prometheus",
  "definition": "label_values(kube_node_info, node)",
  "query": "label_values(kube_node_info, node)",
  "refresh": 1,
  "sort": 1,
  "multi": true,
  "includeAll": true
}

4. Interval variable

json
{
  "name": "interval",
  "type": "interval",
  "options": [
    {"text": "1m", "value": "1m"},
    {"text": "5m", "value": "5m"},
    {"text": "10m", "value": "10m"},
    {"text": "30m", "value": "30m"},
    {"text": "1h", "value": "1h"}
  ],
  "auto": true,
  "auto_count": 30,
  "auto_min": "10s",
  "refresh": 2
}
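
Why multi-value variables call for regex matchers can be seen in a simplified model of variable substitution. The sketch assumes that a variable with several selected values expands to a regex alternation such as `(a|b)`. Grafana's real interpolation has more formats and escaping rules, and `interpolate` is a hypothetical helper.

```python
import re


def interpolate(expr: str, variables: dict[str, list[str]]) -> str:
    """Replace $name tokens; multiple values become a regex alternation."""

    def substitute(match: re.Match) -> str:
        values = variables.get(match.group(1))
        if not values:
            return match.group(0)  # leave unknown variables untouched
        if len(values) == 1:
            return values[0]
        return "(" + "|".join(re.escape(v) for v in values) + ")"

    return re.sub(r"\$(\w+)", substitute, expr)


selected = {"namespace": ["default", "monitoring"]}
print(interpolate('kube_pod_info{namespace=~"$namespace"}', selected))
print(interpolate('kube_pod_info{namespace="$namespace"}', selected))
```

The second query compares the label with the literal string `(default|monitoring)` and therefore matches nothing, which is the failure an `=` matcher produces once more than one value is selected.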

Practical Examples

Example 1: Kubernetes cluster monitoring dashboard

yaml
# k8s-cluster-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: k8s-cluster-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  kubernetes-cluster.json: |
    {
      "dashboard": {
        "id": null,
        "uid": "k8s-cluster",
        "title": "Kubernetes Cluster Overview",
        "tags": ["kubernetes"],
        "timezone": "browser",
        "refresh": "30s",
        "time": {
          "from": "now-1h",
          "to": "now"
        },
        "panels": [
          {
            "id": 1,
            "title": "Cluster CPU Usage",
            "type": "graph",
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
            "targets": [
              {
                "expr": "sum(rate(container_cpu_usage_seconds_total{container!=\"\"}[5m])) by (node)",
                "legendFormat": "{{node}}",
                "refId": "A"
              }
            ],
            "yaxes": [
              {"format": "short", "label": "Cores", "show": true}
            ]
          },
          {
            "id": 2,
            "title": "Cluster Memory Usage",
            "type": "graph",
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
            "targets": [
              {
                "expr": "sum(container_memory_working_set_bytes{container!=\"\"}) by (node)",
                "legendFormat": "{{node}}",
                "refId": "A"
              }
            ],
            "yaxes": [
              {"format": "bytes", "label": "Memory", "show": true}
            ]
          },
          {
            "id": 3,
            "title": "Pod Count",
            "type": "stat",
            "gridPos": {"h": 4, "w": 4, "x": 0, "y": 8},
            "targets": [
              {
                "expr": "count(kube_pod_info)",
                "refId": "A"
              }
            ]
          },
          {
            "id": 4,
            "title": "Node Count",
            "type": "stat",
            "gridPos": {"h": 4, "w": 4, "x": 4, "y": 8},
            "targets": [
              {
                "expr": "count(kube_node_info)",
                "refId": "A"
              }
            ]
          },
          {
            "id": 5,
            "title": "Namespace Count",
            "type": "stat",
            "gridPos": {"h": 4, "w": 4, "x": 8, "y": 8},
            "targets": [
              {
                "expr": "count(kube_namespace_created)",
                "refId": "A"
              }
            ]
          }
        ],
        "templating": {
          "list": [
            {
              "name": "node",
              "type": "query",
              "datasource": "Prometheus",
              "definition": "label_values(kube_node_info, node)",
              "query": "label_values(kube_node_info, node)",
              "refresh": 1,
              "sort": 1,
              "multi": true,
              "includeAll": true
            }
          ]
        }
      }
    }
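
Hand-indenting dashboard JSON inside a ConfigMap is error-prone. One alternative is to generate the manifest from the dashboard model. The sketch below emits the manifest as JSON, which `kubectl apply -f` accepts just like YAML. `dashboard_configmap` is a hypothetical helper, and the dashboard passed in is a trimmed stand-in for the one above.

```python
import json


def dashboard_configmap(name: str, filename: str, dashboard: dict, namespace: str = "monitoring") -> dict:
    """Build a ConfigMap manifest that carries one dashboard JSON file."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "labels": {"grafana_dashboard": "1"},
        },
        # ConfigMap data values must be strings, so the model is serialized here.
        "data": {filename: json.dumps(dashboard, indent=2)},
    }


dashboard = {"uid": "k8s-cluster", "title": "Kubernetes Cluster Overview", "panels": []}
manifest = dashboard_configmap("k8s-cluster-dashboard", "kubernetes-cluster.json", dashboard)
print(json.dumps(manifest, indent=2))
```

The printed document can be saved to a file and applied with `kubectl apply -f`.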

Example 2: Pod monitoring dashboard

yaml
# pod-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  pod-monitoring.json: |
    {
      "dashboard": {
        "id": null,
        "uid": "pod-monitor",
        "title": "Pod Monitoring",
        "tags": ["kubernetes", "pod"],
        "timezone": "browser",
        "refresh": "30s",
        "panels": [
          {
            "id": 1,
            "title": "Pod CPU Usage",
            "type": "graph",
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
            "targets": [
              {
                "expr": "sum(rate(container_cpu_usage_seconds_total{namespace=\"$namespace\", pod=\"$pod\", container!=\"\"}[5m])) by (container)",
                "legendFormat": "{{container}}",
                "refId": "A"
              }
            ]
          },
          {
            "id": 2,
            "title": "Pod Memory Usage",
            "type": "graph",
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
            "targets": [
              {
                "expr": "sum(container_memory_working_set_bytes{namespace=\"$namespace\", pod=\"$pod\", container!=\"\"}) by (container)",
                "legendFormat": "{{container}}",
                "refId": "A"
              }
            ]
          },
          {
            "id": 3,
            "title": "Network I/O",
            "type": "graph",
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
            "targets": [
              {
                "expr": "rate(container_network_receive_bytes_total{namespace=\"$namespace\", pod=\"$pod\"}[5m])",
                "legendFormat": "Receive",
                "refId": "A"
              },
              {
                "expr": "rate(container_network_transmit_bytes_total{namespace=\"$namespace\", pod=\"$pod\"}[5m])",
                "legendFormat": "Transmit",
                "refId": "B"
              }
            ]
          },
          {
            "id": 4,
            "title": "Restarts",
            "type": "stat",
            "gridPos": {"h": 4, "w": 4, "x": 0, "y": 16},
            "targets": [
              {
                "expr": "sum(kube_pod_container_status_restarts_total{namespace=\"$namespace\", pod=\"$pod\"})",
                "refId": "A"
              }
            ]
          }
        ],
        "templating": {
          "list": [
            {
              "name": "namespace",
              "type": "query",
              "datasource": "Prometheus",
              "definition": "label_values(kube_pod_info, namespace)",
              "query": "label_values(kube_pod_info, namespace)",
              "refresh": 1,
              "sort": 1
            },
            {
              "name": "pod",
              "type": "query",
              "datasource": "Prometheus",
              "definition": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)",
              "query": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)",
              "refresh": 1,
              "sort": 1
            }
          ]
        }
      }
    }
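
The panels in this dashboard depend on the `$namespace` and `$pod` variables being defined in `templating.list`. A quick consistency check can be sketched as follows. It assumes a bare dashboard model, that queries live in each target's `expr` field, and that names starting with `__` are Grafana built-ins. `undefined_variables` is a hypothetical helper.

```python
import re


def undefined_variables(dashboard: dict) -> set[str]:
    """Return variable names used in panel queries but missing from templating.list."""
    defined = {v["name"] for v in dashboard.get("templating", {}).get("list", [])}
    used: set[str] = set()
    for panel in dashboard.get("panels", []):
        for target in panel.get("targets", []):
            # Matches both $name and ${name} references.
            used.update(re.findall(r"\$\{?(\w+)", target.get("expr", "")))
    return {name for name in used if name not in defined and not name.startswith("__")}


dashboard = {
    "panels": [
        {"targets": [{"expr": 'rate(container_cpu_usage_seconds_total{namespace="$namespace", pod="$pod"}[$__rate_interval])'}]}
    ],
    "templating": {"list": [{"name": "namespace"}]},
}
print(undefined_variables(dashboard))
```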

Example 3: Application performance monitoring dashboard

yaml
# app-performance-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-performance-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  app-performance.json: |
    {
      "dashboard": {
        "id": null,
        "uid": "app-perf",
        "title": "Application Performance Monitoring",
        "tags": ["application", "performance"],
        "timezone": "browser",
        "refresh": "10s",
        "panels": [
          {
            "id": 1,
            "title": "Request Rate",
            "type": "graph",
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
            "targets": [
              {
                "expr": "sum(rate(http_requests_total[5m])) by (service)",
                "legendFormat": "{{service}}",
                "refId": "A"
              }
            ]
          },
          {
            "id": 2,
            "title": "Response Time (P95)",
            "type": "graph",
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
            "targets": [
              {
                "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
                "legendFormat": "{{service}}",
                "refId": "A"
              }
            ]
          },
          {
            "id": 3,
            "title": "Error Rate",
            "type": "graph",
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
            "targets": [
              {
                "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) * 100",
                "legendFormat": "{{service}}",
                "refId": "A"
              }
            ]
          },
          {
            "id": 4,
            "title": "Active Connections",
            "type": "stat",
            "gridPos": {"h": 4, "w": 4, "x": 0, "y": 16},
            "targets": [
              {
                "expr": "sum(app_active_connections)",
                "refId": "A"
              }
            ]
          },
          {
            "id": 5,
            "title": "Success Rate",
            "type": "gauge",
            "gridPos": {"h": 4, "w": 4, "x": 4, "y": 16},
            "targets": [
              {
                "expr": "sum(rate(http_requests_total{status=~\"2..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100",
                "refId": "A"
              }
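            ],
            "options": {"showThresholdMarkers": true},
            "fieldConfig": {
              "defaults": {
                "unit": "percent", "min": 0, "max": 100,
                "thresholds": {
                  "mode": "absolute",
                  "steps": [
                    {"color": "red", "value": null},
                    {"color": "yellow", "value": 95},
                    {"color": "green", "value": 99}
                  ]
                }
              }
            }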
          }
        ]
      }
    }
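
The "Response Time (P95)" panel relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below is a simplified reimplementation for illustration only. It skips several edge cases that Prometheus handles, and the bucket values are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """buckets: (upper bound `le`, cumulative count) pairs, including the +Inf bucket."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if not 0 <= q <= 1 or total == 0 or not math.isinf(buckets[-1][0]):
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # Rank falls in the +Inf bucket (or an empty one): report the lower bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# 100 requests: 50 under 0.1s, 90 under 0.5s, all under 1s.
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))
```

The 95th request falls halfway through the 0.5s to 1.0s bucket, so the estimate is about 0.75 seconds. The accuracy of the panel therefore depends on how finely the buckets are spaced around the latencies of interest.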

Alerting Configuration

Grafana alert rules

1. Create an alert rule

yaml
apiVersion: 1
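groups:
- orgId: 1
  name: kubernetes-alerts
  folder: Kubernetes   # assumed folder name; provisioned rule groups need a folder
  interval: 1m         # evaluation interval for the group
  rules:
  - uid: alert-1
    title: High CPU Usage
    condition: C
    data:
    - refId: A
      relativeTimeRange:
        from: 600
        to: 0
      datasourceUid: prometheus   # must match the uid of the Prometheus data source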
      model:
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
        instant: true
        intervalMs: 1000
        maxDataPoints: 43200
        refId: A
    - refId: B
      relativeTimeRange:
        from: 600
        to: 0
      datasourceUid: __expr__
      model:
        type: reduce
        expression: A
        reducer: last
        refId: B
    - refId: C
      relativeTimeRange:
        from: 600
        to: 0
      datasourceUid: __expr__
      model:
        type: threshold
        expression: B
        conditions:
        - evaluator:
            params:
            - 80
            type: gt
          operator:
            type: and
          query:
            params:
            - C
          type: query
        refId: C
    noDataState: NoData
    execErrState: Error
    for: 5m
    annotations:
      description: "CPU usage is above 80%"
      summary: "High CPU usage detected"
    labels:
      severity: warning
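
The rule above chains three steps: query A fetches CPU usage, expression B reduces it to the last value, and expression C compares that value with 80. The `for: 5m` setting then requires the condition to hold continuously before the alert fires. A simplified sketch of that state progression, assuming one evaluation per interval and ignoring the NoData and Error states (`alert_states` is a hypothetical helper):

```python
def alert_states(values: list[float], threshold: float, for_seconds: int, interval_seconds: int) -> list[str]:
    """Evaluate one reduced value per interval and return the state after each evaluation."""
    states = []
    breached_for = None  # seconds the condition has held continuously; None when not breached
    for value in values:
        if value > threshold:  # the `gt` evaluator in expression C
            breached_for = 0 if breached_for is None else breached_for + interval_seconds
            states.append("Alerting" if breached_for >= for_seconds else "Pending")
        else:
            breached_for = None
            states.append("Normal")
    return states


# CPU percentages sampled once a minute, threshold 80, for: 5m.
cpu = [70, 85, 88, 90, 91, 92, 95, 60]
print(alert_states(cpu, threshold=80, for_seconds=300, interval_seconds=60))
```

A single dip below the threshold resets the pending period, which is why short spikes do not page anyone.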

2. Configure contact points (notification channels)

yaml
# Unified alerting provisions contact points; the legacy `notifiers`
# format only applies to the old dashboard alerting.
apiVersion: 1
contactPoints:
- orgId: 1
  name: Slack
  receivers:
  - uid: slack-1
    type: slack
    settings:
      url: https://hooks.slack.com/services/TXXXXXXXX/BXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXX
      recipient: "#alerts"
      username: Grafana
      icon_emoji: ":grafana:"
- orgId: 1
  name: Email
  receivers:
  - uid: email-1
    type: email
    settings:
      addresses: admin@example.com,ops@example.com
- orgId: 1
  name: PagerDuty
  receivers:
  - uid: pagerduty-1
    type: pagerduty
    settings:
      integrationKey: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
      severity: critical

Alert notification template

yaml
apiVersion: 1
templates:
- name: default
  template: |
    {{ define "default.title" }}
    [{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}
    {{ end }}
    
    {{ define "default.message" }}
    Alert: {{ .CommonLabels.alertname }}
    Status: {{ .Status }}
    Severity: {{ .CommonLabels.severity }}
    
    {{ range .Alerts }}
    Summary: {{ .Annotations.summary }}
    Description: {{ .Annotations.description }}
    {{ end }}
    {{ end }}

kubectl Commands

Managing Grafana resources

bash
kubectl get all -n monitoring -l app=grafana

kubectl get pods -n monitoring -l app=grafana

kubectl logs -n monitoring -l app=grafana -f

kubectl describe pod -n monitoring -l app=grafana

kubectl exec -it -n monitoring <grafana-pod> -- sh

kubectl port-forward -n monitoring svc/grafana 3000:3000

Configuration management

bash
kubectl get configmap -n monitoring

kubectl describe configmap grafana-config -n monitoring

kubectl describe configmap grafana-datasources -n monitoring

kubectl edit configmap grafana-config -n monitoring

kubectl apply -f grafana-config.yaml

kubectl rollout restart deployment/grafana -n monitoring

Data persistence

bash
kubectl get pvc -n monitoring

kubectl describe pvc grafana-pvc -n monitoring

kubectl get pv | grep grafana

Secret management

bash
kubectl get secrets -n monitoring | grep grafana

kubectl describe secret grafana-credentials -n monitoring

kubectl create secret generic grafana-credentials \
  --from-literal=admin-user=admin \
  --from-literal=admin-password=newpassword \
  -n monitoring --dry-run=client -o yaml | kubectl apply -f -

Troubleshooting Guide

Issue 1: Grafana fails to start

Symptoms

bash
kubectl get pods -n monitoring -l app=grafana
NAME                      READY   STATUS             RESTARTS   AGE
grafana-xxx               0/1     CrashLoopBackOff   5          10m

Diagnostic steps

bash
kubectl logs -n monitoring -l app=grafana

kubectl describe pod -n monitoring -l app=grafana

kubectl get events -n monitoring --sort-by='.lastTimestamp'

kubectl exec -n monitoring <grafana-pod> -- ls -la /var/lib/grafana

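Solutions

  • Check the configuration file syntax
  • Verify storage volume permissions
  • Check resource limits
  • Review the environment variable configuration

Issue 2: Cannot connect to a data source

Symptoms

  • Dashboards show "No data"
  • The data source test fails

Diagnostic steps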

bash
kubectl get svc -n monitoring

kubectl exec -n monitoring <grafana-pod> -- nslookup prometheus

kubectl exec -n monitoring <grafana-pod> -- wget -qO- http://prometheus:9090/-/healthy

kubectl logs -n monitoring <grafana-pod> | grep -i datasource

kubectl describe configmap grafana-datasources -n monitoring

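Solutions

  • Check the data source URL
  • Verify network policies
  • Confirm the Service name is correct
  • Check DNS resolution

Issue 3: Dashboards fail to load

Symptoms

  • The dashboard list is empty
  • Dashboards display errors

Diagnostic steps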

bash
kubectl get configmap -n monitoring -l grafana_dashboard

kubectl describe configmap k8s-cluster-dashboard -n monitoring

kubectl exec -n monitoring <grafana-pod> -- ls -la /etc/grafana/provisioning/dashboards

kubectl logs -n monitoring <grafana-pod> | grep -i dashboard
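
Malformed dashboard JSON is a frequent cause of this issue, and a file can be checked before it goes into a ConfigMap. The sketch below flags a few common mistakes: invalid JSON, an API-style `dashboard` envelope in a provisioned file, a missing title, and duplicate panel ids. It is not an exhaustive validator, and `dashboard_file_errors` is a hypothetical helper.

```python
import json
from collections import Counter


def dashboard_file_errors(text: str) -> list[str]:
    """Return a list of problems found in a provisioned dashboard file."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} (line {exc.lineno})"]
    if not isinstance(doc, dict):
        return ["top level must be a JSON object"]
    errors = []
    if isinstance(doc.get("dashboard"), dict) and "title" not in doc:
        errors.append("wrapped in an API-style 'dashboard' envelope")
        doc = doc["dashboard"]  # keep checking the inner model
    if not doc.get("title"):
        errors.append("missing title")
    ids = Counter(p.get("id") for p in doc.get("panels", []) if p.get("id") is not None)
    errors.extend(f"duplicate panel id {pid}" for pid, n in ids.items() if n > 1)
    return errors


print(dashboard_file_errors('{"title": "Pods", "panels": [{"id": 1}, {"id": 1}]}'))
print(dashboard_file_errors('{"dashboard": {"title": "Pods"}}'))
```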

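Solutions

  • Check the dashboard JSON format
  • Verify the ConfigMap labels
  • Confirm the mount path is correct
  • Check data source references

Issue 4: Login fails

Symptoms

  • Cannot log in to Grafana
  • An invalid username or password error is shown

Diagnostic steps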

bash
kubectl get secret -n monitoring grafana-credentials

kubectl describe secret grafana-credentials -n monitoring

kubectl logs -n monitoring <grafana-pod> | grep -i auth

kubectl exec -n monitoring <grafana-pod> -- cat /etc/grafana/grafana.ini | grep -A 10 security

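Solution

bash
# Recreate the credentials Secret and restart Grafana.
kubectl delete secret grafana-credentials -n monitoring

kubectl create secret generic grafana-credentials \
  --from-literal=admin-user=admin \
  --from-literal=admin-password=admin123 \
  -n monitoring

kubectl rollout restart deployment/grafana -n monitoring

# GF_SECURITY_ADMIN_PASSWORD only applies when the database is first initialized.
# If the admin user already exists, reset the password inside the pod instead.
kubectl exec -n monitoring <grafana-pod> -- grafana-cli admin reset-admin-password <new-password>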

Issue 5: Alerts are not sent

Symptoms

  • Alerts fire but no notification arrives
  • The contact point is misconfigured

Diagnostic steps

bash
kubectl exec -n monitoring <grafana-pod> -- cat /etc/grafana/provisioning/alerting/*.yaml

kubectl logs -n monitoring <grafana-pod> | grep -i alert

# Lists provisioned contact points; the request must be authenticated as an admin.
kubectl exec -n monitoring <grafana-pod> -- wget -qO- http://localhost:3000/api/v1/provisioning/contact-points

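Solutions

  • Check the contact point configuration
  • Verify the webhook URL
  • Confirm the alert rule is correct
  • Check network connectivity

Best Practices

1. Dashboard design best practices

Layout

Top: key metric overview (Stat panels)
Middle: detailed monitoring charts (Graph panels)
Bottom: detailed data tables (Table panels)

Color usage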

yaml
Color scheme:
  - Normal: green
  - Warning: yellow
  - Critical: red
  
Thresholds:
  - CPU: yellow at 70%, red at 90%
  - Memory: yellow at 75%, red at 90%
  - Disk: yellow at 80%, red at 95%
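
The threshold table above amounts to a simple lookup. As a sketch, with a hypothetical helper name and the assumption that a value exactly on a boundary takes the more severe color:

```python
# (limit in percent, color) pairs, most severe first, taken from the table above.
THRESHOLDS = {
    "cpu": [(90, "red"), (70, "yellow")],
    "memory": [(90, "red"), (75, "yellow")],
    "disk": [(95, "red"), (80, "yellow")],
}


def status_color(resource: str, percent: float) -> str:
    """Map a utilization percentage to the color used on the dashboard."""
    for limit, color in THRESHOLDS[resource]:
        if percent >= limit:
            return color
    return "green"


print(status_color("cpu", 72))
print(status_color("disk", 72))
print(status_color("memory", 93))
```

Keeping the same mapping across dashboards means a red panel always carries the same meaning for the on-call engineer.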

Using variables

json
{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "refresh": 1,
        "multi": true,
        "includeAll": true
      }
    ]
  }
}

2. Data source best practices

Managing multiple data sources

yaml
datasources:
- name: Prometheus-Production
  type: prometheus
  url: http://prometheus-prod:9090
- name: Prometheus-Staging
  type: prometheus
  url: http://prometheus-staging:9090

Performance tuning

yaml
jsonData:
  timeInterval: "15s"
  httpMethod: POST
  cacheLevel: 'High'
  incrementalQuerying: true
  incrementalQueryOverlapWindow: 10m

3. Security best practices

User permission management

ini
[auth]
disable_login_form = false

[auth.anonymous]
enabled = false

[users]
allow_sign_up = false
auto_assign_org_role = Viewer

Secret management

bash
kubectl create secret generic grafana-credentials \
  --from-literal=admin-user=admin \
  --from-literal=admin-password=$(openssl rand -base64 32) \
  -n monitoring
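
The same idea can be expressed without `openssl`. The sketch below generates a random password and base64-encodes both values the way the `data` field of a Secret manifest expects (unlike `stringData`, which takes plain text). `secret_data` is a hypothetical helper, and the key names follow the `grafana-credentials` Secret used earlier.

```python
import base64
import secrets


def secret_data(username: str, password_bytes: int = 32) -> dict[str, str]:
    """Return base64-encoded values for the `data` field of a Kubernetes Secret."""
    password = secrets.token_urlsafe(password_bytes)  # cryptographically strong random text
    return {
        "admin-user": base64.b64encode(username.encode()).decode(),
        "admin-password": base64.b64encode(password.encode()).decode(),
    }


data = secret_data("admin")
print(sorted(data))  # only the key names are printed, never the generated password
```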

Network policy

yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: grafana-network-policy
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: grafana
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector: {}
    ports:
    - protocol: TCP
      port: 3000
  egress:
  - to:
    - namespaceSelector: {}

4. Performance best practices

Resource configuration

yaml
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

Cache configuration

ini
# Query caching ([caching]) is a Grafana Enterprise and Grafana Cloud feature.
[caching]
enabled = true
ttl = 5m

[remote_cache]
type = redis
connstr = addr=redis:6379,pool_size=100,db=0

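Dashboard optimization

yaml
Best practices:
  - Limit the number of panels (no more than 30 per dashboard)
  - Use variables to avoid repeating queries
  - Set a reasonable refresh interval
  - Avoid high-cardinality labels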
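
Two of these guidelines are easy to check mechanically. The sketch below counts panels and parses the `refresh` setting of a bare dashboard model. The 30-panel limit comes from the list above, while the 30-second minimum refresh is an assumed value chosen for illustration, and `lint_dashboard` is a hypothetical helper.

```python
import re

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def refresh_seconds(refresh: str) -> int:
    """Convert a refresh string such as "30s" or "5m" to seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", refresh)
    if match is None:
        raise ValueError(f"unsupported refresh value: {refresh!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def lint_dashboard(dashboard: dict, max_panels: int = 30, min_refresh: int = 30) -> list[str]:
    warnings = []
    panel_count = len(dashboard.get("panels", []))
    if panel_count > max_panels:
        warnings.append(f"{panel_count} panels (limit {max_panels})")
    refresh = dashboard.get("refresh")
    if refresh and refresh_seconds(refresh) < min_refresh:
        warnings.append(f"refresh {refresh} is below {min_refresh}s")
    return warnings


print(lint_dashboard({"panels": [{}] * 31, "refresh": "10s"}))
```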

5. High availability best practices

Multi-replica deployment

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
spec:
  replicas: 2
  selector:
    matchLabels:
      app: grafana
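  template:
    metadata:
      labels:
        app: grafana   # must match spec.selector
    spec:
      # containers and volumes omitted; see the Deployment in Option 1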
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: grafana
              topologyKey: kubernetes.io/hostname

Database configuration

ini
[database]
type = postgres
host = postgres:5432
name = grafana
user = grafana
password = ${GF_DATABASE_PASSWORD}
ssl_mode = disable

Dashboard Template Library

Kubernetes cluster dashboard (community)

text
Import ID: 315
Name: Kubernetes cluster monitoring (via Prometheus)

Node Exporter Dashboard

text
Import ID: 1860
Name: Node Exporter Full

Nginx Ingress Dashboard

text
Import ID: 9614
Name: Nginx Ingress Controller

Importing a custom dashboard

text
Import steps:
  1. Grafana UI -> Dashboards -> Import
  2. Enter the dashboard ID or upload a JSON file
  3. Select the data source
  4. Click Import

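Summary

This chapter covered the core concepts and practical use of Grafana as a visualization platform:

  1. Architecture: Grafana's core components and concepts
  2. Deployment: deploying with a Deployment manifest or Helm, and managing configuration
  3. Data sources: configuring Prometheus, Loki, Elasticsearch, and MySQL
  4. Dashboard design: building dashboards, panels, and variables
  5. Alerting: configuring alert rules and contact points
  6. Practical examples: complete dashboards for clusters, pods, and applications
  7. Troubleshooting: diagnosing and resolving common problems

Grafana is one of the most widely used visualization tools in Kubernetes monitoring stacks and gives operators a clear view of cluster and application health.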