Grafana
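Overview
Grafana is an open-source data visualization and monitoring platform. It supports many data sources, offers a wide range of visualization options, and provides flexible dashboard design. It integrates with monitoring systems such as Prometheus and InfluxDB and is one of the most widely used visualization tools in Kubernetes monitoring stacks.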
Grafana Architecture
Core Components
┌─────────────────────────────────────────────────────────┐
│                     Grafana Server                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │    Web UI    │  │  Dashboard   │  │   Alerting   │   │
│  │              │  │    Engine    │  │    Engine    │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
└─────────────────────────────────────────────────────────┘
        ↓                  ↓                   ↓
┌─────────────┐    ┌──────────────┐    ┌──────────────────┐
│ Data Source │    │    Plugin    │    │   Notification   │
│ (Prometheus)│    │    System    │    │     Channels     │
└─────────────┘    └──────────────┘    └──────────────────┘
        ↓                  ↓                   ↓
┌─────────────┐    ┌──────────────┐    ┌──────────────────┐
│  Database   │    │     User     │    │       API        │
│ (SQLite/PG) │    │  Management  │    │    Endpoints     │
└─────────────┘    └──────────────┘    └──────────────────┘
Core Concepts
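1. Dashboard
A dashboard is the central concept in Grafana: a view made up of multiple panels that visualizes monitoring data.
2. Panel
A panel is the basic building block of a dashboard. Each panel displays one or more metrics and supports multiple visualization types.
3. Data Source
A data source is where Grafana gets its data. Supported sources include Prometheus, InfluxDB, MySQL, and many others.
4. Organization
An organization is Grafana's unit of multi-tenant isolation. Each organization has its own dashboards and data source configuration.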
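The relationship between these four concepts can be sketched as a small data model. This is an illustrative Python sketch only, not Grafana's internal schema: the class names, field names, and the `datasources_in_use` helper are assumptions chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class DataSource:
    name: str  # e.g. "Prometheus"
    type: str  # e.g. "prometheus"


@dataclass
class Panel:
    title: str
    type: str               # visualization type, e.g. "stat"
    datasource: DataSource  # where this panel's queries run
    queries: List[str] = field(default_factory=list)


@dataclass
class Dashboard:
    title: str
    panels: List[Panel] = field(default_factory=list)


@dataclass
class Organization:
    name: str
    datasources: List[DataSource] = field(default_factory=list)
    dashboards: List[Dashboard] = field(default_factory=list)

    def datasources_in_use(self) -> Set[str]:
        """Names of data sources referenced by at least one panel."""
        return {p.datasource.name for d in self.dashboards for p in d.panels}


prom = DataSource(name="Prometheus", type="prometheus")
loki = DataSource(name="Loki", type="loki")
overview = Dashboard(
    title="Cluster Overview",
    panels=[Panel("Total Pods", "stat", prom, ["count(kube_pod_info)"])],
)
org = Organization("Main Org.", datasources=[prom, loki], dashboards=[overview])
print(org.datasources_in_use())
```

In this toy model the organization owns both data sources, but only Prometheus is referenced by a panel, which is the kind of question the hierarchy makes easy to answer.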
Deploying Grafana
Method 1: Deploy with a Deployment
1. Create the ConfigMap
yaml
# grafana-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-config
  namespace: monitoring
data:
  grafana.ini: |
    [server]
    root_url = http://localhost:3000
    serve_from_sub_path = false
    [database]
    type = sqlite3
    [security]
    # Demo credentials only. The Deployment in step 4 overrides admin_user and
    # admin_password through GF_SECURITY_* environment variables read from a Secret.
    admin_user = admin
    admin_password = admin123
    # This is Grafana's shipped default secret_key; replace it in production.
    secret_key = SW2YcwTIb9zpOOhoPsMm
    [auth.anonymous]
    enabled = true
    org_name = Main Org.
    org_role = Viewer
    [auth.basic]
    enabled = true
    [dashboards]
    default_home_dashboard_path = /var/lib/grafana/dashboards/home.json
    [users]
    default_theme = dark
    allow_sign_up = false
    allow_org_create = false
    auto_assign_org = true
    auto_assign_org_role = Viewer
    [alerting]
    enabled = true
    execute_alerts = true
    [plugins]
    allow_loading_unsigned_plugins = prometheus
2. Create the data source configuration
yaml
# grafana-datasources.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
  labels:
    grafana_datasource: "1"
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus:9090
        isDefault: true
        editable: true
        jsonData:
          timeInterval: "15s"
          httpMethod: POST
          manageAlerts: true
          prometheusType: Prometheus
      - name: Loki
        type: loki
        access: proxy
        url: http://loki:3100
        editable: true
      - name: Alertmanager
        type: alertmanager
        access: proxy
        url: http://alertmanager:9093
        jsonData:
          implementation: prometheus
3. Create the dashboard configuration
yaml
# grafana-dashboards.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  # Provider definition: without one, Grafana ignores JSON files in the provisioning directory.
  dashboards.yaml: |
    apiVersion: 1
    providers:
      - name: default
        type: file
        options:
          path: /etc/grafana/provisioning/dashboards
  # Bare dashboard model, without a {"dashboard": ...} envelope, as file provisioning expects.
  kubernetes-cluster.json: |
    {
      "id": null,
      "title": "Kubernetes Cluster Monitoring",
      "tags": ["kubernetes", "cluster"],
      "timezone": "browser",
      "panels": [
        {
          "id": 1,
          "title": "Cluster CPU Usage",
          "type": "graph",
          "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
          "targets": [
            {
              "expr": "sum(rate(container_cpu_usage_seconds_total{container!=\"\"}[5m])) by (node)",
              "legendFormat": "{{node}}",
              "refId": "A"
            }
          ]
        },
        {
          "id": 2,
          "title": "Cluster Memory Usage",
          "type": "graph",
          "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
          "targets": [
            {
              "expr": "sum(container_memory_working_set_bytes{container!=\"\"}) by (node)",
              "legendFormat": "{{node}}",
              "refId": "A"
            }
          ]
        }
      ]
    }
4. Deploy Grafana
yaml
# grafana-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      # The official image runs Grafana as UID 472; fsGroup makes the PVC writable for it.
      securityContext:
        fsGroup: 472
      containers:
        - name: grafana
          image: grafana/grafana:10.0.0
          ports:
            - containerPort: 3000
              name: web
          env:
            - name: GF_SECURITY_ADMIN_USER
              valueFrom:
                secretKeyRef:
                  name: grafana-credentials
                  key: admin-user
            - name: GF_SECURITY_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: grafana-credentials
                  key: admin-password
            - name: GF_INSTALL_PLUGINS
              value: "grafana-clock-panel,grafana-piechart-panel"
          volumeMounts:
            - name: config
              mountPath: /etc/grafana
            - name: storage
              mountPath: /var/lib/grafana
            - name: datasources
              mountPath: /etc/grafana/provisioning/datasources
            - name: dashboards
              mountPath: /etc/grafana/provisioning/dashboards
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /api/health
              port: web
            initialDelaySeconds: 30
            timeoutSeconds: 10
          readinessProbe:
            httpGet:
              path: /api/health
              port: web
            initialDelaySeconds: 5
            timeoutSeconds: 10
      volumes:
        - name: config
          configMap:
            name: grafana-config
        - name: storage
          persistentVolumeClaim:
            claimName: grafana-pvc
        - name: datasources
          configMap:
            name: grafana-datasources
        - name: dashboards
          configMap:
            name: grafana-dashboards
---
apiVersion: v1
kind: Secret
metadata:
  name: grafana-credentials
  namespace: monitoring
type: Opaque
stringData:
  admin-user: admin
  admin-password: admin123
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-pvc
  namespace: monitoring
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  storageClassName: standard
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitoring
spec:
  type: NodePort
  ports:
    - port: 3000
      targetPort: web
      nodePort: 30300
      name: web
  selector:
    app: grafana
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
  # No rewrite-target annotation: Grafana is served from /, so request paths
  # must reach it unchanged.
spec:
  rules:
    - host: grafana.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: grafana
                port:
                  number: 3000
Deployment commands
bash
kubectl apply -f grafana-config.yaml
kubectl apply -f grafana-datasources.yaml
kubectl apply -f grafana-dashboards.yaml
kubectl apply -f grafana-deployment.yaml
kubectl get pods -n monitoring -l app=grafana
kubectl get svc -n monitoring grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000
Method 2: Deploy with Helm
bash
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install grafana grafana/grafana \
--namespace monitoring \
--set persistence.enabled=true \
--set persistence.size=5Gi \
--set adminPassword=admin123 \
--set service.type=NodePort \
--set service.nodePort=30300
Data Source Configuration
Prometheus Data Source
Manual configuration
yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: "15s"
      httpMethod: POST
      manageAlerts: true
      prometheusType: Prometheus
      prometheusVersion: "2.45.0"
      cacheLevel: 'High'
      incrementalQuerying: true
      incrementalQueryOverlapWindow: 10m
      disableRecordingRules: false
Configuration notes
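- access: access mode; proxy means requests go through the Grafana server
- url: address of the Prometheus service
- isDefault: whether this is the default data source
- editable: whether the data source can be edited in the UI
- timeInterval: the Prometheus scrape interval, which Grafana uses as the lower bound for the query step
- httpMethod: HTTP method for queries; POST is recommended because it allows queries that would exceed URL length limits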
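The fields above lend themselves to a quick sanity check before a provisioning file is applied. The following is a minimal sketch, assuming the data source entries have already been loaded into Python dictionaries; `check_datasources` is a hypothetical helper written for this example, and its rules (one default data source, an http(s) URL for Prometheus, a known access mode and HTTP method) are a simplification rather than Grafana's own validation.

```python
from typing import Dict, List
from urllib.parse import urlparse


def check_datasources(datasources: List[Dict]) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    defaults = [ds.get("name") for ds in datasources if ds.get("isDefault")]
    if len(defaults) > 1:
        problems.append(f"more than one default data source: {defaults}")
    for ds in datasources:
        name = ds.get("name", "<unnamed>")
        if ds.get("access", "proxy") not in ("proxy", "direct"):
            problems.append(f"{name}: unknown access mode {ds.get('access')!r}")
        if ds.get("type") == "prometheus":
            parsed = urlparse(ds.get("url", ""))
            if parsed.scheme not in ("http", "https") or not parsed.hostname:
                problems.append(f"{name}: url should look like http(s)://host:port")
            method = ds.get("jsonData", {}).get("httpMethod", "POST")
            if method not in ("GET", "POST"):
                problems.append(f"{name}: httpMethod must be GET or POST")
    return problems


datasources = [
    {"name": "Prometheus", "type": "prometheus", "access": "proxy",
     "url": "http://prometheus:9090", "isDefault": True,
     "jsonData": {"timeInterval": "15s", "httpMethod": "POST"}},
    # Two mistakes on purpose: a second default and a URL without a scheme.
    {"name": "Prometheus-Staging", "type": "prometheus", "access": "proxy",
     "url": "prometheus-staging:9090", "isDefault": True},
]
problems = check_datasources(datasources)
for problem in problems:
    print(problem)
```

The second entry is flagged twice, once for competing with the first as the default and once for the missing URL scheme.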
Loki Data Source
yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: true
    jsonData:
      maxLines: 1000
      derivedFields:
        - name: TraceID
          matcherRegex: '"traceId":"(\w+)"'
          url: '$${__value.raw}'
          datasourceUid: tempo
Elasticsearch Data Source
yaml
apiVersion: 1
datasources:
  - name: Elasticsearch
    type: elasticsearch
    access: proxy
    url: http://elasticsearch:9200
    database: "logstash-*"
    jsonData:
      esVersion: "7.10.0"
      timeField: "@timestamp"
      interval: Daily
      logMessageField: message
      logLevelField: log.level
MySQL Data Source
yaml
apiVersion: 1
datasources:
  - name: MySQL
    type: mysql
    access: proxy
    url: mysql:3306
    database: monitoring
    user: grafana
    jsonData:
      maxOpenConns: 10
      maxIdleConns: 5
      connMaxLifetime: 14400
    secureJsonData:
      password: password123
Dashboard Design
Dashboard Structure
json
{
  "dashboard": {
    "id": null,
    "uid": "kubernetes-cluster",
    "title": "Kubernetes Cluster Monitoring",
    "description": "Monitor Kubernetes cluster resources",
    "tags": ["kubernetes", "cluster"],
    "timezone": "browser",
    "refresh": "30s",
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "panels": [],
    "templating": {
      "list": []
    }
  }
}
Panel Types
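Every panel type in this section carries a gridPos object (h, w, x, y) that places it on the dashboard grid, which is 24 columns wide. As a rough illustration of how those coordinates fit together, the sketch below lays panels out left to right and wraps to a new row when the next panel would not fit. The `layout` function and its wrapping rule are assumptions made for this example; Grafana itself simply takes whatever gridPos values the JSON provides.

```python
from typing import Dict, List

GRID_COLUMNS = 24  # width of Grafana's dashboard grid


def layout(sizes: List[Dict[str, int]]) -> List[Dict[str, int]]:
    """Place panels left to right, wrapping to a new row when one does not fit."""
    positions = []
    x = y = row_height = 0
    for size in sizes:
        w, h = size["w"], size["h"]
        if w > GRID_COLUMNS:
            raise ValueError(f"panel width {w} exceeds {GRID_COLUMNS} columns")
        if x + w > GRID_COLUMNS:
            # Start a new row below the tallest panel of the current row.
            x, y, row_height = 0, y + row_height, 0
        positions.append({"h": h, "w": w, "x": x, "y": y})
        x += w
        row_height = max(row_height, h)
    return positions


grid = layout([
    {"w": 12, "h": 8},  # e.g. a CPU graph
    {"w": 12, "h": 8},  # e.g. a memory graph
    {"w": 4, "h": 4},   # e.g. a stat panel
    {"w": 4, "h": 4},   # e.g. another stat panel
])
for pos in grid:
    print(pos)
```

Two half-width graphs fill the first row, and the two small stat panels start a second row at y = 8.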
1. Graph Panel (line chart)
json
{
  "id": 1,
  "title": "CPU Usage",
  "type": "graph",
  "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
  "targets": [
    {
      "expr": "sum(rate(container_cpu_usage_seconds_total{container!=\"\"}[5m])) by (pod)",
      "legendFormat": "{{pod}}",
      "refId": "A"
    }
  ],
  "yaxes": [
    {
      "format": "short",
      "label": "CPU",
      "logBase": 1,
      "show": true
    }
  ],
  "legend": {
    "alignAsTable": true,
    "avg": true,
    "current": true,
    "max": true,
    "show": true,
    "total": false,
    "values": true
  }
}
2. Stat Panel (single value)
json
{
  "id": 2,
  "title": "Total Pods",
  "type": "stat",
  "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
  "targets": [
    {
      "expr": "count(kube_pod_info)",
      "refId": "A"
    }
  ],
  "options": {
    "colorMode": "value",
    "graphMode": "area",
    "reduceOptions": {
      "calcs": ["lastNotNull"],
      "fields": "",
      "values": false
    }
  }
}
3. Table Panel
json
{
  "id": 3,
  "title": "Pod List",
  "type": "table",
  "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
  "targets": [
    {
      "expr": "kube_pod_info",
      "format": "table",
      "instant": true,
      "refId": "A"
    }
  ],
  "transformations": [
    {
      "id": "organize",
      "options": {
        "excludeByName": {
          "Time": true,
          "__name__": true
        },
        "indexByName": {},
        "renameByName": {
          "pod": "Pod Name",
          "namespace": "Namespace",
          "node": "Node"
        }
      }
    }
  ]
}
4. Pie Chart Panel
json
{
  "id": 4,
  "title": "Resource Distribution",
  "type": "piechart",
  "gridPos": {"h": 8, "w": 6, "x": 0, "y": 0},
  "targets": [
    {
      "expr": "sum(kube_pod_container_resource_requests{resource=\"cpu\"}) by (namespace)",
      "legendFormat": "{{namespace}}",
      "refId": "A"
    }
  ],
  "options": {
    "legend": {
      "displayMode": "table",
      "placement": "right",
      "values": ["value", "percentage"]
    },
    "pieType": "pie",
    "displayLabels": ["percent"]
  }
}
Variable Configuration
1. Namespace variable
json
{
  "name": "namespace",
  "type": "query",
  "datasource": "Prometheus",
  "definition": "label_values(kube_pod_info, namespace)",
  "query": "label_values(kube_pod_info, namespace)",
  "refresh": 1,
  "sort": 1,
  "multi": true,
  "includeAll": true,
  "allValue": ".*"
}
2. Pod variable
Because the namespace variable allows multiple values and an All option, it must be matched with the regex matcher =~ rather than =.
json
{
  "name": "pod",
  "type": "query",
  "datasource": "Prometheus",
  "definition": "label_values(kube_pod_info{namespace=~\"$namespace\"}, pod)",
  "query": "label_values(kube_pod_info{namespace=~\"$namespace\"}, pod)",
  "refresh": 1,
  "sort": 1,
  "multi": true,
  "includeAll": true
}
3. Node variable
json
{
  "name": "node",
  "type": "query",
  "datasource": "Prometheus",
  "definition": "label_values(kube_node_info, node)",
  "query": "label_values(kube_node_info, node)",
  "refresh": 1,
  "sort": 1,
  "multi": true,
  "includeAll": true
}
4. Interval variable
json
{
  "name": "interval",
  "type": "interval",
  "options": [
    {"text": "1m", "value": "1m"},
    {"text": "5m", "value": "5m"},
    {"text": "10m", "value": "10m"},
    {"text": "30m", "value": "30m"},
    {"text": "1h", "value": "1h"}
  ],
  "auto": true,
  "auto_count": 30,
  "auto_min": "10s",
  "refresh": 2
}
Practical Examples
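The examples in this section substitute dashboard variables such as $namespace into PromQL. The sketch below is a simplified imitation of that substitution, written to show why a variable with several selected values needs the regex matcher =~. The `interpolate` function is hypothetical and only approximates what the Prometheus data source does (it joins escaped values with | inside parentheses); Grafana's real formatting rules have more cases.

```python
import re
from typing import Dict, List


def interpolate(query: str, variables: Dict[str, List[str]]) -> str:
    """Replace $name with one value, or with (a|b|c) when several are selected."""
    def render(match):
        name = match.group(1)
        if name not in variables:
            return match.group(0)  # leave unknown variables untouched
        escaped = [re.escape(value) for value in variables[name]]
        return escaped[0] if len(escaped) == 1 else "(" + "|".join(escaped) + ")"

    return re.sub(r"\$(\w+)", render, query)


selection = {"namespace": ["default", "monitoring"]}
regex_query = interpolate('kube_pod_info{namespace=~"$namespace"}', selection)
exact_query = interpolate('kube_pod_info{namespace="$namespace"}', selection)
print(regex_query)  # kube_pod_info{namespace=~"(default|monitoring)"}
print(exact_query)  # kube_pod_info{namespace="(default|monitoring)"}
```

The second query compares the label against the literal string (default|monitoring), which no namespace is called, so a panel written with = shows no data as soon as two values are selected.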
Example 1: Kubernetes Cluster Monitoring Dashboard
yaml
# k8s-cluster-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: k8s-cluster-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  # Bare dashboard model, without a {"dashboard": ...} envelope, as file provisioning expects.
  kubernetes-cluster.json: |
    {
      "id": null,
      "uid": "k8s-cluster",
      "title": "Kubernetes Cluster Overview",
      "tags": ["kubernetes"],
      "timezone": "browser",
      "refresh": "30s",
      "time": {
        "from": "now-1h",
        "to": "now"
      },
      "panels": [
        {
          "id": 1,
          "title": "Cluster CPU Usage",
          "type": "graph",
          "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
          "targets": [
            {
              "expr": "sum(rate(container_cpu_usage_seconds_total{container!=\"\"}[5m])) by (node)",
              "legendFormat": "{{node}}",
              "refId": "A"
            }
          ],
          "yaxes": [
            {"format": "short", "label": "Cores", "show": true}
          ]
        },
        {
          "id": 2,
          "title": "Cluster Memory Usage",
          "type": "graph",
          "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
          "targets": [
            {
              "expr": "sum(container_memory_working_set_bytes{container!=\"\"}) by (node)",
              "legendFormat": "{{node}}",
              "refId": "A"
            }
          ],
          "yaxes": [
            {"format": "bytes", "label": "Memory", "show": true}
          ]
        },
        {
          "id": 3,
          "title": "Pod Count",
          "type": "stat",
          "gridPos": {"h": 4, "w": 4, "x": 0, "y": 8},
          "targets": [
            {"expr": "count(kube_pod_info)", "refId": "A"}
          ]
        },
        {
          "id": 4,
          "title": "Node Count",
          "type": "stat",
          "gridPos": {"h": 4, "w": 4, "x": 4, "y": 8},
          "targets": [
            {"expr": "count(kube_node_info)", "refId": "A"}
          ]
        },
        {
          "id": 5,
          "title": "Namespace Count",
          "type": "stat",
          "gridPos": {"h": 4, "w": 4, "x": 8, "y": 8},
          "targets": [
            {"expr": "count(kube_namespace_created)", "refId": "A"}
          ]
        }
      ],
      "templating": {
        "list": [
          {
            "name": "node",
            "type": "query",
            "datasource": "Prometheus",
            "definition": "label_values(kube_node_info, node)",
            "query": "label_values(kube_node_info, node)",
            "refresh": 1,
            "sort": 1,
            "multi": true,
            "includeAll": true
          }
        ]
      }
    }
Example 2: Pod Monitoring Dashboard
yaml
# pod-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  # Bare dashboard model, without a {"dashboard": ...} envelope, as file provisioning expects.
  pod-monitoring.json: |
    {
      "id": null,
      "uid": "pod-monitor",
      "title": "Pod Monitoring",
      "tags": ["kubernetes", "pod"],
      "timezone": "browser",
      "refresh": "30s",
      "panels": [
        {
          "id": 1,
          "title": "Pod CPU Usage",
          "type": "graph",
          "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
          "targets": [
            {
              "expr": "sum(rate(container_cpu_usage_seconds_total{namespace=\"$namespace\", pod=\"$pod\", container!=\"\"}[5m])) by (container)",
              "legendFormat": "{{container}}",
              "refId": "A"
            }
          ]
        },
        {
          "id": 2,
          "title": "Pod Memory Usage",
          "type": "graph",
          "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
          "targets": [
            {
              "expr": "sum(container_memory_working_set_bytes{namespace=\"$namespace\", pod=\"$pod\", container!=\"\"}) by (container)",
              "legendFormat": "{{container}}",
              "refId": "A"
            }
          ]
        },
        {
          "id": 3,
          "title": "Network I/O",
          "type": "graph",
          "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
          "targets": [
            {
              "expr": "rate(container_network_receive_bytes_total{namespace=\"$namespace\", pod=\"$pod\"}[5m])",
              "legendFormat": "Receive",
              "refId": "A"
            },
            {
              "expr": "rate(container_network_transmit_bytes_total{namespace=\"$namespace\", pod=\"$pod\"}[5m])",
              "legendFormat": "Transmit",
              "refId": "B"
            }
          ]
        },
        {
          "id": 4,
          "title": "Restarts",
          "type": "stat",
          "gridPos": {"h": 4, "w": 4, "x": 0, "y": 16},
          "targets": [
            {
              "expr": "sum(kube_pod_container_status_restarts_total{namespace=\"$namespace\", pod=\"$pod\"})",
              "refId": "A"
            }
          ]
        }
      ],
      "templating": {
        "list": [
          {
            "name": "namespace",
            "type": "query",
            "datasource": "Prometheus",
            "definition": "label_values(kube_pod_info, namespace)",
            "query": "label_values(kube_pod_info, namespace)",
            "refresh": 1,
            "sort": 1
          },
          {
            "name": "pod",
            "type": "query",
            "datasource": "Prometheus",
            "definition": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)",
            "query": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)",
            "refresh": 1,
            "sort": 1
          }
        ]
      }
    }
Example 3: Application Performance Monitoring Dashboard
yaml
# app-performance-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-performance-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  # Bare dashboard model, without a {"dashboard": ...} envelope, as file provisioning expects.
  app-performance.json: |
    {
      "id": null,
      "uid": "app-perf",
      "title": "Application Performance Monitoring",
      "tags": ["application", "performance"],
      "timezone": "browser",
      "refresh": "10s",
      "panels": [
        {
          "id": 1,
          "title": "Request Rate",
          "type": "graph",
          "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
          "targets": [
            {
              "expr": "sum(rate(http_requests_total[5m])) by (service)",
              "legendFormat": "{{service}}",
              "refId": "A"
            }
          ]
        },
        {
          "id": 2,
          "title": "Response Time (P95)",
          "type": "graph",
          "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
          "targets": [
            {
              "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
              "legendFormat": "{{service}}",
              "refId": "A"
            }
          ]
        },
        {
          "id": 3,
          "title": "Error Rate",
          "type": "graph",
          "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
          "targets": [
            {
              "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) * 100",
              "legendFormat": "{{service}}",
              "refId": "A"
            }
          ]
        },
        {
          "id": 4,
          "title": "Active Connections",
          "type": "stat",
          "gridPos": {"h": 4, "w": 4, "x": 0, "y": 16},
          "targets": [
            {"expr": "sum(app_active_connections)", "refId": "A"}
          ]
        },
        {
          "id": 5,
          "title": "Success Rate",
          "type": "gauge",
          "gridPos": {"h": 4, "w": 4, "x": 4, "y": 16},
          "targets": [
            {
              "expr": "sum(rate(http_requests_total{status=~\"2..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100",
              "refId": "A"
            }
          ],
          "fieldConfig": {
            "defaults": {
              "unit": "percent",
              "min": 0,
              "max": 100,
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {"color": "red", "value": null},
                  {"color": "yellow", "value": 95},
                  {"color": "green", "value": 99}
                ]
              }
            }
          },
          "options": {
            "showThresholdMarkers": true
          }
        }
      ]
    }
Alerting Configuration
Grafana Alert Rules
1. Create alert rules
yaml
apiVersion: 1
groups:
  - name: kubernetes-alerts
    # folder and interval are required for provisioned rule groups;
    # the values used here are placeholders.
    folder: Kubernetes
    interval: 1m
    rules:
      - uid: alert-1
        title: High CPU Usage
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 600
              to: 0
            # Must match the uid of the Prometheus data source.
            datasourceUid: prometheus
            model:
              expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
              instant: true
              intervalMs: 1000
              maxDataPoints: 43200
              refId: A
          - refId: B
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: __expr__
            model:
              type: reduce
              expression: A
              reducer: last
              refId: B
          - refId: C
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: __expr__
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    params:
                      - 80
                    type: gt
                  operator:
                    type: and
                  query:
                    params:
                      - C
                  type: query
              refId: C
        noDataState: NoData
        execErrState: Error
        for: 5m
        annotations:
          description: "CPU usage is above 80%"
          summary: "High CPU usage detected"
        labels:
          severity: warning
2. Configure notification channels
yaml
# Legacy notification-channel format (provisioning/notifiers). Grafana Alerting
# provisions contact points under a contactPoints key instead.
apiVersion: 1
notifiers:
  - name: Slack
    type: slack
    uid: slack-1
    is_default: true
    settings:
      url: https://hooks.slack.com/services/TXXXXXXXX/BXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXX
      recipient: "#alerts"
      username: Grafana
      icon_emoji: ":grafana:"
  - name: Email
    type: email
    uid: email-1
    settings:
      addresses: admin@example.com,ops@example.com
  - name: PagerDuty
    type: pagerduty
    uid: pagerduty-1
    settings:
      integrationKey: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
      severity: critical
Alert notification templates
yaml
apiVersion: 1
templates:
  - name: default
    template: |
      {{ define "default.title" }}
      [{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}
      {{ end }}
      {{ define "default.message" }}
      Alert: {{ .CommonLabels.alertname }}
      Status: {{ .Status }}
      Severity: {{ .CommonLabels.severity }}
      {{ range .Alerts }}
      Summary: {{ .Annotations.summary }}
      Description: {{ .Annotations.description }}
      {{ end }}
      {{ end }}
kubectl Commands
Managing Grafana resources
bash
kubectl get all -n monitoring -l app=grafana
kubectl get pods -n monitoring -l app=grafana
kubectl logs -n monitoring -l app=grafana -f
kubectl describe pod -n monitoring -l app=grafana
kubectl exec -it -n monitoring <grafana-pod> -- sh
kubectl port-forward -n monitoring svc/grafana 3000:3000
Configuration management
bash
kubectl get configmap -n monitoring
kubectl describe configmap grafana-config -n monitoring
kubectl describe configmap grafana-datasources -n monitoring
kubectl edit configmap grafana-config -n monitoring
kubectl apply -f grafana-config.yaml
kubectl rollout restart deployment/grafana -n monitoring
Data persistence
bash
kubectl get pvc -n monitoring
kubectl describe pvc grafana-pvc -n monitoring
kubectl get pv | grep grafana
Secret management
bash
kubectl get secrets -n monitoring | grep grafana
kubectl describe secret grafana-credentials -n monitoring
kubectl create secret generic grafana-credentials \
--from-literal=admin-user=admin \
--from-literal=admin-password=newpassword \
-n monitoring --dry-run=client -o yaml | kubectl apply -f -
Troubleshooting Guide
Problem 1: Grafana fails to start
Symptoms
bash
kubectl get pods -n monitoring -l app=grafana
NAME          READY   STATUS             RESTARTS   AGE
grafana-xxx   0/1     CrashLoopBackOff   5          10m
Troubleshooting steps
bash
kubectl logs -n monitoring -l app=grafana
kubectl describe pod -n monitoring -l app=grafana
kubectl get events -n monitoring --sort-by='.lastTimestamp'
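kubectl exec -n monitoring <grafana-pod> -- ls -la /var/lib/grafana
Solution
- Check the configuration file syntax
- Verify the permissions on the storage volume
- Check the resource limits
- Review the environment variable configuration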
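The first item, checking the configuration file syntax, can be approximated offline before the ConfigMap is applied. The sketch below uses Python's standard configparser as a stand-in for Grafana's own INI parser, so it is an approximation: `check_grafana_ini` is a hypothetical helper, and the strict mode used here may reject duplicates that Grafana would tolerate.

```python
import configparser
from typing import List


def check_grafana_ini(text: str) -> List[str]:
    """Return syntax problems found in grafana.ini content; empty means it parsed."""
    parser = configparser.ConfigParser(interpolation=None, strict=True)
    try:
        # grafana.ini may start with keys outside any section (for example
        # app_mode), so give them a section to live in. Line numbers in error
        # messages are therefore off by one.
        parser.read_string("[DEFAULT]\n" + text)
    except configparser.Error as exc:
        return [str(exc).replace("\n", " ")]
    return []


good = """
[server]
root_url = http://localhost:3000
[database]
type = sqlite3
"""

# The section header is missing its closing bracket.
bad = """
[server
root_url = http://localhost:3000
"""

print(check_grafana_ini(good))
print(check_grafana_ini(bad))
```

A check like this only catches malformed syntax; unknown section or option names still need to be compared against the Grafana configuration reference.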
Problem 2: Cannot connect to a data source
Symptoms
- Dashboards show "No data"
- The data source test fails
Troubleshooting steps
bash
kubectl get svc -n monitoring
kubectl exec -n monitoring <grafana-pod> -- nslookup prometheus
kubectl exec -n monitoring <grafana-pod> -- wget -qO- http://prometheus:9090/-/healthy
kubectl logs -n monitoring <grafana-pod> | grep -i datasource
kubectl describe configmap grafana-datasources -n monitoring
Solution
- Check the data source URL
- Verify the network policies
- Confirm the Service name is correct
- Check DNS resolution
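The checks above separate into two questions: does the host name resolve, and does the port accept connections. A minimal sketch of that distinction follows, using only the Python standard library; `check_endpoint` and its return labels are assumptions made for this example, and the script would need to run somewhere with the same network view as the Grafana pod to be meaningful.

```python
import socket
from urllib.parse import urlparse


def check_endpoint(url: str, timeout: float = 3.0) -> str:
    """Classify a data source URL as ok, invalid-url, dns-failure, or connection-failure."""
    parsed = urlparse(url)
    host = parsed.hostname
    if not host:
        return "invalid-url"
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        socket.getaddrinfo(host, port)  # DNS resolution only
    except socket.gaierror:
        return "dns-failure"
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError:
        return "connection-failure"
```

For example, `check_endpoint("http://prometheus:9090")` returning dns-failure points at the Service name or namespace, while connection-failure points at the port, the pod behind the Service, or a network policy.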
Problem 3: Dashboards fail to load
Symptoms
- The dashboard list is empty
- Dashboards display errors
Troubleshooting steps
bash
kubectl get configmap -n monitoring -l grafana_dashboard
kubectl describe configmap k8s-cluster-dashboard -n monitoring
kubectl exec -n monitoring <grafana-pod> -- ls -la /etc/grafana/provisioning/dashboards
kubectl logs -n monitoring <grafana-pod> | grep -i dashboard
Solution
- Check the dashboard JSON format
- Verify the ConfigMap labels
- Confirm the mount path is correct
- Check the data source references
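Checking the dashboard JSON format can be partly automated. The sketch below is a hypothetical pre-flight check, not a Grafana tool: it confirms the file parses, has a title, and uses unique panel ids. It also flags a model wrapped in a "dashboard" key, on the assumption that file provisioning expects the bare dashboard model while the wrapped form belongs to the HTTP API payload.

```python
import json
from typing import List


def check_dashboard(text: str) -> List[str]:
    """Return problems found in a dashboard JSON document meant for file provisioning."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"]
    if not isinstance(doc, dict):
        return ["top level must be a JSON object"]
    problems = []
    if isinstance(doc.get("dashboard"), dict) and "title" not in doc:
        problems.append('model is wrapped in a "dashboard" key')
        doc = doc["dashboard"]  # keep checking the inner model
    if not doc.get("title"):
        problems.append("missing title")
    ids = [panel.get("id") for panel in doc.get("panels", [])]
    duplicates = sorted({i for i in ids if i is not None and ids.count(i) > 1})
    if duplicates:
        problems.append(f"duplicate panel ids: {duplicates}")
    return problems


wrapped = '{"dashboard": {"title": "Pods", "panels": [{"id": 1}, {"id": 1}]}}'
for problem in check_dashboard(wrapped):
    print(problem)
```

Running the same function over each JSON value in a dashboard ConfigMap before applying it catches the most common causes of an empty dashboard list.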
Problem 4: Login fails
Symptoms
- Cannot log in to Grafana
- An invalid username or password error is shown
Troubleshooting steps
bash
kubectl get secret -n monitoring grafana-credentials
kubectl describe secret grafana-credentials -n monitoring
kubectl logs -n monitoring <grafana-pod> | grep -i auth
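kubectl exec -n monitoring <grafana-pod> -- cat /etc/grafana/grafana.ini | grep -A 10 security
Solution
bash
kubectl delete secret grafana-credentials -n monitoring
kubectl create secret generic grafana-credentials \
--from-literal=admin-user=admin \
--from-literal=admin-password=admin123 \
-n monitoring
kubectl rollout restart deployment/grafana -n monitoring
# The admin password setting only takes effect when the database is first initialized.
# If the admin user already exists, reset the password in place instead:
kubectl exec -n monitoring <grafana-pod> -- grafana-cli admin reset-admin-password <new-password>
Problem 5: Alerts are not sent
Symptoms
- An alert fires but no notification arrives
- The notification channel is misconfigured
Troubleshooting steps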
bash
kubectl exec -n monitoring <grafana-pod> -- cat /etc/grafana/provisioning/alerting/*.yaml
kubectl logs -n monitoring <grafana-pod> | grep -i alert
kubectl exec -n monitoring <grafana-pod> -- wget -qO- http://localhost:3000/api/alert-notifications
Solution
- Check the notification channel configuration
- Verify the webhook URL
- Confirm the alert rules are correct
- Check network connectivity
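One way to verify a webhook URL independently of Grafana is to send it a test message by hand. The sketch below only builds the request so that it can be inspected first; sending is left as a commented-out line. `build_test_notification` is a hypothetical helper, the URL is the placeholder from the notification channel example, and the {"text": ...} body is the simple JSON form accepted by Slack incoming webhooks.

```python
import json
import urllib.request


def build_test_notification(webhook_url: str, text: str) -> urllib.request.Request:
    """Build, but do not send, a JSON POST request for an incoming-webhook endpoint."""
    if not webhook_url.startswith("https://"):
        raise ValueError("webhook URL should use https")
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_test_notification(
    "https://hooks.slack.com/services/TXXXXXXXX/BXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXX",
    "Test notification from the Grafana troubleshooting guide",
)
print(request.get_method(), request.full_url)
# To send it for real, from a host with outbound access:
# urllib.request.urlopen(request, timeout=5)
```

If the hand-sent message arrives but Grafana's notifications do not, the problem lies in the alert rule or channel configuration rather than in the webhook or the network path.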
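Best Practices
1. Dashboard design
Layout
Top: key metric overview (Stat panels)
Middle: detailed charts (Graph panels)
Bottom: detailed data tables (Table panels)
Color usage
yaml
Color scheme:
  - Normal: green
  - Warning: yellow
  - Critical: red
Thresholds:
  - CPU: 70% yellow, 90% red
  - Memory: 75% yellow, 90% red
  - Disk: 80% yellow, 95% red
Variable usage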
json
{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "refresh": 1,
        "multi": true,
        "includeAll": true
      }
    ]
  }
}
2. Data source configuration
Managing multiple data sources
yaml
datasources:
  - name: Prometheus-Production
    type: prometheus
    url: http://prometheus-prod:9090
  - name: Prometheus-Staging
    type: prometheus
    url: http://prometheus-staging:9090
Performance tuning
yaml
jsonData:
  timeInterval: "15s"
  httpMethod: POST
  cacheLevel: 'High'
  incrementalQuerying: true
  incrementalQueryOverlapWindow: 10m
3. Security
User permissions
ini
[auth]
disable_login_form = false
[auth.anonymous]
enabled = false
[users]
allow_sign_up = false
auto_assign_org_role = Viewer
Secret management
bash
kubectl create secret generic grafana-credentials \
--from-literal=admin-user=admin \
--from-literal=admin-password="$(openssl rand -base64 32)" \
-n monitoring
Network policy
yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: grafana-network-policy
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: grafana
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 3000
  egress:
    - to:
        - namespaceSelector: {}
4. Performance
Resources
yaml
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
Caching
ini
[caching]
enabled = true
ttl = 5m
[remote_cache]
type = redis
connstr = addr=redis:6379,pool_size=100,db=0
Dashboard optimization
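yaml
Best practices:
  - Limit the number of panels (no more than 30 per dashboard)
  - Use variables to reduce duplicated queries
  - Set a reasonable refresh interval
  - Avoid high-cardinality labels
5. High availability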
Multiple replicas
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
spec:
  replicas: 2
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana  # must match spec.selector
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: grafana
                topologyKey: kubernetes.io/hostname
Database configuration
ini
[database]
type = postgres
host = postgres:5432
name = grafana
user = grafana
password = ${GF_DATABASE_PASSWORD}
ssl_mode = disable
Dashboard Template Library
Kubernetes Cluster Dashboard
text
Import ID: 315
Name: Kubernetes cluster monitoring (via Prometheus)
Node Exporter Dashboard
text
Import ID: 1860
Name: Node Exporter Full
Nginx Ingress Dashboard
text
Import ID: 9614
Name: Nginx Ingress Controller
Importing a Custom Dashboard
text
How to import:
1. Grafana UI -> Dashboards -> Import
2. Enter a dashboard ID or upload a JSON file
3. Select the data source
4. Click Import
Summary
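This chapter covered the core concepts and practical use of Grafana as a visualization platform:
- Architecture: Grafana's core components and concepts
- Deployment and configuration: several deployment methods and how to manage configuration
- Data sources: how to configure several kinds of data sources
- Dashboard design: techniques for designing dashboards and panels
- Alerting: configuring Grafana alert rules and notification channels
- Practical application: complete visualization setups worked through in concrete examples
- Troubleshooting: diagnosing and resolving common problems
Grafana is one of the most widely used visualization tools in Kubernetes monitoring stacks and gives operators a direct view of cluster and application health.
Next Steps
- Log management: collecting and analyzing Kubernetes logs
- Alert management: configuring the Alertmanager alerting system
- Prometheus monitoring: a deeper look at the Prometheus monitoring system
- Metrics basics: a review of the fundamentals of Kubernetes monitoring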