Alert Management

Overview

Alert management is a key part of operating Kubernetes: an effective alerting pipeline surfaces anomalies early so they can be handled before they affect the business. This chapter takes a close look at AlertManager configuration, alert rule design, and notification channel management, along with recommended practices for each.

AlertManager Architecture

Core Components

┌─────────────────────────────────────────────────────────┐
│                   Prometheus Server                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │ Alert Rules  │  │ Evaluation   │  │ Alert        │  │
│  │              │  │ Engine       │  │ Generation   │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
└─────────────────────────────────────────────────────────┘
         ↓                    ↓                    ↓
┌─────────────────────────────────────────────────────────┐
│                   AlertManager                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │ Deduplication│  │ Grouping     │  │ Routing      │  │
│  │              │  │              │  │              │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
└─────────────────────────────────────────────────────────┘
         ↓                    ↓                    ↓
┌─────────────┐    ┌──────────────┐    ┌──────────────────┐
│ Email       │    │ Slack        │    │ PagerDuty        │
│             │    │              │    │                  │
└─────────────┘    └──────────────┘    └──────────────────┘
         ↓                    ↓                    ↓
┌─────────────┐    ┌──────────────┐    ┌──────────────────┐
│ Webhook     │    │ WeChat       │    │ SMS              │
│             │    │              │    │                  │
└─────────────┘    └──────────────┘    └──────────────────┘

Core Concepts

1. Alert Rules

Alerting conditions defined in Prometheus. When a rule's expression holds for the configured duration, Prometheus fires an alert and forwards it to AlertManager.

2. Routing

Dispatches alerts to different receivers based on their labels.

3. Grouping

Merges related alerts into a single notification to cut down on alert noise.

4. Inhibition

Suppresses related alerts while a higher-priority alert is firing.

5. Silencing

Temporarily mutes specific alerts, for example during a maintenance window or for a known issue.
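
To make the mapping concrete, here is a minimal, hedged sketch of where each concept lives: alert rules are defined on the Prometheus side, routing, grouping, and inhibition are sections of alertmanager.yml, and silences are created at runtime through the UI, API, or amtool rather than in the configuration file. The receiver names and label values below are illustrative only.

yaml
# Minimal alertmanager.yml sketch (illustrative values)
route:                        # Routing: dispatch alerts to receivers by label
  receiver: 'default-receiver'
  group_by: ['alertname']     # Grouping: one notification per alert name
  routes:
  - match:
      severity: critical
    receiver: 'critical-receiver'

receivers:                    # Notification configs omitted in this sketch
- name: 'default-receiver'
- name: 'critical-receiver'

inhibit_rules:                # Inhibition: a firing critical alert mutes matching warnings
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname']
# Silencing has no config section; silences are created at runtime (UI, API, or amtool).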

Deploying AlertManager

Option 1: Configure with a ConfigMap

1. Create the AlertManager configuration

yaml
# alertmanager-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alertmanager@example.com'
      smtp_auth_username: 'alertmanager@example.com'
      smtp_auth_password: 'password123'
      slack_api_url: 'https://hooks.slack.com/services/TXXXXXXXX/BXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXX'
    
    templates:
    - '/etc/alertmanager/templates/*.tmpl'
    
    route:
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: 'default-receiver'
      routes:
      - match:
          severity: critical
        receiver: 'critical-receiver'
        continue: false
      - match:
          severity: warning
        receiver: 'warning-receiver'
        continue: false
      - match_re:
          namespace: ^(production|staging)$
        receiver: 'prod-receiver'
        continue: false
    
    receivers:
    - name: 'default-receiver'
      email_configs:
      - to: 'ops-team@example.com'
        send_resolved: true
    
    - name: 'critical-receiver'
      email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
      slack_configs:
      - channel: '#critical-alerts'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *Alert:* {{ .Labels.alertname }}
          *Severity:* {{ .Labels.severity }}
          *Summary:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Details:*
          {{ range .Labels.SortedPairs }} • *{{ .Name }}:* {{ .Value }}
          {{ end }}
          {{ end }}
      pagerduty_configs:
      - service_key: '<pagerduty-service-key>'
        severity: critical
    
    - name: 'warning-receiver'
      email_configs:
      - to: 'ops-team@example.com'
        send_resolved: true
      slack_configs:
      - channel: '#warnings'
        send_resolved: true
    
    - name: 'prod-receiver'
      email_configs:
      - to: 'prod-team@example.com'
        send_resolved: true
      slack_configs:
      - channel: '#production-alerts'
        send_resolved: true
    
    inhibit_rules:
    - source_match:
        severity: 'critical'
      target_match:
        severity: 'warning'
      equal: ['alertname', 'namespace']
    
    - source_match:
        alertname: 'NodeDown'
      target_match_re:
        alertname: '.*'
      equal: ['node']

2. Create the alert templates

yaml
# alertmanager-templates.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-templates
  namespace: monitoring
data:
  default.tmpl: |
    {{ define "slack.title" }}
    [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
    {{ end }}
    
    {{ define "slack.text" }}
    {{ range .Alerts }}
    *Alert:* {{ .Labels.alertname }}
    *Status:* {{ .Status }}
    *Severity:* {{ .Labels.severity }}
    *Started:* {{ .StartsAt }}
    {{ if eq .Status "resolved" }}
    *Ended:* {{ .EndsAt }}
    {{ end }}
    *Summary:* {{ .Annotations.summary }}
    *Description:* {{ .Annotations.description }}
    {{ end }}
    {{ end }}
    
    {{ define "email.subject" }}
    [{{ .Status | toUpper }}] {{ .CommonLabels.alertname }} - {{ .CommonLabels.severity }}
    {{ end }}
    
    {{ define "email.body" }}
    {{ range .Alerts }}
    Alert: {{ .Labels.alertname }}
    Status: {{ .Status }}
    Severity: {{ .Labels.severity }}
    Started: {{ .StartsAt }}
    {{ if eq .Status "resolved" }}
    Ended: {{ .EndsAt }}
    {{ end }}
    
    Summary: {{ .Annotations.summary }}
    Description: {{ .Annotations.description }}
    
    Labels:
    {{ range .Labels.SortedPairs }}
      {{ .Name }}: {{ .Value }}
    {{ end }}
    
    Annotations:
    {{ range .Annotations.SortedPairs }}
      {{ .Name }}: {{ .Value }}
    {{ end }}
    ---
    {{ end }}
    {{ end }}

3. Deploy AlertManager

yaml
# alertmanager-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: prom/alertmanager:v0.25.0
        args:
        - "--config.file=/etc/alertmanager/alertmanager.yml"
        - "--storage.path=/alertmanager"
        - "--data.retention=120h"
        - "--web.external-url=http://alertmanager:9093"
        - "--web.route-prefix=/"
        - "--cluster.listen-address="
        ports:
        - containerPort: 9093
          name: web
        volumeMounts:
        - name: config
          mountPath: /etc/alertmanager
        - name: templates
          mountPath: /etc/alertmanager/templates
        - name: storage
          mountPath: /alertmanager
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: web
          initialDelaySeconds: 30
          timeoutSeconds: 10
        readinessProbe:
          httpGet:
            path: /-/ready
            port: web
          initialDelaySeconds: 5
          timeoutSeconds: 10
      volumes:
      - name: config
        configMap:
          name: alertmanager-config
      - name: templates
        configMap:
          name: alertmanager-templates
      - name: storage
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  type: NodePort
  ports:
  - port: 9093
    targetPort: web
    nodePort: 30093
    name: web
  selector:
    app: alertmanager

Deployment commands

bash
kubectl apply -f alertmanager-config.yaml
kubectl apply -f alertmanager-templates.yaml
kubectl apply -f alertmanager-deployment.yaml

kubectl get pods -n monitoring -l app=alertmanager
kubectl get svc -n monitoring alertmanager

kubectl port-forward -n monitoring svc/alertmanager 9093:9093

Option 2: Deploy with Helm

bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install alertmanager prometheus-community/alertmanager \
  --namespace monitoring \
  --set config.global.resolve_timeout=5m \
  --set config.global.smtp_smarthost='smtp.example.com:587' \
  --set persistence.enabled=true \
  --set persistence.size=1Gi \
  --set service.type=NodePort \
  --set service.nodePort=30093
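
Once the overrides go beyond a couple of values, a values file is easier to maintain than a long chain of --set flags. The sketch below is one way to lay it out; the receiver and SMTP values are placeholders, and the key layout follows the same config, persistence, and service values used in the --set flags above.

yaml
# values.yaml (sketch; adjust receivers and SMTP settings for your environment)
config:
  global:
    resolve_timeout: 5m
    smtp_smarthost: 'smtp.example.com:587'
    smtp_from: 'alertmanager@example.com'
  route:
    receiver: default-receiver
    group_by: ['alertname', 'namespace']
  receivers:
  - name: default-receiver
    email_configs:
    - to: 'ops-team@example.com'
persistence:
  enabled: true
  size: 1Gi
service:
  type: NodePort
  nodePort: 30093

bash
helm upgrade --install alertmanager prometheus-community/alertmanager \
  --namespace monitoring \
  -f values.yaml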

Alert Rule Configuration

Node alert rules

yaml
# node-alerts.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-alert-rules
  namespace: monitoring
data:
  node-alerts.rules: |
    groups:
    - name: node-alerts
      interval: 30s
      rules:
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 5m
        labels:
          severity: critical
          category: node
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "Node {{ $labels.instance }} has been unreachable for more than 5 minutes"
          runbook_url: "https://wiki.example.com/runbook/node-down"
      
      - alert: NodeHighCPU
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          category: node
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.2f\" }}% on node {{ $labels.instance }}"
          runbook_url: "https://wiki.example.com/runbook/high-cpu"
      
      - alert: NodeHighMemory
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
          category: node
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | printf \"%.2f\" }}% on node {{ $labels.instance }}"
          runbook_url: "https://wiki.example.com/runbook/high-memory"
      
      - alert: NodeDiskRunningLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 5m
        labels:
          severity: critical
          category: node
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Only {{ $value | printf \"%.2f\" }}% disk space available on node {{ $labels.instance }}"
          runbook_url: "https://wiki.example.com/runbook/disk-space"
      
      - alert: NodeDiskIORate
        expr: rate(node_disk_io_time_seconds_total[5m]) > 0.8
        for: 10m
        labels:
          severity: warning
          category: node
        annotations:
          summary: "High disk I/O on {{ $labels.instance }}"
          description: "Disk I/O time is {{ $value | printf \"%.2f\" }}s on node {{ $labels.instance }}"

Pod alert rules

yaml
# pod-alerts.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-alert-rules
  namespace: monitoring
data:
  pod-alerts.rules: |
    groups:
    - name: pod-alerts
      interval: 30s
      rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 0
        for: 5m
        labels:
          severity: warning
          category: pod
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
          description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has restarted {{ $value | printf \"%.0f\" }} times in the last 15 minutes"
          runbook_url: "https://wiki.example.com/runbook/crash-loop"
      
      - alert: PodNotReady
        expr: kube_pod_status_phase{phase=~"Pending|Unknown"} == 1
        for: 10m
        labels:
          severity: warning
          category: pod
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is not ready"
          description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in {{ $labels.phase }} state for more than 10 minutes"
          runbook_url: "https://wiki.example.com/runbook/pod-not-ready"
      
      - alert: PodHighCPU
        expr: |
          sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace)
          /
          sum(kube_pod_container_resource_limits{resource="cpu"}) by (pod, namespace)
          * 100 > 80
        for: 5m
        labels:
          severity: warning
          category: pod
        annotations:
          summary: "High CPU usage for pod {{ $labels.namespace }}/{{ $labels.pod }}"
          description: "CPU usage is {{ $value | printf \"%.2f\" }}% for pod {{ $labels.namespace }}/{{ $labels.pod }}"
          runbook_url: "https://wiki.example.com/runbook/high-cpu-pod"
      
      - alert: PodHighMemory
        expr: |
          sum(container_memory_working_set_bytes{container!=""}) by (pod, namespace)
          /
          sum(kube_pod_container_resource_limits{resource="memory"}) by (pod, namespace)
          * 100 > 80
        for: 5m
        labels:
          severity: warning
          category: pod
        annotations:
          summary: "High memory usage for pod {{ $labels.namespace }}/{{ $labels.pod }}"
          description: "Memory usage is {{ $value | printf \"%.2f\" }}% for pod {{ $labels.namespace }}/{{ $labels.pod }}"
          runbook_url: "https://wiki.example.com/runbook/high-memory-pod"
      
      - alert: ContainerOomKilled
        expr: increase(container_oom_events_total[5m]) > 0
        for: 1m
        labels:
          severity: critical
          category: pod
        annotations:
          summary: "Container OOM killed in {{ $labels.namespace }}/{{ $labels.pod }}"
          description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} was OOM killed"
          runbook_url: "https://wiki.example.com/runbook/oom-killed"

Application alert rules

yaml
# app-alerts.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-alert-rules
  namespace: monitoring
data:
  app-alerts.rules: |
    groups:
    - name: application-alerts
      interval: 30s
      rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
          * 100 > 5
        for: 5m
        labels:
          severity: critical
          category: application
        annotations:
          summary: "High error rate for service {{ $labels.service }}"
          description: "Error rate is {{ $value | printf \"%.2f\" }}% for service {{ $labels.service }}"
          runbook_url: "https://wiki.example.com/runbook/high-error-rate"
      
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 1
        for: 5m
        labels:
          severity: warning
          category: application
        annotations:
          summary: "High latency for service {{ $labels.service }}"
          description: "P95 latency is {{ $value | printf \"%.2f\" }}s for service {{ $labels.service }}"
          runbook_url: "https://wiki.example.com/runbook/high-latency"
      
      - alert: LowThroughput
        expr: sum(rate(http_requests_total[5m])) by (service) < 10
        for: 5m
        labels:
          severity: warning
          category: application
        annotations:
          summary: "Low throughput for service {{ $labels.service }}"
          description: "Throughput is {{ $value | printf \"%.2f\" }} requests/s for service {{ $labels.service }}"
          runbook_url: "https://wiki.example.com/runbook/low-throughput"
      
      - alert: ServiceDown
        expr: up{job="app"} == 0
        for: 5m
        labels:
          severity: critical
          category: application
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "Service {{ $labels.instance }} has been down for more than 5 minutes"
          runbook_url: "https://wiki.example.com/runbook/service-down"

Notification Channel Configuration

Email configuration

yaml
receivers:
- name: 'email-receiver'
  email_configs:
  - to: 'ops-team@example.com'
    from: 'alertmanager@example.com'
    smarthost: 'smtp.example.com:587'
    auth_username: 'alertmanager@example.com'
    auth_password: 'password123'
    auth_identity: 'alertmanager@example.com'
    auth_secret: 'password123'
    send_resolved: true
    headers:
      Subject: '{{ template "email.subject" . }}'
    html: '{{ template "email.body" . }}'

Slack configuration

yaml
receivers:
- name: 'slack-receiver'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/TXXXXXXXX/BXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXX'
    channel: '#alerts'
    send_resolved: true
    title: '{{ template "slack.title" . }}'
    text: '{{ template "slack.text" . }}'
    color: '{{ if eq .Status "firing" }}{{ if eq .CommonLabels.severity "critical" }}danger{{ else if eq .CommonLabels.severity "warning" }}warning{{ else }}good{{ end }}{{ else }}good{{ end }}'
    actions:
    - type: button
      text: 'Runbook'
      url: '{{ (index .Alerts 0).Annotations.runbook_url }}'
    - type: button
      text: 'Query'
      url: 'https://prometheus.example.com/graph?g0.expr={{ (index .Alerts 0).Labels.alertname }}'
    - type: button
      text: 'Dashboard'
      url: 'https://grafana.example.com/d/{{ .CommonLabels.dashboard_id }}'

PagerDuty configuration

yaml
receivers:
- name: 'pagerduty-receiver'
  pagerduty_configs:
  - service_key: '<pagerduty-service-key>'
    severity: '{{ .CommonLabels.severity }}'
    class: '{{ .CommonLabels.category }}'
    component: '{{ .CommonLabels.service }}'
    group: '{{ .CommonLabels.namespace }}'
    details:
      firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
      resolved: '{{ template "pagerduty.default.instances" .Alerts.Resolved }}'
      num_firing: '{{ .Alerts.Firing | len }}'
      num_resolved: '{{ .Alerts.Resolved | len }}'

Webhook configuration

yaml
receivers:
- name: 'webhook-receiver'
  webhook_configs:
  - url: 'http://webhook-service:5000/alert'
    send_resolved: true
    http_config:
      basic_auth:
        username: webhook_user
        password: webhook_password
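
AlertManager delivers webhook notifications as an HTTP POST with a JSON body, so the service behind the url above needs to parse this structure. A trimmed example of the payload is shown below; the field names follow the documented webhook format, while the label values are illustrative.

json
{
  "version": "4",
  "status": "firing",
  "receiver": "webhook-receiver",
  "groupLabels": { "alertname": "NodeHighCPU" },
  "commonLabels": { "alertname": "NodeHighCPU", "severity": "warning" },
  "commonAnnotations": { "summary": "High CPU usage on node-1" },
  "externalURL": "http://alertmanager:9093",
  "alerts": [
    {
      "status": "firing",
      "labels": { "alertname": "NodeHighCPU", "instance": "node-1", "severity": "warning" },
      "annotations": { "summary": "High CPU usage on node-1" },
      "startsAt": "2024-01-15T10:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://prometheus.example.com/graph"
    }
  ]
}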

WeChat configuration

yaml
receivers:
- name: 'wechat-receiver'
  wechat_configs:
  - corp_id: '<corp-id>'
    to_party: '<party-id>'
    agent_id: '<agent-id>'
    api_secret: '<api-secret>'
    send_resolved: true
    message: '{{ template "wechat.message" . }}'
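
The message field above references a wechat.message template that the earlier alertmanager-templates ConfigMap does not define. A minimal definition could be added as another key in that ConfigMap, where it is picked up by the /etc/alertmanager/templates/*.tmpl glob; the wording below is only a sketch built from the same template data used by the Slack and email templates.

yaml
# Additional key for the alertmanager-templates ConfigMap (sketch)
data:
  wechat.tmpl: |
    {{ define "wechat.message" }}
    [{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}
    {{ range .Alerts }}
    Severity: {{ .Labels.severity }}
    Summary: {{ .Annotations.summary }}
    Description: {{ .Annotations.description }}
    {{ end }}
    {{ end }}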

Practical Examples

Example 1: Complete alerting configuration

yaml
# complete-alerting.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: complete-alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alertmanager@example.com'
      smtp_auth_username: 'alertmanager@example.com'
      smtp_auth_password: 'password123'
      slack_api_url: 'https://hooks.slack.com/services/TXXXXXXXX/BXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXX'
    
    route:
      group_by: ['alertname', 'namespace', 'severity']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: 'default-receiver'
      routes:
      - match:
          severity: critical
        receiver: 'critical-receiver'
        group_wait: 10s
        repeat_interval: 1h
      - match:
          severity: warning
        receiver: 'warning-receiver'
        repeat_interval: 3h
      - match:
          namespace: production
        receiver: 'prod-receiver'
    
    receivers:
    - name: 'default-receiver'
      email_configs:
      - to: 'ops-team@example.com'
        send_resolved: true
    
    - name: 'critical-receiver'
      email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
      slack_configs:
      - channel: '#critical-alerts'
        send_resolved: true
      pagerduty_configs:
      - service_key: '<pagerduty-service-key>'
        severity: critical
    
    - name: 'warning-receiver'
      email_configs:
      - to: 'ops-team@example.com'
        send_resolved: true
      slack_configs:
      - channel: '#warnings'
        send_resolved: true
    
    - name: 'prod-receiver'
      email_configs:
      - to: 'prod-team@example.com'
        send_resolved: true
      slack_configs:
      - channel: '#production-alerts'
        send_resolved: true
    
    inhibit_rules:
    - source_match:
        severity: 'critical'
      target_match:
        severity: 'warning'
      equal: ['alertname', 'namespace']
    
    - source_match:
        alertname: 'NodeDown'
      target_match_re:
        alertname: '.*'
      equal: ['node']

Example 2: Managing silences

bash
# Create a silence
curl -X POST http://alertmanager:9093/api/v1/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {
        "name": "alertname",
        "value": "NodeHighCPU",
        "isRegex": false
      },
      {
        "name": "node",
        "value": "node-1",
        "isRegex": false
      }
    ],
    "startsAt": "2024-01-15T10:00:00Z",
    "endsAt": "2024-01-15T12:00:00Z",
    "createdBy": "admin@example.com",
    "comment": "Planned maintenance on node-1"
  }'

# List silences
curl http://alertmanager:9093/api/v1/silences

# Delete a silence
curl -X DELETE http://alertmanager:9093/api/v1/silence/<silence-id>
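
The same silences can be managed with amtool, which ships in the prom/alertmanager image; the sketch below mirrors the curl example above (label values, duration, and comment are the same placeholders).

bash
# Create a 2-hour silence for NodeHighCPU on node-1
amtool silence add alertname=NodeHighCPU node=node-1 \
  --comment="Planned maintenance on node-1" \
  --duration=2h \
  --alertmanager.url=http://alertmanager:9093

# List and expire silences
amtool silence query --alertmanager.url=http://alertmanager:9093
amtool silence expire <silence-id> --alertmanager.url=http://alertmanager:9093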

Example 3: Alert routing configuration

yaml
# advanced-routing.yaml
route:
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'
  routes:
  # Critical alerts: notify immediately
  - match:
      severity: critical
    receiver: 'critical-receiver'
    group_wait: 10s
    repeat_interval: 1h
    continue: false
  
  # Production environment alerts
  - match:
      namespace: production
    receiver: 'prod-receiver'
    routes:
    - match:
        severity: critical
      receiver: 'prod-critical-receiver'
  
  # Development environment alerts
  - match:
      namespace: development
    receiver: 'dev-receiver'
    repeat_interval: 12h
  
  # Service-specific alerts
  - match_re:
      service: ^(api|web|backend)$
    receiver: 'service-receiver'
    routes:
    - match:
        service: api
      receiver: 'api-team-receiver'
    - match:
        service: web
      receiver: 'web-team-receiver'
    - match:
        service: backend
      receiver: 'backend-team-receiver'
  
  # Node alerts
  - match:
      category: node
    receiver: 'infra-team-receiver'
  
  # Pod alerts
  - match:
      category: pod
    receiver: 'app-team-receiver'

kubectl Commands

AlertManager resource management

bash
kubectl get all -n monitoring -l app=alertmanager

kubectl get pods -n monitoring -l app=alertmanager

kubectl logs -n monitoring -l app=alertmanager -f

kubectl describe pod -n monitoring -l app=alertmanager

kubectl exec -it -n monitoring <alertmanager-pod> -- sh

kubectl port-forward -n monitoring svc/alertmanager 9093:9093

Configuration management

bash
kubectl get configmap -n monitoring

kubectl describe configmap alertmanager-config -n monitoring

kubectl edit configmap alertmanager-config -n monitoring

kubectl apply -f alertmanager-config.yaml

kubectl rollout restart deployment/alertmanager -n monitoring

Alert operations

bash
# View current alerts
kubectl exec -n monitoring <alertmanager-pod> -- wget -qO- http://localhost:9093/api/v1/alerts

# View silences
kubectl exec -n monitoring <alertmanager-pod> -- wget -qO- http://localhost:9093/api/v1/silences

# View AlertManager status
kubectl exec -n monitoring <alertmanager-pod> -- wget -qO- http://localhost:9093/api/v1/status

# Test alert routing with amtool (reads the mounted configuration file)
kubectl exec -n monitoring <alertmanager-pod> -- amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml alertname=TestAlert severity=warning

Viewing alerts in Prometheus

bash
# View alert rules
kubectl exec -n monitoring <prometheus-pod> -- wget -qO- http://localhost:9090/api/v1/rules

# View current alerts
kubectl exec -n monitoring <prometheus-pod> -- wget -qO- http://localhost:9090/api/v1/alerts

# View the AlertManager targets Prometheus is sending alerts to
kubectl exec -n monitoring <prometheus-pod> -- wget -qO- http://localhost:9090/api/v1/alertmanagers

Troubleshooting Guide

Issue 1: Alerts not firing

Symptoms

  • Alert rules exist but never fire
  • No alerts appear in Prometheus

Diagnostic steps

bash
kubectl exec -n monitoring <prometheus-pod> -- wget -qO- http://localhost:9090/api/v1/rules

kubectl exec -n monitoring <prometheus-pod> -- wget -qO- 'http://localhost:9090/api/v1/query?query=<alert-expression>'

kubectl logs -n monitoring <prometheus-pod> | grep -i alert

kubectl describe configmap prometheus-rules -n monitoring

Resolution

  • Check the alert rule syntax (see the promtool sketch below)
  • Verify the PromQL expression actually returns data
  • Confirm the underlying metrics exist
  • Check the for: duration on the rule
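
Rule syntax can be checked offline with promtool, which is included in the prom/prometheus image. The sketch below assumes the rules have been extracted from the ConfigMap into a plain node-alerts.rules file; promtool validates the rule file itself, not the wrapping ConfigMap.

bash
# Validate rule syntax before applying the ConfigMap
promtool check rules node-alerts.rules

# Evaluate the alert expression directly to confirm the metric exists
promtool query instant http://localhost:9090 'up{job="node-exporter"} == 0'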

Issue 2: Alerts not delivered

Symptoms

  • Alerts fire but no notifications are received
  • AlertManager never receives the alerts

Diagnostic steps

bash
kubectl exec -n monitoring <alertmanager-pod> -- wget -qO- http://localhost:9093/api/v1/alerts

kubectl logs -n monitoring <alertmanager-pod> | grep -i error

kubectl exec -n monitoring <prometheus-pod> -- wget -qO- http://localhost:9090/api/v1/alertmanagers

kubectl describe configmap alertmanager-config -n monitoring

Resolution

  • Check the AlertManager configuration (see the amtool check-config sketch below)
  • Verify the connection between Prometheus and AlertManager
  • Confirm each notification channel is configured correctly
  • Check network policies between the pods
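
Before restarting AlertManager with a changed configuration, the file and its referenced templates can be validated with amtool; a sketch run inside the pod:

bash
# Validate the AlertManager configuration and referenced templates
kubectl exec -n monitoring <alertmanager-pod> -- \
  amtool check-config /etc/alertmanager/alertmanager.yml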

Issue 3: Too much alert noise

Symptoms

  • Large numbers of duplicate alerts
  • Too many low-value alerts

Diagnostic steps

bash
kubectl exec -n monitoring <alertmanager-pod> -- wget -qO- http://localhost:9093/api/v1/alerts

kubectl exec -n monitoring <alertmanager-pod> -- wget -qO- http://localhost:9093/api/v1/silences

kubectl describe configmap alertmanager-config -n monitoring

Resolution

  • Tune the grouping configuration (group_by, group_wait, group_interval)
  • Increase repeat_interval
  • Add inhibition rules
  • Add silences for known issues

Issue 4: Alerts routed incorrectly

Symptoms

  • Alerts delivered to the wrong receiver
  • Specific alerts not routed as expected

Diagnostic steps

bash
kubectl exec -n monitoring <alertmanager-pod> -- amtool config routes show --config.file=/etc/alertmanager/alertmanager.yml

kubectl exec -n monitoring <alertmanager-pod> -- amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml alertname=TestAlert severity=warning

kubectl describe configmap alertmanager-config -n monitoring

Resolution

  • Review the route configuration
  • Verify the label matchers on each route
  • Test routing with amtool config routes test
  • Adjust route order (the first matching route wins unless continue is set)

Issue 5: Notification channel failures

Symptoms

  • Email delivery fails
  • Slack notifications fail

Diagnostic steps

bash
kubectl logs -n monitoring <alertmanager-pod> | grep -i smtp

kubectl logs -n monitoring <alertmanager-pod> | grep -i slack

kubectl exec -n monitoring <alertmanager-pod> -- cat /etc/alertmanager/alertmanager.yml

kubectl exec -n monitoring <alertmanager-pod> -- wget -qO- http://localhost:9093/api/v1/status

Resolution

  • Check the SMTP settings
  • Verify the Slack webhook URL (a curl test is sketched below)
  • Test network connectivity from the AlertManager pod
  • Check credentials and authentication settings
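
A Slack incoming-webhook URL can be tested independently of AlertManager. If the curl below succeeds (Slack replies with ok), the problem lies in the AlertManager configuration or network policy rather than the webhook itself; substitute the URL configured in slack_api_url.

bash
curl -X POST -H 'Content-type: application/json' \
  --data '{"text": "AlertManager connectivity test"}' \
  https://hooks.slack.com/services/TXXXXXXXX/BXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXX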

Best Practices

1. Alert rule best practices

Severity levels

yaml
severity:
  - critical: handle immediately; business is impacted
  - warning: handle soon; potential impact
  - info: for awareness only; no business impact

Alert naming convention

yaml
Naming format: <Component><Problem> (severity is carried in the severity label)
Examples:
  - NodeHighCPU
  - PodCrashLooping
  - ServiceDown

Alert annotation convention

yaml
annotations:
  summary: short description containing the key information
  description: detailed description including the measured value and the scope of impact
  runbook_url: link to the runbook for handling this alert

2. Alert routing best practices

Tiered routing

yaml
route:
  routes:
  - match:
      severity: critical
    receiver: critical-receiver
  - match:
      severity: warning
    receiver: warning-receiver
  - match:
      namespace: production
    receiver: prod-receiver

Sensible grouping

yaml
route:
  group_by: ['alertname', 'namespace', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

3. Alert inhibition best practices

Hierarchical inhibition

yaml
inhibit_rules:
- source_match:
    severity: critical
  target_match:
    severity: warning
  equal: ['alertname', 'namespace']

- source_match:
    alertname: NodeDown
  target_match_re:
    alertname: '.*'
  equal: ['node']

4. Silencing best practices

Maintenance-window silences

bash
curl -X POST http://alertmanager:9093/api/v1/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {"name": "node", "value": "node-1", "isRegex": false}
    ],
    "startsAt": "2024-01-15T10:00:00Z",
    "endsAt": "2024-01-15T12:00:00Z",
    "createdBy": "admin@example.com",
    "comment": "Planned maintenance"
  }'

5. Notification channel best practices

Multi-channel notifications

yaml
receivers:
- name: 'critical-receiver'
  email_configs:
  - to: 'oncall@example.com'
  slack_configs:
  - channel: '#critical-alerts'
  pagerduty_configs:
  - service_key: '<key>'

Notification content optimization

yaml
slack_configs:
- title: '{{ template "slack.title" . }}'
  text: '{{ template "slack.text" . }}'
  actions:
  - type: button
    text: 'Runbook'
    url: '{{ (index .Alerts 0).Annotations.runbook_url }}'

Summary

This chapter walked through the core concepts and practices of alert management in Kubernetes:

  1. Architecture: the core components of AlertManager and how alerts flow through them
  2. Deployment: deploying and configuring AlertManager in a cluster
  3. Alert rules: designing and configuring rules for nodes, Pods, and applications
  4. Notification channels: configuring email, Slack, PagerDuty, webhook, and WeChat receivers
  5. Practice: building a complete alerting setup through worked examples
  6. Troubleshooting: diagnosing and resolving common alerting problems

Alert management is a critical part of Kubernetes operations: a well-designed alerting pipeline is what makes it possible to detect and handle system anomalies before they impact the business.

Next Steps