Alert Management
Overview
Alert management is a key part of Kubernetes operations. An effective alerting mechanism lets you detect and handle system anomalies promptly and keeps your services running reliably. This chapter covers AlertManager configuration, alert rule design, and best practices for managing notification channels.
AlertManager Architecture
Core Components
┌────────────────────────────────────────────────────┐
│                 Prometheus Server                  │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Alert Rules  │ │  Evaluation  │ │    Alert     │ │
│ │              │ │    Engine    │ │  Generation  │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└────────────────────────────────────────────────────┘
                          ↓
┌────────────────────────────────────────────────────┐
│                    AlertManager                    │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Deduplication│ │   Grouping   │ │   Routing    │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└────────────────────────────────────────────────────┘
        ↓                 ↓                 ↓
 ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
 │    Email    │   │    Slack    │   │  PagerDuty  │
 └─────────────┘   └─────────────┘   └─────────────┘
 ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
 │   Webhook   │   │   WeChat    │   │     SMS     │
 └─────────────┘   └─────────────┘   └─────────────┘
Core Concepts
1. Alert rules
Alert conditions defined in Prometheus; an alert fires when its condition holds.
2. Routing
Dispatches alerts to different receivers based on their labels.
3. Grouping
Merges similar alerts into a single notification to reduce alert noise.
4. Inhibition
Suppresses related alerts while a higher-priority alert is firing.
5. Silencing
Temporarily mutes specific alerts, e.g. during maintenance windows or for known issues.
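These concepts map directly onto sections of alertmanager.yml (silences are a runtime operation created through the web UI, the API, or amtool rather than in the configuration file). The following minimal sketch shows where each concept lives; the receiver names and labels are placeholder assumptions:
yaml
route:                      # routing: dispatch alerts to receivers by label
  receiver: default         # assumed default receiver name
  group_by: ['alertname']   # grouping: alerts with the same alertname become one notification
  routes:
    - match:
        severity: critical
      receiver: oncall
receivers:
  - name: default
  - name: oncall
inhibit_rules:              # inhibition: a firing critical alert mutes matching warnings
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname']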
Deploying AlertManager
Option 1: Configure with a ConfigMap
1. Create the AlertManager configuration
yaml
# alertmanager-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: alertmanager-config
namespace: monitoring
data:
alertmanager.yml: |
global:
resolve_timeout: 5m
smtp_smarthost: 'smtp.example.com:587'
smtp_from: 'alertmanager@example.com'
smtp_auth_username: 'alertmanager@example.com'
smtp_auth_password: 'password123'
slack_api_url: 'https://hooks.slack.com/services/TXXXXXXXX/BXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXX'
templates:
- '/etc/alertmanager/templates/*.tmpl'
route:
group_by: ['alertname', 'namespace']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'default-receiver'
routes:
- match:
severity: critical
receiver: 'critical-receiver'
continue: false
- match:
severity: warning
receiver: 'warning-receiver'
continue: false
- match_re:
namespace: ^(production|staging)$
receiver: 'prod-receiver'
continue: false
receivers:
- name: 'default-receiver'
email_configs:
- to: 'ops-team@example.com'
send_resolved: true
- name: 'critical-receiver'
email_configs:
- to: 'oncall@example.com'
send_resolved: true
slack_configs:
- channel: '#critical-alerts'
send_resolved: true
title: '{{ .Status | toUpper }}: {{ .CommonLabels.alertname }}'
text: >-
{{ range .Alerts }}
*Alert:* {{ .Labels.alertname }}
*Severity:* {{ .Labels.severity }}
*Summary:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Details:*
{{ range .Labels.SortedPairs }} • *{{ .Name }}:* {{ .Value }}
{{ end }}
{{ end }}
pagerduty_configs:
- service_key: '<pagerduty-service-key>'
severity: critical
- name: 'warning-receiver'
email_configs:
- to: 'ops-team@example.com'
send_resolved: true
slack_configs:
- channel: '#warnings'
send_resolved: true
- name: 'prod-receiver'
email_configs:
- to: 'prod-team@example.com'
send_resolved: true
slack_configs:
- channel: '#production-alerts'
send_resolved: true
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'namespace']
- source_match:
alertname: 'NodeDown'
target_match_re:
alertname: '.*'
equal: ['node']
2. Create alert templates
yaml
# alertmanager-templates.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: alertmanager-templates
namespace: monitoring
data:
default.tmpl: |
{{ define "slack.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
{{ end }}
{{ define "slack.text" }}
{{ range .Alerts }}
*Alert:* {{ .Labels.alertname }}
*Status:* {{ .Status }}
*Severity:* {{ .Labels.severity }}
*Started:* {{ .StartsAt }}
{{ if .EndsAt }}
*Ended:* {{ .EndsAt }}
{{ end }}
*Summary:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
{{ end }}
{{ end }}
{{ define "email.subject" }}
[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }} - {{ .CommonLabels.severity }}
{{ end }}
{{ define "email.body" }}
{{ range .Alerts }}
Alert: {{ .Labels.alertname }}
Status: {{ .Status }}
Severity: {{ .Labels.severity }}
Started: {{ .StartsAt }}
{{ if .EndsAt }}
Ended: {{ .EndsAt }}
{{ end }}
Summary: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Labels:
{{ range .Labels.SortedPairs }}
{{ .Name }}: {{ .Value }}
{{ end }}
Annotations:
{{ range .Annotations.SortedPairs }}
{{ .Name }}: {{ .Value }}
{{ end }}
---
{{ end }}
{{ end }}
3. Deploy AlertManager
yaml
# alertmanager-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: alertmanager
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: alertmanager
template:
metadata:
labels:
app: alertmanager
spec:
containers:
- name: alertmanager
image: prom/alertmanager:v0.25.0
args:
- "--config.file=/etc/alertmanager/alertmanager.yml"
- "--storage.path=/alertmanager"
- "--data.retention=120h"
- "--web.external-url=http://alertmanager:9093"
- "--web.route-prefix=/"
- "--cluster.listen-address="
ports:
- containerPort: 9093
name: web
volumeMounts:
- name: config
mountPath: /etc/alertmanager
- name: templates
mountPath: /etc/alertmanager/templates
- name: storage
mountPath: /alertmanager
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
livenessProbe:
httpGet:
path: /-/healthy
port: web
initialDelaySeconds: 30
timeoutSeconds: 10
readinessProbe:
httpGet:
path: /-/ready
port: web
initialDelaySeconds: 5
timeoutSeconds: 10
volumes:
- name: config
configMap:
name: alertmanager-config
- name: templates
configMap:
name: alertmanager-templates
- name: storage
emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
name: alertmanager
namespace: monitoring
spec:
type: NodePort
ports:
- port: 9093
targetPort: web
nodePort: 30093
name: web
selector:
app: alertmanager
Deployment commands
bash
kubectl apply -f alertmanager-config.yaml
kubectl apply -f alertmanager-templates.yaml
kubectl apply -f alertmanager-deployment.yaml
kubectl get pods -n monitoring -l app=alertmanager
kubectl get svc -n monitoring alertmanager
kubectl port-forward -n monitoring svc/alertmanager 9093:9093
Option 2: Deploy with Helm
bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install alertmanager prometheus-community/alertmanager \
--namespace monitoring \
--set config.global.resolve_timeout=5m \
--set config.global.smtp_smarthost='smtp.example.com:587' \
--set persistence.enabled=true \
--set persistence.size=1Gi \
--set service.type=NodePort \
--set service.nodePort=30093
Alert Rule Configuration
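Before the rules below can fire and reach anyone, Prometheus must load the rule files and know where AlertManager is. The following prometheus.yml fragment is a sketch that assumes the rule ConfigMaps are mounted at /etc/prometheus/rules/ and that the AlertManager Service created above is reachable as alertmanager.monitoring.svc:9093:
yaml
# prometheus.yml fragment (sketch)
rule_files:
  - /etc/prometheus/rules/*.rules        # assumed mount path of the rule ConfigMaps
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager.monitoring.svc:9093']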
Node alert rules
yaml
# node-alerts.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: node-alert-rules
namespace: monitoring
data:
node-alerts.rules: |
groups:
- name: node-alerts
interval: 30s
rules:
- alert: NodeDown
expr: up{job="node-exporter"} == 0
for: 5m
labels:
severity: critical
category: node
annotations:
summary: "Node {{ $labels.instance }} is down"
description: "Node {{ $labels.instance }} has been unreachable for more than 5 minutes"
runbook_url: "https://wiki.example.com/runbook/node-down"
- alert: NodeHighCPU
expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
category: node
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value | printf \"%.2f\" }}% on node {{ $labels.instance }}"
runbook_url: "https://wiki.example.com/runbook/high-cpu"
- alert: NodeHighMemory
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
for: 5m
labels:
severity: warning
category: node
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is {{ $value | printf \"%.2f\" }}% on node {{ $labels.instance }}"
runbook_url: "https://wiki.example.com/runbook/high-memory"
- alert: NodeDiskRunningLow
expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
for: 5m
labels:
severity: critical
category: node
annotations:
summary: "Disk space low on {{ $labels.instance }}"
description: "Only {{ $value | printf \"%.2f\" }}% disk space available on node {{ $labels.instance }}"
runbook_url: "https://wiki.example.com/runbook/disk-space"
- alert: NodeDiskIORate
expr: rate(node_disk_io_time_seconds_total[5m]) > 0.8
for: 10m
labels:
severity: warning
category: node
annotations:
summary: "High disk I/O on {{ $labels.instance }}"
description: "Disk I/O time is {{ $value | printf \"%.2f\" }}s on node {{ $labels.instance }}"Pod告警规则
yaml
# pod-alerts.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: pod-alert-rules
namespace: monitoring
data:
pod-alerts.rules: |
groups:
- name: pod-alerts
interval: 30s
rules:
- alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 0
for: 5m
labels:
severity: warning
category: pod
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has restarted {{ $value | printf \"%.0f\" }} times in the last 15 minutes"
runbook_url: "https://wiki.example.com/runbook/crash-loop"
- alert: PodNotReady
expr: kube_pod_status_phase{phase=~"Pending|Unknown"} == 1
for: 10m
labels:
severity: warning
category: pod
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is not ready"
description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in {{ $labels.phase }} state for more than 10 minutes"
runbook_url: "https://wiki.example.com/runbook/pod-not-ready"
- alert: PodHighCPU
expr: |
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace)
/
sum(kube_pod_container_resource_limits{resource="cpu"}) by (pod, namespace)
* 100 > 80
for: 5m
labels:
severity: warning
category: pod
annotations:
summary: "High CPU usage for pod {{ $labels.namespace }}/{{ $labels.pod }}"
description: "CPU usage is {{ $value | printf \"%.2f\" }}% for pod {{ $labels.namespace }}/{{ $labels.pod }}"
runbook_url: "https://wiki.example.com/runbook/high-cpu-pod"
- alert: PodHighMemory
expr: |
sum(container_memory_working_set_bytes{container!=""}) by (pod, namespace)
/
sum(kube_pod_container_resource_limits{resource="memory"}) by (pod, namespace)
* 100 > 80
for: 5m
labels:
severity: warning
category: pod
annotations:
summary: "High memory usage for pod {{ $labels.namespace }}/{{ $labels.pod }}"
description: "Memory usage is {{ $value | printf \"%.2f\" }}% for pod {{ $labels.namespace }}/{{ $labels.pod }}"
runbook_url: "https://wiki.example.com/runbook/high-memory-pod"
- alert: ContainerOomKilled
expr: increase(container_oom_events_total[5m]) > 0
for: 1m
labels:
severity: critical
category: pod
annotations:
summary: "Container OOM killed in {{ $labels.namespace }}/{{ $labels.pod }}"
description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} was OOM killed"
runbook_url: "https://wiki.example.com/runbook/oom-killed"应用告警规则
yaml
# app-alerts.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: app-alert-rules
namespace: monitoring
data:
app-alerts.rules: |
groups:
- name: application-alerts
interval: 30s
rules:
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
* 100 > 5
for: 5m
labels:
severity: critical
category: application
annotations:
summary: "High error rate for service {{ $labels.service }}"
description: "Error rate is {{ $value | printf \"%.2f\" }}% for service {{ $labels.service }}"
runbook_url: "https://wiki.example.com/runbook/high-error-rate"
- alert: HighLatency
expr: |
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 1
for: 5m
labels:
severity: warning
category: application
annotations:
summary: "High latency for service {{ $labels.service }}"
description: "P95 latency is {{ $value | printf \"%.2f\" }}s for service {{ $labels.service }}"
runbook_url: "https://wiki.example.com/runbook/high-latency"
- alert: LowThroughput
expr: sum(rate(http_requests_total[5m])) by (service) < 10
for: 5m
labels:
severity: warning
category: application
annotations:
summary: "Low throughput for service {{ $labels.service }}"
description: "Throughput is {{ $value | printf \"%.2f\" }} requests/s for service {{ $labels.service }}"
runbook_url: "https://wiki.example.com/runbook/low-throughput"
- alert: ServiceDown
expr: up{job="app"} == 0
for: 5m
labels:
severity: critical
category: application
annotations:
summary: "Service {{ $labels.instance }} is down"
description: "Service {{ $labels.instance }} has been down for more than 5 minutes"
runbook_url: "https://wiki.example.com/runbook/service-down"通知渠道配置
Email configuration
yaml
receivers:
- name: 'email-receiver'
email_configs:
- to: 'ops-team@example.com'
from: 'alertmanager@example.com'
smarthost: 'smtp.example.com:587'
auth_username: 'alertmanager@example.com'
auth_password: 'password123'
auth_identity: 'alertmanager@example.com'
auth_secret: 'password123'
send_resolved: true
headers:
Subject: '{{ template "email.subject" . }}'
html: '{{ template "email.body" . }}'
Slack configuration
yaml
receivers:
- name: 'slack-receiver'
slack_configs:
- api_url: 'https://hooks.slack.com/services/TXXXXXXXX/BXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXX'
channel: '#alerts'
send_resolved: true
title: '{{ template "slack.title" . }}'
text: '{{ template "slack.text" . }}'
color: '{{ if eq .Status "firing" }}{{ if eq .CommonLabels.severity "critical" }}danger{{ else if eq .CommonLabels.severity "warning" }}warning{{ else }}good{{ end }}{{ else }}good{{ end }}'
actions:
- type: button
text: 'Runbook'
url: '{{ (index .Alerts 0).Annotations.runbook_url }}'
- type: button
text: 'Query'
url: 'https://prometheus.example.com/graph?g0.expr={{ (index .Alerts 0).Labels.alertname }}'
- type: button
text: 'Dashboard'
url: 'https://grafana.example.com/d/{{ .CommonLabels.dashboard_id }}'
PagerDuty configuration
yaml
receivers:
- name: 'pagerduty-receiver'
pagerduty_configs:
- service_key: '<pagerduty-service-key>'
severity: '{{ .CommonLabels.severity }}'
class: '{{ .CommonLabels.category }}'
component: '{{ .CommonLabels.service }}'
group: '{{ .CommonLabels.namespace }}'
details:
firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
resolved: '{{ template "pagerduty.default.instances" .Alerts.Resolved }}'
num_firing: '{{ .Alerts.Firing | len }}'
num_resolved: '{{ .Alerts.Resolved | len }}'
Webhook configuration
yaml
receivers:
- name: 'webhook-receiver'
webhook_configs:
- url: 'http://webhook-service:5000/alert'
send_resolved: true
http_config:
basic_auth:
username: webhook_user
password: webhook_password
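AlertManager delivers each group of alerts to the url above as an HTTP POST with a JSON body. The payload below is a simplified sketch of that webhook format (field names follow AlertManager's webhook payload; check the documentation of the version you run for the exact schema):
json
{
  "version": "4",
  "status": "firing",
  "receiver": "webhook-receiver",
  "groupLabels": { "alertname": "NodeHighCPU" },
  "commonLabels": { "alertname": "NodeHighCPU", "severity": "warning" },
  "commonAnnotations": { "summary": "High CPU usage on node-1" },
  "externalURL": "http://alertmanager:9093",
  "alerts": [
    {
      "status": "firing",
      "labels": { "alertname": "NodeHighCPU", "severity": "warning" },
      "annotations": { "summary": "High CPU usage on node-1" },
      "startsAt": "2024-01-15T10:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://prometheus:9090/graph"
    }
  ]
}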
WeChat configuration
yaml
receivers:
- name: 'wechat-receiver'
wechat_configs:
- corp_id: '<corp-id>'
to_party: '<party-id>'
agent_id: '<agent-id>'
api_secret: '<api-secret>'
send_resolved: true
message: '{{ template "wechat.message" . }}'
Hands-on Examples
Example 1: Complete alerting configuration
yaml
# complete-alerting.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: complete-alertmanager-config
namespace: monitoring
data:
alertmanager.yml: |
global:
resolve_timeout: 5m
smtp_smarthost: 'smtp.example.com:587'
smtp_from: 'alertmanager@example.com'
smtp_auth_username: 'alertmanager@example.com'
smtp_auth_password: 'password123'
slack_api_url: 'https://hooks.slack.com/services/TXXXXXXXX/BXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXX'
route:
group_by: ['alertname', 'namespace', 'severity']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'default-receiver'
routes:
- match:
severity: critical
receiver: 'critical-receiver'
group_wait: 10s
repeat_interval: 1h
- match:
severity: warning
receiver: 'warning-receiver'
repeat_interval: 3h
- match:
namespace: production
receiver: 'prod-receiver'
receivers:
- name: 'default-receiver'
email_configs:
- to: 'ops-team@example.com'
send_resolved: true
- name: 'critical-receiver'
email_configs:
- to: 'oncall@example.com'
send_resolved: true
slack_configs:
- channel: '#critical-alerts'
send_resolved: true
pagerduty_configs:
- service_key: '<pagerduty-service-key>'
severity: critical
- name: 'warning-receiver'
email_configs:
- to: 'ops-team@example.com'
send_resolved: true
slack_configs:
- channel: '#warnings'
send_resolved: true
- name: 'prod-receiver'
email_configs:
- to: 'prod-team@example.com'
send_resolved: true
slack_configs:
- channel: '#production-alerts'
send_resolved: true
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'namespace']
- source_match:
alertname: 'NodeDown'
target_match_re:
alertname: '.*'
equal: ['node']
Example 2: Configuring silences
bash
# Create a silence
curl -X POST http://alertmanager:9093/api/v1/silences \
-H "Content-Type: application/json" \
-d '{
"matchers": [
{
"name": "alertname",
"value": "NodeHighCPU",
"isRegex": false
},
{
"name": "node",
"value": "node-1",
"isRegex": false
}
],
"startsAt": "2024-01-15T10:00:00Z",
"endsAt": "2024-01-15T12:00:00Z",
"createdBy": "admin@example.com",
"comment": "Planned maintenance on node-1"
}'
# List silences
curl http://alertmanager:9093/api/v1/silences
# Delete a silence
curl -X DELETE http://alertmanager:9093/api/v1/silence/<silence-id>
Example 3: Alert routing configuration
yaml
# advanced-routing.yaml
route:
group_by: ['alertname', 'namespace']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'default-receiver'
routes:
# Critical alerts: notify immediately
- match:
severity: critical
receiver: 'critical-receiver'
group_wait: 10s
repeat_interval: 1h
continue: false
# Production environment alerts
- match:
namespace: production
receiver: 'prod-receiver'
routes:
- match:
severity: critical
receiver: 'prod-critical-receiver'
# Development environment alerts
- match:
namespace: development
receiver: 'dev-receiver'
repeat_interval: 12h
# Alerts for specific services
- match_re:
service: ^(api|web|backend)$
receiver: 'service-receiver'
routes:
- match:
service: api
receiver: 'api-team-receiver'
- match:
service: web
receiver: 'web-team-receiver'
- match:
service: backend
receiver: 'backend-team-receiver'
# Node alerts
- match:
category: node
receiver: 'infra-team-receiver'
# Pod alerts
- match:
category: pod
receiver: 'app-team-receiver'
kubectl Command Reference
Managing AlertManager resources
bash
kubectl get all -n monitoring -l app=alertmanager
kubectl get pods -n monitoring -l app=alertmanager
kubectl logs -n monitoring -l app=alertmanager -f
kubectl describe pod -n monitoring -l app=alertmanager
kubectl exec -it -n monitoring <alertmanager-pod> -- sh
kubectl port-forward -n monitoring svc/alertmanager 9093:9093
Configuration management
bash
kubectl get configmap -n monitoring
kubectl describe configmap alertmanager-config -n monitoring
kubectl edit configmap alertmanager-config -n monitoring
kubectl apply -f alertmanager-config.yaml
kubectl rollout restart deployment/alertmanager -n monitoring
Alert management
bash
# View current alerts
kubectl exec -n monitoring <alertmanager-pod> -- wget -qO- http://localhost:9093/api/v1/alerts
# List silences
kubectl exec -n monitoring <alertmanager-pod> -- wget -qO- http://localhost:9093/api/v1/silences
# Check status
kubectl exec -n monitoring <alertmanager-pod> -- wget -qO- http://localhost:9093/api/v1/status
# Send a test alert and observe which receiver it is routed to
kubectl exec -n monitoring <alertmanager-pod> -- wget -qO- --header='Content-Type: application/json' --post-data='[{"labels":{"alertname":"TestAlert","severity":"warning"}}]' http://localhost:9093/api/v1/alerts
Viewing alerts from Prometheus
bash
# View alert rules
kubectl exec -n monitoring <prometheus-pod> -- wget -qO- http://localhost:9090/api/v1/rules
# View current alerts
kubectl exec -n monitoring <prometheus-pod> -- wget -qO- http://localhost:9090/api/v1/alerts
# View the AlertManager instances known to Prometheus
kubectl exec -n monitoring <prometheus-pod> -- wget -qO- http://localhost:9090/api/v1/alertmanagers
Troubleshooting Guide
Problem 1: Alerts are not firing
Symptoms
- Alert rules exist but never fire
- No alerts are visible in Prometheus
Troubleshooting steps
bash
kubectl exec -n monitoring <prometheus-pod> -- wget -qO- http://localhost:9090/api/v1/rules
kubectl exec -n monitoring <prometheus-pod> -- wget -qO- 'http://localhost:9090/api/v1/query?query=<alert-expression>'
kubectl logs -n monitoring <prometheus-pod> | grep -i alert
kubectl describe configmap prometheus-rules -n monitoring
Solutions
- Check the alert rule syntax (e.g. validate it with promtool, as shown below)
- Validate the PromQL expression
- Confirm that the underlying metrics exist
- Check the alert's "for" duration
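For example, you can validate the rule file syntax with promtool and then run the alert expression directly against Prometheus to confirm it returns data. The commands below are a sketch; they assume the rule file is available locally and Prometheus is reachable on localhost:9090 (e.g. via port-forward):
bash
# Validate rule file syntax (assumes promtool is installed locally)
promtool check rules node-alerts.rules
# Run the alert expression by hand and confirm it returns samples
curl 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up{job="node-exporter"} == 0'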
Problem 2: Alerts are not delivered
Symptoms
- Alerts fire but no notifications arrive
- AlertManager never receives the alerts
Troubleshooting steps
bash
kubectl exec -n monitoring <alertmanager-pod> -- wget -qO- http://localhost:9093/api/v1/alerts
kubectl logs -n monitoring <alertmanager-pod> | grep -i error
kubectl exec -n monitoring <prometheus-pod> -- wget -qO- http://localhost:9090/api/v1/alertmanagers
kubectl describe configmap alertmanager-config -n monitoring
Solutions
- Check the AlertManager configuration (e.g. validate it with amtool, as shown below)
- Verify the connection between Prometheus and AlertManager
- Confirm the notification channels are configured correctly
- Check network policies
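A quick way to narrow this down is to validate the configuration with amtool inside the AlertManager Pod (the prom/alertmanager image ships with amtool) and to check that Prometheus can reach AlertManager over the network; the Service address below is an assumption based on the manifests above:
bash
# Validate the AlertManager configuration syntax
kubectl exec -n monitoring <alertmanager-pod> -- amtool check-config /etc/alertmanager/alertmanager.yml
# Check connectivity from the Prometheus Pod to AlertManager
kubectl exec -n monitoring <prometheus-pod> -- wget -qO- http://alertmanager.monitoring.svc:9093/-/healthy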
Problem 3: Too much alert noise
Symptoms
- Large numbers of duplicate alerts
- Too many low-value alerts
Troubleshooting steps
bash
kubectl exec -n monitoring <alertmanager-pod> -- wget -qO- http://localhost:9093/api/v1/alerts
kubectl exec -n monitoring <alertmanager-pod> -- wget -qO- http://localhost:9093/api/v1/silences
kubectl describe configmap alertmanager-config -n monitoring
Solutions
- Tune the alert grouping configuration
- Increase the repeat interval
- Configure inhibition rules
- Add silences for known issues (see the amtool example below)
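For a known issue, a silence can be created quickly with amtool (bundled in the prom/alertmanager image); the matcher labels and values below are assumptions:
bash
# Silence NodeHighCPU alerts from node-1 for two hours
kubectl exec -n monitoring <alertmanager-pod> -- amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --comment="known issue" --duration=2h \
  alertname=NodeHighCPU node=node-1
# List active silences
kubectl exec -n monitoring <alertmanager-pod> -- amtool silence query --alertmanager.url=http://localhost:9093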
Problem 4: Alerts are routed incorrectly
Symptoms
- Alerts are sent to the wrong receiver
- Certain alerts are not routed as expected
Troubleshooting steps
bash
kubectl exec -n monitoring <alertmanager-pod> -- wget -qO- http://localhost:9093/api/v1/status
kubectl exec -n monitoring <alertmanager-pod> -- wget -qO- --header='Content-Type: application/json' --post-data='[{"labels":{"alertname":"TestAlert","severity":"warning"}}]' http://localhost:9093/api/v1/alerts
kubectl describe configmap alertmanager-config -n monitoring
Solutions
- Check the routing configuration
- Verify the label matching rules
- Test alert routing (see the amtool example below)
- Adjust route ordering and priorities
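Routing can also be tested offline against the configuration file with amtool; the label values below are assumptions:
bash
# Print the routing tree
kubectl exec -n monitoring <alertmanager-pod> -- amtool config routes show --config.file=/etc/alertmanager/alertmanager.yml
# Check which receiver a given label set would be routed to
kubectl exec -n monitoring <alertmanager-pod> -- amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml severity=critical namespace=production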
Problem 5: Notification channel failures
Symptoms
- Email delivery fails
- Slack notifications fail
Troubleshooting steps
bash
kubectl logs -n monitoring <alertmanager-pod> | grep -i smtp
kubectl logs -n monitoring <alertmanager-pod> | grep -i slack
kubectl exec -n monitoring <alertmanager-pod> -- cat /etc/alertmanager/alertmanager.yml
kubectl exec -n monitoring <alertmanager-pod> -- wget -qO- http://localhost:9093/api/v1/status
Solutions
- Check the SMTP configuration
- Verify the Slack webhook URL (you can test it directly, as shown below)
- Test network connectivity
- Check the credentials
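To separate channel problems from AlertManager problems, test the channels directly; the webhook URL and SMTP host below are the placeholder values used earlier in this chapter:
bash
# Send a test message straight to the Slack incoming webhook
curl -X POST -H 'Content-Type: application/json' \
  -d '{"text": "alertmanager connectivity test"}' \
  'https://hooks.slack.com/services/TXXXXXXXX/BXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXX'
# Check TCP connectivity to the SMTP server from inside the cluster
kubectl run -it --rm smtp-test --image=busybox --restart=Never -- nc -zv smtp.example.com 587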
Best Practices
1. Alert rule best practices
Alert severity levels
yaml
severity:
- critical: act immediately; business impact
- warning: handle soon; potential impact
- info: informational only; no business impact
Alert naming conventions
yaml
Naming format: <Component><Problem>, with the level carried in the severity label
Examples:
- NodeHighCPU
- PodCrashLooping
- ServiceDown
Alert content conventions
yaml
annotations:
summary: a short description with the key facts
description: a detailed description with concrete values and the scope of impact
runbook_url: link to the runbook for handling this alert
2. Alert routing best practices
Layered routing
yaml
route:
routes:
- match:
severity: critical
receiver: critical-receiver
- match:
severity: warning
receiver: warning-receiver
- match:
namespace: production
receiver: prod-receiver
Sensible grouping
yaml
route:
group_by: ['alertname', 'namespace', 'severity']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
3. Inhibition best practices
Hierarchical inhibition
yaml
inhibit_rules:
- source_match:
severity: critical
target_match:
severity: warning
equal: ['alertname', 'namespace']
- source_match:
alertname: NodeDown
target_match_re:
alertname: '.*'
equal: ['node']
4. Silencing best practices
Maintenance-window silences
bash
curl -X POST http://alertmanager:9093/api/v1/silences \
-H "Content-Type: application/json" \
-d '{
"matchers": [
{"name": "node", "value": "node-1", "isRegex": false}
],
"startsAt": "2024-01-15T10:00:00Z",
"endsAt": "2024-01-15T12:00:00Z",
"createdBy": "admin@example.com",
"comment": "Planned maintenance"
}'
5. Notification channel best practices
Multi-channel notifications
yaml
receivers:
- name: 'critical-receiver'
email_configs:
- to: 'oncall@example.com'
slack_configs:
- channel: '#critical-alerts'
pagerduty_configs:
- service_key: '<key>'
Optimizing notification content
yaml
slack_configs:
- title: '{{ template "slack.title" . }}'
text: '{{ template "slack.text" . }}'
actions:
- type: button
text: 'Runbook'
url: '{{ (index .Alerts 0).Annotations.runbook_url }}'
Summary
This chapter covered the core concepts and practices of alert management in Kubernetes:
- Architecture: the core components of AlertManager and how it works
- Deployment and configuration: how to deploy and configure AlertManager
- Alert rules: how to design and configure alert rules
- Notification channels: how to configure multiple notification channels
- Hands-on practice: a complete alerting setup built through real examples
- Troubleshooting: how to diagnose and resolve common problems
Alert management is a critical part of operating Kubernetes: it is what ensures system anomalies are detected and handled in time.
Next Steps
- Prometheus monitoring - dive deeper into the Prometheus monitoring stack
- Grafana visualization - learn to design Grafana dashboards
- Log management - learn how to collect and analyze Kubernetes logs
- Metrics fundamentals - review the basics of the Kubernetes monitoring stack