Prometheus
Overview
Prometheus is an open-source systems monitoring and alerting toolkit, originally built at SoundCloud and now a graduated CNCF project. It offers a powerful multi-dimensional data model, a flexible query language (PromQL), and a mature alerting pipeline, making it the most popular monitoring solution in the Kubernetes ecosystem.
Prometheus Architecture
Core Components
┌─────────────────────────────────────────────────────────────┐
│                      Prometheus Server                      │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐     │
│  │  Retrieval   │   │     TSDB     │   │ HTTP Server  │     │
│  │              │   │ (time series)│   │              │     │
│  └──────────────┘   └──────────────┘   └──────────────┘     │
└─────────────────────────────────────────────────────────────┘
        ↓                    ↓                    ↓
┌─────────────┐   ┌──────────────┐   ┌──────────────────┐
│  Exporter   │   │   Service    │   │   AlertManager   │
│  (metrics)  │   │  Discovery   │   │    (alerting)    │
└─────────────┘   └──────────────┘   └──────────────────┘
        ↓                    ↓                    ↓
┌─────────────┐   ┌──────────────┐   ┌──────────────────┐
│ Pushgateway │   │   Grafana    │   │  Notifications   │
│             │   │ (dashboards) │   │ (Email/Slack/…)  │
└─────────────┘   └──────────────┘   └──────────────────┘
Data Model
Prometheus stores all data as time series; each series is uniquely identified by its metric name and a set of labels:
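To make the notation concrete, here is a small Python sketch (illustrative only, not part of any Prometheus library) that composes a series identifier from a metric name and a label set, matching the format shown below:

```python
def series_id(metric: str, labels: dict) -> str:
    """Compose the canonical Prometheus series notation:
    <metric name>{<label name>="<label value>", ...}
    Labels are sorted so the same label set always yields the same identifier."""
    if not labels:
        return metric
    inner = ", ".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{metric}{{{inner}}}"

print(series_id("http_requests_total",
                {"method": "GET", "handler": "/api/v1/users", "status": "200"}))
# http_requests_total{handler="/api/v1/users", method="GET", status="200"}
```

Two series that differ in even one label value are entirely separate time series.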
<metric name>{<label name>=<label value>, ...}
Example:
http_requests_total{method="GET", handler="/api/v1/users", status="200"}
Deploying Prometheus
Option 1: ConfigMap-based configuration
1. Create the namespace and RBAC
yaml
# prometheus-rbac.yaml
apiVersion: v1
kind: Namespace
metadata:
name: monitoring
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: prometheus
namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: prometheus
rules:
- apiGroups: [""]
resources:
- nodes
- nodes/proxy
- nodes/metrics
- services
- endpoints
- pods
verbs: ["get", "list", "watch"]
- apiGroups: ["extensions", "networking.k8s.io"]
resources:
- ingresses
verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: prometheus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: prometheus
subjects:
- kind: ServiceAccount
name: prometheus
namespace: monitoring
2. Create the Prometheus configuration
yaml
# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: monitoring
data:
prometheus.yml: |
global:
scrape_interval: 15s
scrape_timeout: 10s
evaluation_interval: 15s
external_labels:
monitor: 'k8s-monitor'
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
rule_files:
- /etc/prometheus/rules/*.rules
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
- job_name: 'kubernetes-nodes-cadvisor'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
- job_name: 'kubernetes-service-endpoints'
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
action: replace
target_label: __scheme__
regex: (https?)
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
action: replace
target_label: kubernetes_name
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
3. Create alerting rules
yaml
# prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-rules
namespace: monitoring
data:
alert.rules: |
groups:
- name: node-alerts
rules:
- alert: NodeHighCPU
expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value }}%"
- alert: NodeHighMemory
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is {{ $value }}%"
- alert: NodeDiskRunningLow
expr: (node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"}) * 100 < 15
for: 5m
labels:
severity: critical
annotations:
summary: "Disk space low on {{ $labels.instance }}"
description: "Disk usage is {{ $value }}%"
- name: pod-alerts
rules:
- alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 0
for: 5m
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.pod }} is crash looping"
description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has restarted {{ $value }} times in last 15 minutes"
- alert: PodNotReady
expr: kube_pod_status_phase{phase=~"Pending|Unknown"} > 0
for: 10m
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.pod }} is not ready"
description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in {{ $labels.phase }} state for more than 10 minutes"
- alert: ContainerHighCPU
expr: sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace) / sum(kube_pod_container_resource_limits{resource="cpu"}) by (pod, namespace) * 100 > 80
for: 5m
labels:
severity: warning
annotations:
summary: "Container high CPU usage"
description: "Container {{ $labels.namespace }}/{{ $labels.pod }} CPU usage is {{ $value }}%"
- name: kubernetes-alerts
rules:
- alert: KubernetesNodeNotReady
expr: kube_node_status_condition{condition="Ready",status="true"} == 0
for: 10m
labels:
severity: critical
annotations:
summary: "Node {{ $labels.node }} is not ready"
description: "Node {{ $labels.node }} has been unready for more than 10 minutes"
- alert: KubernetesUnreachableNodes
expr: count(kube_node_status_condition{condition="Ready",status="unknown"} == 1) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Kubernetes unreachable nodes"
description: "There are {{ $value }} unreachable nodes"
4. Deploy Prometheus
yaml
# prometheus-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
serviceAccountName: prometheus
containers:
- name: prometheus
image: prom/prometheus:v2.45.0
args:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.path=/prometheus"
- "--storage.tsdb.retention.time=15d"
- "--web.console.libraries=/etc/prometheus/console_libraries"
- "--web.console.templates=/etc/prometheus/consoles"
- "--web.enable-lifecycle"
ports:
- containerPort: 9090
name: web
volumeMounts:
- name: config
mountPath: /etc/prometheus
- name: rules
mountPath: /etc/prometheus/rules
- name: storage
mountPath: /prometheus
resources:
requests:
cpu: 500m
memory: 500Mi
limits:
cpu: 1000m
memory: 1Gi
livenessProbe:
httpGet:
path: /-/healthy
port: web
initialDelaySeconds: 30
timeoutSeconds: 10
readinessProbe:
httpGet:
path: /-/ready
port: web
initialDelaySeconds: 5
timeoutSeconds: 10
volumes:
- name: config
configMap:
name: prometheus-config
- name: rules
configMap:
name: prometheus-rules
- name: storage
emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
name: prometheus
namespace: monitoring
spec:
type: NodePort
ports:
- port: 9090
targetPort: web
nodePort: 30090
name: web
selector:
app: prometheus
Deployment commands
bash
kubectl apply -f prometheus-rbac.yaml
kubectl apply -f prometheus-config.yaml
kubectl apply -f prometheus-rules.yaml
kubectl apply -f prometheus-deployment.yaml
kubectl get pods -n monitoring -l app=prometheus
kubectl get svc -n monitoring prometheus
Option 2: Deploy with Helm
bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/prometheus \
--namespace monitoring \
--create-namespace \
--set server.persistentVolume.enabled=true \
--set server.persistentVolume.size=10Gi \
--set server.retention="15d"
Metrics Collection
Node Exporter
Deploying Node Exporter
yaml
# node-exporter.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: node-exporter
namespace: monitoring
labels:
app: node-exporter
spec:
selector:
matchLabels:
app: node-exporter
template:
metadata:
labels:
app: node-exporter
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9100"
spec:
hostNetwork: true
hostPID: true
containers:
- name: node-exporter
image: prom/node-exporter:v1.6.0
args:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--path.rootfs=/host/root'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
ports:
- name: metrics
containerPort: 9100
hostPort: 9100
volumeMounts:
- name: proc
mountPath: /host/proc
readOnly: true
- name: sys
mountPath: /host/sys
readOnly: true
- name: root
mountPath: /host/root
mountPropagation: HostToContainer
readOnly: true
resources:
requests:
cpu: 100m
memory: 100Mi
limits:
cpu: 200m
memory: 200Mi
tolerations:
- effect: NoSchedule
operator: Exists
volumes:
- name: proc
hostPath:
path: /proc
- name: sys
hostPath:
path: /sys
- name: root
hostPath:
path: /
Deploy and verify
bash
kubectl apply -f node-exporter.yaml
kubectl get pods -n monitoring -l app=node-exporter
kubectl logs -n monitoring -l app=node-exporter
Application Metrics Export
Example: adding Prometheus metrics to an application
yaml
# app-with-metrics.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: sample-app
namespace: monitoring
spec:
replicas: 2
selector:
matchLabels:
app: sample-app
template:
metadata:
labels:
app: sample-app
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
containers:
- name: app
image: nginx:latest # placeholder; swap in an image that actually serves /metrics on port 8080
ports:
- containerPort: 8080
env:
- name: PORT
value: "8080"
---
apiVersion: v1
kind: Service
metadata:
name: sample-app
namespace: monitoring
labels:
app: sample-app
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
spec:
selector:
app: sample-app
ports:
- port: 8080
targetPort: 8080
name: metrics
Custom application metrics example (Python)
python
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import random
import time
REQUEST_COUNT = Counter(
'app_request_count',
'Application Request Count',
['method', 'endpoint', 'http_status']
)
REQUEST_LATENCY = Histogram(
'app_request_latency_seconds',
'Application Request Latency',
['endpoint']
)
IN_PROGRESS = Gauge(
'app_in_progress',
'Number of in-progress requests'
)
def simulate_request():
IN_PROGRESS.inc()
start_time = time.time()
time.sleep(random.uniform(0.1, 0.5))
REQUEST_COUNT.labels(
method='GET',
endpoint='/api/v1/users',
http_status='200'
).inc()
REQUEST_LATENCY.labels(endpoint='/api/v1/users').observe(time.time() - start_time)
IN_PROGRESS.dec()
if __name__ == '__main__':
start_http_server(8080)
while True:
simulate_request()
time.sleep(1)
Dockerfile
dockerfile
FROM python:3.9-slim
WORKDIR /app
RUN pip install prometheus-client
COPY app.py .
EXPOSE 8080
CMD ["python", "app.py"]
Service Discovery
Kubernetes service discovery configuration
1. Endpoints discovery
yaml
scrape_configs:
- job_name: 'kubernetes-service-endpoints'
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
action: replace
target_label: __scheme__
regex: (https?)
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
2. Pod discovery
yaml
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
3. Node discovery
yaml
scrape_configs:
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
4. Service discovery
yaml
scrape_configs:
- job_name: 'kubernetes-services'
kubernetes_sd_configs:
- role: service
metrics_path: /probe
params:
module: [http_2xx]
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
action: keep
regex: true
- source_labels: [__address__]
target_label: __param_target
- target_label: __address__
replacement: blackbox-exporter:9115
- source_labels: [__param_target]
target_label: instance
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
target_label: kubernetes_name
Service Discovery Labels
Common metadata labels
__meta_kubernetes_namespace
__meta_kubernetes_pod_name
__meta_kubernetes_pod_container_name
__meta_kubernetes_pod_label_<labelname>
__meta_kubernetes_pod_annotation_<annotationname>
__meta_kubernetes_service_name
__meta_kubernetes_service_label_<labelname>
__meta_kubernetes_node_name
__meta_kubernetes_node_label_<labelname>
PromQL Query Language
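Before moving on to PromQL: the metadata labels listed above are the raw material that relabel_configs transform. A simplified Python sketch of what a `labelmap` action does (a rough approximation for illustration, not Prometheus's actual implementation):

```python
import re

def labelmap(labels: dict, pattern: str) -> dict:
    """Apply a labelmap-style relabel action: every label whose name matches
    the regex is copied to a new label named by the first capture group."""
    out = dict(labels)
    rx = re.compile(pattern)
    for name, value in labels.items():
        m = rx.fullmatch(name)  # Prometheus anchors relabel regexes
        if m:
            out[m.group(1)] = value
    return out

discovered = {
    "__meta_kubernetes_node_name": "node-1",
    "__meta_kubernetes_node_label_zone": "us-east-1a",
}
print(labelmap(discovered, r"__meta_kubernetes_node_label_(.+)"))
```

After the real scrape, labels still prefixed with `__` are dropped, so only the mapped `zone` label survives onto stored series.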
Basic Queries
1. Instant vector queries
promql
http_requests_total
http_requests_total{method="GET"}
http_requests_total{method!="GET"}
http_requests_total{method=~"GET|POST"}
http_requests_total{method!~"GET|POST"}
2. Range vector queries
promql
http_requests_total[5m]
http_requests_total{method="GET"}[1h]
http_requests_total[5m] offset 1h
Aggregation Operations
1. Sum
promql
sum(http_requests_total)
sum by (method) (http_requests_total)
sum without (instance) (http_requests_total)
2. Average
promql
avg(node_cpu_seconds_total)
avg by (mode) (node_cpu_seconds_total)
3. Max/Min
promql
max(node_memory_MemTotal_bytes)
min(node_memory_MemAvailable_bytes)
4. Count
promql
count(up)
count by (job) (up == 0)
Function Operations
1. Rate calculation
promql
rate(http_requests_total[5m])
irate(http_requests_total[5m])
increase(http_requests_total[1h])
2. Arithmetic
promql
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
3. Time functions
promql
time()
hour()
day_of_week()
days_in_month()
4. Deltas and derivatives
promql
delta(cpu_temp_celsius[1h])
deriv(cpu_temp_celsius[1h])
Advanced Query Examples
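The advanced queries below lean heavily on rate() over counters. A toy Python sketch of the core idea, using hypothetical samples (real Prometheus additionally handles counter resets and extrapolates to the window boundaries):

```python
def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns the per-second rate over the window, ignoring counter resets."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# A counter that grew from 100 to 160 over 60 seconds -> 1 request/second
samples = [(0, 100), (30, 130), (60, 160)]
print(simple_rate(samples))  # 1.0
```

increase() is simply this rate multiplied by the window length, which is why the two functions agree up to a constant factor.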
1. CPU utilization
promql
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
2. Memory utilization
promql
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
3. Disk utilization
promql
(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100
4. Network traffic
promql
rate(node_network_receive_bytes_total{device="eth0"}[5m])
rate(node_network_transmit_bytes_total{device="eth0"}[5m])
5. Container CPU usage
promql
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace)
6. Container memory usage
promql
sum(container_memory_working_set_bytes{container!=""}) by (pod, namespace)
7. HTTP request success rate
promql
sum(rate(http_requests_total{status=~"2.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
8. P95 latency
promql
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Alerting Rules
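Several of the rules that follow, like the P95 query above, rely on histogram_quantile. A simplified Python version showing the linear interpolation inside the target bucket (roughly what Prometheus does; edge cases such as the first and +Inf buckets are glossed over here):

```python
def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), sorted by bound,
    the last entry being the +Inf bucket. Returns the interpolated quantile."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # linearly interpolate inside the bucket containing the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # ~0.75
```

Because the result is interpolated from bucket boundaries, its accuracy depends entirely on how the bucket bounds were chosen when the histogram was instrumented.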
Alerting rule configuration
1. Node alerting rules
yaml
groups:
- name: node-alerts
rules:
- alert: NodeDown
expr: up{job="node-exporter"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Node {{ $labels.instance }} is down"
description: "Node {{ $labels.instance }} has been down for more than 5 minutes"
- alert: NodeHighCPU
expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value }}%"
- alert: NodeHighMemory
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is {{ $value }}%"
- alert: NodeDiskRunningLow
expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
for: 5m
labels:
severity: critical
annotations:
summary: "Disk space low on {{ $labels.instance }}"
description: "Only {{ $value }}% disk space available"
- alert: NodeHighLoad
expr: node_load15 / count(node_cpu_seconds_total{mode="idle"}) without (cpu, mode) > 2
for: 10m
labels:
severity: warning
annotations:
summary: "High load average on {{ $labels.instance }}"
description: "Load average is {{ $value }}"
2. Pod alerting rules
yaml
- name: pod-alerts
rules:
- alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 0
for: 5m
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.pod }} is crash looping"
description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has restarted {{ $value }} times"
- alert: PodNotReady
expr: kube_pod_status_phase{phase=~"Pending|Unknown"} > 0
for: 10m
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.pod }} is not ready"
description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is in {{ $labels.phase }} state"
- alert: PodHighCPU
expr: sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace) / sum(kube_pod_container_resource_limits{resource="cpu"}) by (pod, namespace) * 100 > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage for pod {{ $labels.pod }}"
description: "CPU usage is {{ $value }}%"
- alert: PodHighMemory
expr: sum(container_memory_working_set_bytes{container!=""}) by (pod, namespace) / sum(kube_pod_container_resource_limits{resource="memory"}) by (pod, namespace) * 100 > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage for pod {{ $labels.pod }}"
description: "Memory usage is {{ $value }}%"
- alert: ContainerOomKilled
expr: increase(container_oom_events_total[5m]) > 0
for: 1m
labels:
severity: critical
annotations:
summary: "Container OOM killed"
description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} was OOM killed"
3. Application alerting rules
yaml
- name: application-alerts
rules:
- alert: HighErrorRate
expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100 > 5
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value }}%"
- alert: HighLatency
expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High latency detected"
description: "P95 latency is {{ $value }}s"
- alert: LowThroughput
expr: sum(rate(http_requests_total[5m])) < 10
for: 5m
labels:
severity: warning
annotations:
summary: "Low throughput detected"
description: "Throughput is {{ $value }} requests/s"
- alert: ServiceDown
expr: up{job="app"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Service {{ $labels.instance }} is down"
description: "Service has been down for more than 5 minutes"
Alerting rule best practices
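The `for:` clause used throughout the rules above means an alert sits in the `pending` state until its expression has been continuously true for the given duration; only then does it become `firing`. A toy Python state machine illustrating that behavior (for intuition only, not Prometheus's implementation):

```python
def alert_state(evaluations, for_seconds, interval_seconds):
    """evaluations: list of booleans, one per evaluation cycle.
    Returns the final state: 'inactive', 'pending', or 'firing'."""
    active_for = 0
    state = "inactive"
    for expr_true in evaluations:
        if expr_true:
            if state == "inactive":
                state, active_for = "pending", 0
            else:
                active_for += interval_seconds
            if active_for >= for_seconds:
                state = "firing"
        else:
            # any false evaluation resets the pending timer completely
            state, active_for = "inactive", 0
    return state

# expr true for 4 cycles at 15s intervals with for: 60s -> still pending
print(alert_state([True, True, True, True], 60, 15))
```

This is why a flapping expression with a long `for:` may never fire: each dip resets the timer.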
1. Alert severity levels
yaml
severity:
- critical: requires immediate action; business impact
- warning: needs attention; may impact the business
- info: informational; no business impact
2. Alert naming conventions
yaml
Naming format: <Component><Problem><Severity>
Examples:
- NodeHighCPU
- PodCrashLooping
- ServiceDown
3. Alert content conventions
yaml
annotations:
summary: short description containing the key facts
description: detailed description with concrete values and scope of impact
runbook_url: link to the incident runbook
Hands-on Examples
Example 1: Full monitoring stack deployment
yaml
# complete-monitoring-stack.yaml
apiVersion: v1
kind: Namespace
metadata:
name: monitoring
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: prometheus
namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: prometheus
rules:
- apiGroups: [""]
resources:
- nodes
- nodes/proxy
- services
- endpoints
- pods
verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: prometheus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: prometheus
subjects:
- kind: ServiceAccount
name: prometheus
namespace: monitoring
---
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: monitoring
data:
prometheus.yml: |
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
serviceAccountName: prometheus
containers:
- name: prometheus
image: prom/prometheus:v2.45.0
args:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.path=/prometheus"
ports:
- containerPort: 9090
volumeMounts:
- name: config
mountPath: /etc/prometheus
- name: storage
mountPath: /prometheus
volumes:
- name: config
configMap:
name: prometheus-config
- name: storage
emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
name: prometheus
namespace: monitoring
spec:
type: NodePort
ports:
- port: 9090
targetPort: 9090
nodePort: 30090
selector:
app: prometheus
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: node-exporter
namespace: monitoring
spec:
selector:
matchLabels:
app: node-exporter
template:
metadata:
labels:
app: node-exporter
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9100"
spec:
hostNetwork: true
containers:
- name: node-exporter
image: prom/node-exporter:v1.6.0
args:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
ports:
- containerPort: 9100
volumeMounts:
- name: proc
mountPath: /host/proc
readOnly: true
- name: sys
mountPath: /host/sys
readOnly: true
volumes:
- name: proc
hostPath:
path: /proc
- name: sys
hostPath:
path: /sys
Deployment commands
bash
kubectl apply -f complete-monitoring-stack.yaml
kubectl get all -n monitoring
kubectl port-forward -n monitoring svc/prometheus 9090:9090
Example 2: Application monitoring configuration
yaml
# app-monitoring.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-app
namespace: default
spec:
replicas: 3
selector:
matchLabels:
app: web-app
template:
metadata:
labels:
app: web-app
version: v1.0.0
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
containers:
- name: web
image: nginx:latest # placeholder; swap in an image that serves /metrics on port 8080
ports:
- containerPort: 8080
name: http
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
name: web-app
namespace: default
labels:
app: web-app
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
spec:
selector:
app: web-app
ports:
- port: 80
targetPort: 8080
name: http
- port: 8080
targetPort: 8080
name: metrics
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: web-app
namespace: default
spec:
rules:
- host: web-app.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: web-app
port:
number: 80
Example 3: Custom metrics application
yaml
# custom-metrics-app.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: custom-metrics-app
namespace: monitoring
spec:
replicas: 2
selector:
matchLabels:
app: custom-metrics-app
template:
metadata:
labels:
app: custom-metrics-app
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
spec:
containers:
- name: app
image: python:3.9-slim
command:
- /bin/sh
- -c
- |
pip install prometheus-client
python <<EOF
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import random
import time
REQUEST_COUNT = Counter('app_request_count', 'Request Count', ['method', 'endpoint'])
REQUEST_LATENCY = Histogram('app_request_latency_seconds', 'Request Latency')
ACTIVE_CONNECTIONS = Gauge('app_active_connections', 'Active Connections')
def simulate():
REQUEST_COUNT.labels(method='GET', endpoint='/api').inc()
REQUEST_LATENCY.observe(random.uniform(0.1, 0.5))
ACTIVE_CONNECTIONS.set(random.randint(1, 100))
start_http_server(8080)
while True:
simulate()
time.sleep(1)
EOF
ports:
- containerPort: 8080
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 100m
memory: 128Mi
---
apiVersion: v1
kind: Service
metadata:
name: custom-metrics-app
namespace: monitoring
spec:
selector:
app: custom-metrics-app
ports:
- port: 8080
targetPort: 8080
kubectl Commands
Prometheus resource management
bash
kubectl get all -n monitoring
kubectl get pods -n monitoring -l app=prometheus
kubectl logs -n monitoring -l app=prometheus -f
kubectl describe pod -n monitoring -l app=prometheus
kubectl exec -it -n monitoring <prometheus-pod> -- sh
kubectl port-forward -n monitoring svc/prometheus 9090:9090
Configuration management
bash
kubectl get configmap -n monitoring
kubectl describe configmap prometheus-config -n monitoring
kubectl edit configmap prometheus-config -n monitoring
kubectl apply -f prometheus-config.yaml
kubectl rollout restart deployment/prometheus -n monitoring
Service discovery debugging
bash
kubectl get endpoints -n monitoring
kubectl get servicemonitor -n monitoring
kubectl get podmonitors -n monitoring
kubectl exec -n monitoring <prometheus-pod> -- wget -qO- http://localhost:9090/api/v1/targets
kubectl exec -n monitoring <prometheus-pod> -- wget -qO- http://localhost:9090/api/v1/targets | jq .
Metrics queries
bash
kubectl exec -n monitoring <prometheus-pod> -- wget -qO- 'http://localhost:9090/api/v1/query?query=up'
kubectl exec -n monitoring <prometheus-pod> -- wget -qO- 'http://localhost:9090/api/v1/query?query=node_cpu_seconds_total'
kubectl exec -n monitoring <prometheus-pod> -- wget -qO- 'http://localhost:9090/api/v1/query_range?query=up&start=2024-01-01T00:00:00Z&end=2024-01-01T01:00:00Z&step=60s'
Alert management
bash
kubectl exec -n monitoring <prometheus-pod> -- wget -qO- http://localhost:9090/api/v1/rules
kubectl exec -n monitoring <prometheus-pod> -- wget -qO- http://localhost:9090/api/v1/alerts
kubectl exec -n monitoring <prometheus-pod> -- wget -qO- http://localhost:9090/api/v1/alertmanagers
Troubleshooting Guide
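When troubleshooting, the /api/v1/targets endpoint queried above tells you why a target is down. A sketch that filters a canned response for unhealthy targets; the payload below is a hypothetical example shaped like the real API response:

```python
import json

# Hypothetical /api/v1/targets response, trimmed to the fields we need.
response = json.loads("""
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"labels": {"job": "node-exporter", "instance": "10.0.0.1:9100"},
       "health": "up", "lastError": ""},
      {"labels": {"job": "node-exporter", "instance": "10.0.0.2:9100"},
       "health": "down", "lastError": "context deadline exceeded"}
    ]
  }
}
""")

# Collect every target that is not healthy, with the scrape error message.
down = [(t["labels"]["instance"], t["lastError"])
        for t in response["data"]["activeTargets"]
        if t["health"] != "up"]
for instance, err in down:
    print(f"{instance}: {err}")
```

In practice you would pipe the `wget` output from the commands above into a script like this, or simply into `jq`.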
Problem 1: Prometheus fails to start
Symptoms
bash
kubectl get pods -n monitoring -l app=prometheus
NAME READY STATUS RESTARTS AGE
prometheus-xxx 0/1 CrashLoopBackOff 5 10m
Diagnosis steps
bash
kubectl logs -n monitoring -l app=prometheus
kubectl describe pod -n monitoring -l app=prometheus
kubectl get events -n monitoring --sort-by='.lastTimestamp'
Solutions
- Check the configuration file syntax
- Verify RBAC permissions
- Check volume mounts
- Review resource limits
Problem 2: Metrics are not collected
Symptoms
- Targets show as down in the Prometheus UI
- Metric data is missing
Diagnosis steps
bash
kubectl exec -n monitoring <prometheus-pod> -- wget -qO- http://localhost:9090/api/v1/targets
kubectl get pods -n monitoring -o wide
kubectl logs -n monitoring <node-exporter-pod>
kubectl exec -n monitoring <prometheus-pod> -- wget -qO- http://<target-ip>:9100/metrics
Solutions
- Check network policies
- Verify the service discovery configuration
- Confirm the target port is correct
- Check labels and annotations
Problem 3: Storage issues
Symptoms
bash
kubectl logs -n monitoring <prometheus-pod>
level=error ts=2024-01-15T10:30:00Z caller=db.go:123 component=TSDB msg="compaction failed" err="no space left on device"
Diagnosis steps
bash
kubectl exec -n monitoring <prometheus-pod> -- df -h
kubectl exec -n monitoring <prometheus-pod> -- du -sh /prometheus/*
kubectl get pvc -n monitoring
Solution
yaml
spec:
containers:
- name: prometheus
args:
- "--storage.tsdb.retention.time=7d"
- "--storage.tsdb.retention.size=5GB"
Problem 4: Slow queries
Symptoms
- The Prometheus UI responds slowly
- Queries time out
Diagnosis steps
bash
kubectl top pods -n monitoring
kubectl exec -n monitoring <prometheus-pod> -- wget -qO- http://localhost:9090/api/v1/status/tsdb
kubectl logs -n monitoring <prometheus-pod> | grep "slow query"
Solutions
- Increase resource limits
- Optimize query expressions
- Reduce the data retention period
- Use recording rules
Problem 5: Alerts not firing
Symptoms
- Alert rules exist but never fire
- AlertManager receives no alerts
Diagnosis steps
bash
kubectl exec -n monitoring <prometheus-pod> -- wget -qO- http://localhost:9090/api/v1/rules
kubectl exec -n monitoring <prometheus-pod> -- wget -qO- http://localhost:9090/api/v1/alerts
kubectl logs -n monitoring <prometheus-pod> | grep alert
kubectl describe configmap prometheus-config -n monitoring
Solutions
- Check the alert rule syntax
- Verify the AlertManager configuration
- Confirm the alert condition is actually met
- Check the alert routing configuration
Best Practices
1. Configuration best practices
Set sensible scrape intervals
yaml
global:
scrape_interval: 15s
evaluation_interval: 15s
Use external labels
yaml
global:
external_labels:
cluster: 'production'
environment: 'prod'
region: 'us-east-1'
Configure data retention
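Retention settings like the flags below translate directly into disk usage: ingestion rate ≈ active series / scrape interval, and Prometheus stores roughly 1–2 bytes per sample after compression. A back-of-the-envelope sketch (the 2 bytes/sample figure is a common rule of thumb, not a guarantee):

```python
def estimate_disk_gb(active_series, scrape_interval_s, retention_days,
                     bytes_per_sample=2):
    """Rough TSDB disk estimate: samples ingested over the retention window
    times an assumed compressed size per sample."""
    samples_per_second = active_series / scrape_interval_s
    total_samples = samples_per_second * retention_days * 24 * 3600
    return total_samples * bytes_per_sample / 1e9

# 100k series scraped every 15s, kept for 15 days -> roughly 17 GB
print(round(estimate_disk_gb(100_000, 15, 15), 1))
```

Setting both retention.time and retention.size gives a hard ceiling even if the estimate turns out low.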
yaml
args:
- "--storage.tsdb.retention.time=15d"
- "--storage.tsdb.retention.size=10GB"
2. Query best practices
Use recording rules
yaml
groups:
- name: recording-rules
rules:
- record: job:http_requests:rate5m
expr: sum by (job) (rate(http_requests_total[5m]))
- record: job:http_requests:rate1h
expr: sum by (job) (rate(http_requests_total[1h]))
Avoid high-cardinality labels
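Why high-cardinality labels hurt: the number of time series for a metric is the product of each label's distinct value counts, so one unbounded label multiplies everything else. A quick illustration with made-up cardinalities:

```python
from math import prod

def series_count(label_cardinalities: dict) -> int:
    """Worst-case series count for one metric: the product of the number
    of distinct values of every label."""
    return prod(label_cardinalities.values())

ok = {"method": 4, "status": 5, "instance": 20}   # 400 series
bad = {**ok, "user_id": 100_000}                  # 40,000,000 series
print(series_count(ok), series_count(bad))
```

Each series costs memory and index space in the TSDB, which is why the lists below steer you toward bounded labels.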
yaml
Avoid:
- user_id
- session_id
- request_id
Prefer:
- job
- instance
- method
- status
3. Alerting best practices
Severity levels
yaml
labels:
severity: critical
labels:
severity: warning
labels:
severity: info
Set reasonable durations
yaml
for: 5m
for: 10m
for: 1h
Provide detailed alert information
yaml
annotations:
summary: "short description"
description: "detailed description"
runbook_url: "https://wiki.example.com/runbook/alert-name"
4. Performance best practices
Resource configuration
yaml
resources:
requests:
cpu: 1000m
memory: 2Gi
limits:
cpu: 2000m
memory: 4Gi
Storage optimization
yaml
volumeClaimTemplates:
- metadata:
name: prometheus-storage
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: fast-ssd
resources:
requests:
storage: 50Gi
Query optimization
yaml
Pre-compute with recording rules
Avoid queries over very large time ranges
Use aggregation functions judiciously
5. Security best practices
RBAC access control
yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: prometheus
rules:
- apiGroups: [""]
resources: ["pods", "nodes", "services", "endpoints"]
verbs: ["get", "list", "watch"]
Network policy
yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: prometheus-network-policy
namespace: monitoring
spec:
podSelector:
matchLabels:
app: prometheus
policyTypes:
- Egress
egress:
- to:
- namespaceSelector: {}
TLS configuration
yaml
# web config file, passed to Prometheus via --web.config.file
tls_server_config:
cert_file: /etc/tls/cert.pem
key_file: /etc/tls/key.pem
Summary
This chapter covered the core concepts and practices of the Prometheus monitoring system:
- Architecture: the core components and the data model
- Deployment and configuration: multiple deployment options and configuration management
- Metrics collection: service discovery mechanisms and metric exporters
- PromQL: the query language and common query patterns
- Alerting rules: configuring and managing alert rules
- Hands-on practice: complete monitoring setups through real examples
- Troubleshooting: diagnosing and resolving common problems
Prometheus is the cornerstone of Kubernetes monitoring and lays the groundwork for Grafana visualization and alert management.
Next Steps
- Grafana visualization - designing Grafana dashboards
- Log management - collecting and analyzing Kubernetes logs
- Alert management - configuring the AlertManager alerting system
- Metrics fundamentals - reviewing the basics of the Kubernetes monitoring stack