Log Management

Overview

Log management is a central part of operating Kubernetes. With effective log collection, storage, and analysis you can locate problems quickly, monitor system state, and optimize application performance. This chapter walks through the Kubernetes logging stack, covering best practices for log collection, log analysis, and log storage.

Kubernetes Logging Architecture

Log Types

1. Container Logs

Standard output (stdout) and standard error (stderr) from containers, managed by the container runtime.

2. Application Logs

Log files produced by the application itself, usually written to the container's filesystem.

3. System Logs

Logs from Kubernetes components such as the kubelet and kube-proxy.

4. Audit Logs

Kubernetes API Server audit logs, which record API operations.

Logging Architecture Diagram

┌─────────────────────────────────────────────────────────┐
│                 Application Container                   │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐               │
│  │ stdout   │  │ stderr   │  │ log file │               │
│  └──────────┘  └──────────┘  └──────────┘               │
└─────────────────────────────────────────────────────────┘
         ↓                ↓                ↓
┌─────────────────────────────────────────────────────────┐
│           Log Collection Agent (DaemonSet)              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐               │
│  │ Fluentd  │  │ Fluentbit│  │ Filebeat │               │
│  └──────────┘  └──────────┘  └──────────┘               │
└─────────────────────────────────────────────────────────┘
         ↓                ↓                ↓
┌─────────────────────────────────────────────────────────┐
│                  Log Storage System                     │
│  ┌──────────┐  ┌─────────────┐  ┌──────────┐            │
│  │  Loki    │  │Elasticsearch│  │  Kafka   │            │
│  └──────────┘  └─────────────┘  └──────────┘            │
└─────────────────────────────────────────────────────────┘
         ↓                ↓                ↓
┌─────────────────────────────────────────────────────────┐
│                 Log Analysis Tools                      │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐               │
│  │ Grafana  │  │  Kibana  │  │   Loki   │               │
│  │ (Loki)   │  │          │  │  Query   │               │
│  └──────────┘  └──────────┘  └──────────┘               │
└─────────────────────────────────────────────────────────┘

Log Collection Solutions

Option 1: Fluentd + Elasticsearch

1. Deploy Elasticsearch

yaml
# elasticsearch.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: logging
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
  namespace: logging
  labels:
    app: elasticsearch
spec:
  serviceName: elasticsearch
  replicas: 1
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      containers:
      - name: elasticsearch
        image: docker.elastic.co/elasticsearch/elasticsearch:8.8.0
        ports:
        - containerPort: 9200
          name: rest
        - containerPort: 9300
          name: inter-node
        env:
        - name: cluster.name
          value: k8s-logs
        - name: node.name
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: discovery.type
          value: single-node
        - name: ES_JAVA_OPTS
          value: "-Xms512m -Xmx512m"
        - name: xpack.security.enabled
          value: "false"
        volumeMounts:
        - name: data
          mountPath: /usr/share/elasticsearch/data
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
          limits:
            cpu: 1000m
            memory: 2Gi
      volumes:
      - name: data
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
  namespace: logging
spec:
  selector:
    app: elasticsearch
  ports:
  - port: 9200
    name: rest
  - port: 9300
    name: inter-node

2. Deploy Fluentd

yaml
# fluentd.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluentd
rules:
- apiGroups: [""]
  resources:
  - pods
  - namespaces
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluentd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluentd
subjects:
- kind: ServiceAccount
  name: fluentd
  namespace: logging
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
  namespace: logging
data:
  fluent.conf: |
    <source>
      @type tail
      @id in_tail_container_logs
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        # assumes the Docker JSON log driver; containerd (CRI) log lines need a CRI parser instead
        @type json
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      </parse>
    </source>
    
    <filter kubernetes.**>
      @type kubernetes_metadata
      @id filter_kube_metadata
      kubernetes_url "#{ENV['KUBERNETES_SERVICE_HOST']}:#{ENV['KUBERNETES_SERVICE_PORT']}"
    </filter>
    
    <match kubernetes.**>
      @type elasticsearch
      @id out_es
      @log_level info
      include_tag_key true
      host elasticsearch
      port 9200
      logstash_format true
      logstash_prefix k8s-logs
      <buffer>
        @type file
        path /var/log/fluentd/buffer/kubernetes
        flush_mode interval
        retry_type exponential_backoff
        flush_thread_count 2
        flush_interval 5s
        retry_forever
        retry_max_interval 30
        chunk_limit_size 2M
        queue_limit_length 8
        overflow_action block
      </buffer>
    </match>
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
  labels:
    app: fluentd
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      serviceAccountName: fluentd
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      - key: node-role.kubernetes.io/control-plane
        effect: NoSchedule
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        - name: FLUENT_ELASTICSEARCH_SCHEME
          value: "http"
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        # not read-only: fluent.conf writes its pos_file under /var/log
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: config
          mountPath: /fluentd/etc/fluent.conf
          subPath: fluent.conf
      terminationGracePeriodSeconds: 30
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: config
        configMap:
          name: fluentd-config

3. Deploy Kibana

yaml
# kibana.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kibana
  namespace: logging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kibana
  template:
    metadata:
      labels:
        app: kibana
    spec:
      containers:
      - name: kibana
        image: docker.elastic.co/kibana/kibana:8.8.0
        ports:
        - containerPort: 5601
        env:
        - name: ELASTICSEARCH_HOSTS
          value: http://elasticsearch:9200
        - name: SERVER_NAME
          value: kibana
        - name: SERVER_HOST
          value: "0.0.0.0"
        resources:
          requests:
            cpu: 100m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  name: kibana
  namespace: logging
spec:
  type: NodePort
  ports:
  - port: 5601
    targetPort: 5601
    nodePort: 30601
  selector:
    app: kibana

Deployment Commands

bash
# Deploy the stack
kubectl apply -f elasticsearch.yaml
kubectl apply -f fluentd.yaml
kubectl apply -f kibana.yaml

# Verify that all components are running
kubectl get pods -n logging
kubectl get svc -n logging

# Access Kibana locally (alternative to the NodePort)
kubectl port-forward -n logging svc/kibana 5601:5601

Option 2: Promtail + Loki

1. Deploy Loki

yaml
# loki.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-config
  namespace: logging
data:
  loki.yaml: |
    auth_enabled: false
    ingester:
      chunk_idle_period: 3m
      chunk_block_size: 262144
      chunk_retain_period: 1m
      max_transfer_retries: 0
      lifecycler:
        ring:
          kvstore:
            store: inmemory
          replication_factor: 1
    limits_config:
      enforce_metric_name: false
      reject_old_samples: true
      reject_old_samples_max_age: 168h
    schema_config:
      configs:
      - from: 2020-10-24
        store: boltdb-shipper
        object_store: filesystem
        schema: v11
        index:
          prefix: index_
          period: 24h
    storage_config:
      boltdb_shipper:
        active_index_directory: /tmp/loki/boltdb-shipper-active
        cache_location: /tmp/loki/boltdb-shipper-cache
        cache_ttl: 24h
        shared_store: filesystem
      filesystem:
        directory: /tmp/loki/chunks
    compactor:
      working_directory: /tmp/loki/boltdb-shipper-compactor
      shared_store: filesystem
    server:
      http_listen_port: 3100
    chunk_store_config:
      max_look_back_period: 0s
    table_manager:
      retention_deletes_enabled: false
      retention_period: 0s
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: loki
  namespace: logging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: loki
  template:
    metadata:
      labels:
        app: loki
    spec:
      containers:
      - name: loki
        image: grafana/loki:2.8.0
        args:
        - -config.file=/etc/loki/loki.yaml
        ports:
        - containerPort: 3100
          name: http
        volumeMounts:
        - name: config
          mountPath: /etc/loki
        - name: storage
          mountPath: /tmp/loki
        resources:
          requests:
            cpu: 100m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
      volumes:
      - name: config
        configMap:
          name: loki-config
      - name: storage
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: loki
  namespace: logging
spec:
  ports:
  - port: 3100
    targetPort: http
    name: http
  selector:
    app: loki

2. Deploy Promtail

yaml
# promtail.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
  namespace: logging
data:
  promtail.yaml: |
    server:
      http_listen_port: 9080
      grpc_listen_port: 0
    
    positions:
      filename: /tmp/positions.yaml
    
    clients:
      - url: http://loki:3100/loki/api/v1/push
    
    scrape_configs:
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
        - role: pod
      pipeline_stages:
        # use `cri: {}` instead on containerd-based clusters
        - docker: {}
        - match:
            selector: '{app="nginx"}'
            stages:
              - regex:
                  expression: '^(?P<remote_addr>[\d\.]+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] "(?P<request>[^"]+)" (?P<status>\d+) (?P<body_bytes_sent>\d+)'
              - labels:
                  remote_addr:
                  status:
      relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_label_app]
          target_label: app
        - source_labels: [__meta_kubernetes_namespace]
          target_label: namespace
        - source_labels: [__meta_kubernetes_pod_name]
          target_label: pod
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: promtail
  namespace: logging
spec:
  selector:
    matchLabels:
      app: promtail
  template:
    metadata:
      labels:
        app: promtail
    spec:
      containers:
      - name: promtail
        image: grafana/promtail:2.8.0
        args:
        - -config.file=/etc/promtail/promtail.yaml
        env:
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        volumeMounts:
        - name: config
          mountPath: /etc/promtail
        - name: varlog
          mountPath: /var/log
          readOnly: true
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        resources:
          requests:
            cpu: 50m
            memory: 64Mi
          limits:
            cpu: 100m
            memory: 128Mi
      volumes:
      - name: config
        configMap:
          name: promtail-config
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers

Deployment Commands

bash
# Deploy Loki and Promtail
kubectl apply -f loki.yaml
kubectl apply -f promtail.yaml

# Verify that all components are running
kubectl get pods -n logging
kubectl get svc -n logging

# Access the Loki API locally
kubectl port-forward -n logging svc/loki 3100:3100

Log Analysis in Practice

Example 1: Application Log Collection Configuration

yaml
# app-with-logging.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-logging-demo
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-logging
  template:
    metadata:
      labels:
        app: nginx-logging
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9113"
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80
        volumeMounts:
        - name: nginx-config
          mountPath: /etc/nginx/nginx.conf
          subPath: nginx.conf
        - name: log-volume
          mountPath: /var/log/nginx
      - name: log-exporter
        image: prom/nginxlog-exporter:latest
        args:
        - -listen-address=:9113
        - -nginx-log-path=/var/log/nginx/access.log
        volumeMounts:
        - name: log-volume
          mountPath: /var/log/nginx
          readOnly: true
      volumes:
      - name: nginx-config
        configMap:
          name: nginx-config
      - name: log-volume
        emptyDir: {}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
  namespace: default
data:
  nginx.conf: |
    user  nginx;
    worker_processes  1;
    error_log  /var/log/nginx/error.log warn;
    pid        /var/run/nginx.pid;
    
    events {
        worker_connections  1024;
    }
    
    http {
        include       /etc/nginx/mime.types;
        default_type  application/octet-stream;
        
        log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                          '$status $body_bytes_sent "$http_referer" '
                          '"$http_user_agent" "$http_x_forwarded_for"';
        
        access_log  /var/log/nginx/access.log  main;
        
        sendfile        on;
        keepalive_timeout  65;
        
        server {
            listen       80;
            server_name  localhost;
            
            location / {
                root   /usr/share/nginx/html;
                index  index.html index.htm;
            }
        }
    }
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-logging
  namespace: default
spec:
  selector:
    app: nginx-logging
  ports:
  - port: 80
    targetPort: 80
    name: http
  - port: 9113
    targetPort: 9113
    name: metrics

Example 2: Log Aggregation and Analysis

yaml
# log-aggregation.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: log-aggregator
  namespace: logging
data:
  aggregate.py: |
    import json
    import sys
    from collections import Counter
    from datetime import datetime
    
    def parse_log_line(line):
        try:
            return json.loads(line)
        except json.JSONDecodeError:
            return None
    
    def aggregate_logs(log_file):
        error_counter = Counter()
        status_counter = Counter()
        time_series = {}
        
        with open(log_file, 'r') as f:
            for line in f:
                log = parse_log_line(line)
                if not log:
                    continue
                
                if 'status' in log:
                    status_counter[log['status']] += 1
                
                if 'level' in log:
                    error_counter[log['level']] += 1
                
                if 'timestamp' in log:
                    # fromisoformat in Python 3.9 does not accept a trailing 'Z'
                    ts = log['timestamp'].replace('Z', '+00:00')
                    hour = datetime.fromisoformat(ts).hour
                    time_series[hour] = time_series.get(hour, 0) + 1
        
        print("Status Code Distribution:")
        for status, count in status_counter.most_common():
            print(f"  {status}: {count}")
        
        print("\nLog Level Distribution:")
        for level, count in error_counter.most_common():
            print(f"  {level}: {count}")
        
        print("\nRequests per Hour:")
        for hour in sorted(time_series.keys()):
            print(f"  {hour}:00 - {time_series[hour]} requests")
    
    if __name__ == '__main__':
        aggregate_logs(sys.argv[1])
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: log-aggregator
  namespace: logging
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: aggregator
            image: python:3.9-slim
            command:
            - python
            - /scripts/aggregate.py
            - /var/log/nginx/access.log
            volumeMounts:
            - name: script
              mountPath: /scripts
            - name: logs
              mountPath: /var/log/nginx
              readOnly: true
          volumes:
          - name: script
            configMap:
              name: log-aggregator
          - name: logs
            # placeholder: in practice, mount the volume that actually holds the nginx logs
            emptyDir: {}
          restartPolicy: OnFailure

Example 3: Log Alerting Rules

yaml
# log-alerts.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: log-alert-rules
  namespace: logging
data:
  alerts.yaml: |
    groups:
    - name: log-alerts
      rules:
      - alert: HighErrorRate
        expr: |
          sum(rate({app="nginx"} |= "error" [5m])) 
          / 
          sum(rate({app="nginx"} [5m])) * 100 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }}%"
      
      - alert: LogVolumeHigh
        expr: |
          sum(rate({namespace="default"} [5m])) > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High log volume detected"
          description: "Log volume is {{ $value }} logs/s"
      
      - alert: ApplicationCrash
        expr: |
          count_over_time({app="nginx"} |= "panic" [5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Application crash detected"
          description: "Found panic in logs"

kubectl Log Commands

Basic Log Viewing

bash
# View logs for a pod
kubectl logs <pod-name>

# View logs in a specific namespace
kubectl logs <pod-name> -n <namespace>

# View logs for a specific container in the pod
kubectl logs <pod-name> -c <container-name>

# View logs of the previous (crashed or restarted) container instance
kubectl logs <pod-name> --previous

# Show only the last 100 lines
kubectl logs <pod-name> --tail=100

# Show logs from the last hour
kubectl logs <pod-name> --since=1h

# Prefix each line with its timestamp
kubectl logs <pod-name> --timestamps

# Stream (follow) logs
kubectl logs -f <pod-name>

Multi-Container Logs

bash
# Logs from all containers in the pod
kubectl logs <pod-name> --all-containers

# kubectl accepts only one -c flag per invocation
kubectl logs <pod-name> -c container1
kubectl logs <pod-name> -c container2

# Raise the parallel-request limit when using --all-containers or a selector
kubectl logs <pod-name> --all-containers --max-log-requests=5

Log Filtering

bash
# Lines containing ERROR
kubectl logs <pod-name> | grep ERROR

# Lines containing ERROR or WARN
kubectl logs <pod-name> | grep -E "ERROR|WARN"

# The same filter with awk
kubectl logs <pod-name> | awk '/ERROR/ {print}'

# Print the range of lines between two dates
kubectl logs <pod-name> | sed -n '/2024-01-15/,/2024-01-16/p'

Exporting Logs

bash
# Save logs to a file
kubectl logs <pod-name> > pod.log

# Save the last 24 hours of logs
kubectl logs <pod-name> --since=24h > pod-24h.log

# Save logs from a pod in a specific namespace
kubectl logs <pod-name> -n <namespace> > namespace-pod.log

Log Analysis

bash
# Count total log lines
kubectl logs <pod-name> | wc -l

# Count error lines
kubectl logs <pod-name> | grep ERROR | wc -l

# Frequency of the first field (e.g. client IP), most common first
kubectl logs <pod-name> | awk '{print $1}' | sort | uniq -c | sort -nr

# Distribution of HTTP status codes in JSON logs
kubectl logs <pod-name> | grep -oP '"status":\K\d+' | sort | uniq -c
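
Once a pipeline outgrows one-liners, the same analysis is easier to maintain as a script. A minimal Python sketch of the status-code count above (it assumes JSON-style log lines with a numeric "status" field, as in the grep example):

```python
import re
from collections import Counter

def status_distribution(lines):
    """Count HTTP status codes in JSON-style log lines,
    mirroring: grep -oP '"status":\K\d+' | sort | uniq -c"""
    pattern = re.compile(r'"status":\s*(\d+)')
    counts = Counter()
    for line in lines:
        m = pattern.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

logs = [
    '{"status": 200, "path": "/"}',
    '{"status": 200, "path": "/health"}',
    '{"status": 500, "path": "/api"}',
]
print(status_distribution(logs))  # Counter({'200': 2, '500': 1})
```

Feed it a saved log file (e.g. the `pod.log` exported above) with `status_distribution(open("pod.log"))`.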

Troubleshooting Guide

Problem 1: Log Collection Fails

Symptoms

bash
kubectl logs -n logging -l app=fluentd
[error]: [out_es] failed to flush the buffer

Diagnostic Steps

bash
# Check Fluentd pod status
kubectl get pods -n logging -l app=fluentd

# Inspect Fluentd logs for errors
kubectl logs -n logging <fluentd-pod>

# Check events and resource state
kubectl describe pod -n logging <fluentd-pod>

# Verify the container log files are visible inside the pod
kubectl exec -n logging <fluentd-pod> -- ls -la /var/log/containers

# Check the tail position file
kubectl exec -n logging <fluentd-pod> -- cat /var/log/fluentd-containers.log.pos

Solutions

  • Check the Fluentd configuration file
  • Verify connectivity to Elasticsearch
  • Check filesystem permissions
  • Review resource limits

Problem 2: Log Loss

Symptoms

  • Some logs are not collected
  • Logs arrive out of order

Diagnostic Steps

bash
# Check how far Fluentd has read each file
kubectl exec -n logging <fluentd-pod> -- cat /var/log/fluentd-containers.log.pos

# List the container log files on the node
kubectl exec -n logging <fluentd-pod> -- ls -la /var/log/containers

# Compare with the line count reported by kubectl
kubectl logs <app-pod> | wc -l

# Inspect buffered chunks that have not been flushed yet
kubectl exec -n logging <fluentd-pod> -- ls -la /var/log/fluentd/buffer/kubernetes

Solutions

  • Increase the Fluentd buffer size
  • Tune the flush interval
  • Check disk space
  • Validate the log format

Problem 3: Elasticsearch Performance Issues

Symptoms

bash
kubectl logs -n logging <elasticsearch-pod>
[o.e.m.j.JvmGcMonitorService] [node-1] [gc][old] allocation, failure

Diagnostic Steps

bash
# Check CPU and memory usage
kubectl top pods -n logging

# List indices and their sizes
kubectl exec -n logging <elasticsearch-pod> -- curl -X GET "localhost:9200/_cat/indices?v"

# Check cluster health
kubectl exec -n logging <elasticsearch-pod> -- curl -X GET "localhost:9200/_cluster/health?pretty"

# Check events and resource limits
kubectl describe pod -n logging <elasticsearch-pod>

Solutions

Increase the JVM heap and container resources; keep -Xms and -Xmx equal, and at no more than half the container memory limit:

yaml
env:
- name: ES_JAVA_OPTS
  value: "-Xms2g -Xmx2g"
resources:
  requests:
    cpu: 1000m
    memory: 4Gi
  limits:
    cpu: 2000m
    memory: 8Gi

Problem 4: Slow Log Queries

Symptoms

  • Kibana queries time out
  • Loki queries are slow

Diagnostic Steps

bash
# Check resource usage
kubectl top pods -n logging

# Check disk usage in the Loki pod
kubectl exec -n logging <loki-pod> -- df -h

# Look for slow-query log entries
kubectl logs -n logging <loki-pod> | grep slow

# Inspect Loki's internal metrics
kubectl exec -n logging <loki-pod> -- curl http://localhost:3100/metrics

Solutions

  • Optimize the index configuration
  • Add caching
  • Narrow the query time range
  • Filter by labels first

Problem 5: Out of Storage Space

Symptoms

bash
kubectl logs -n logging <elasticsearch-pod>
[error] no space left on device

Diagnostic Steps

bash
# Check PVC status and capacity
kubectl get pvc -n logging

kubectl describe pvc -n logging

# Check disk usage inside the pod
kubectl exec -n logging <elasticsearch-pod> -- df -h

# List index sizes and document counts
kubectl exec -n logging <elasticsearch-pod> -- curl -X GET "localhost:9200/_cat/indices?v&h=index,store.size,docs.count"

Solutions

  • Configure index lifecycle management (ILM)
  • Delete old indices
  • Expand storage capacity
  • Configure log rotation

Best Practices

1. Log Format Best Practices

Structured Logging

json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "INFO",
  "message": "Request processed successfully",
  "service": "user-service",
  "trace_id": "abc123",
  "user_id": "user001",
  "duration_ms": 150,
  "status": 200
}
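
One common way to produce such records is a small JSON formatter on top of the standard logging module. A minimal sketch (the field names match the example above; the hard-coded "user-service" value is illustrative):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": "user-service",
        }
        # merge structured fields passed via `extra={"fields": {...}}`
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

logger = logging.getLogger("user-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Request processed successfully",
            extra={"fields": {"status": 200, "duration_ms": 150}})
```

Because the output is one JSON object per line on stdout/stderr, any of the collectors above can parse it without extra regex configuration.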

Log Level Conventions

yaml
Log levels:
  - DEBUG: detailed debugging information
  - INFO: routine operational messages
  - WARN: warnings that may need attention
  - ERROR: errors affecting a single request or operation
  - FATAL: fatal errors that terminate the service

2. Log Collection Best Practices

Resource Configuration

yaml
resources:
  requests:
    cpu: 100m
    memory: 200Mi
  limits:
    cpu: 500m
    memory: 500Mi

Buffer Configuration

yaml
<buffer>
  @type file
  path /var/log/fluentd/buffer
  flush_mode interval
  flush_interval 5s
  retry_type exponential_backoff
  retry_max_interval 30
  chunk_limit_size 2M
  queue_limit_length 8
</buffer>

3. Log Storage Best Practices

Index Lifecycle Management

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ilm-policy
  namespace: logging
data:
  ilm-policy.json: |
    {
      "policy": {
        "phases": {
          "hot": {
            "min_age": "0ms",
            "actions": {
              "rollover": {
                "max_size": "50GB",
                "max_age": "1d"
              }
            }
          },
          "warm": {
            "min_age": "7d",
            "actions": {
              "forcemerge": {
                "max_num_segments": 1
              },
              "shrink": {
                "number_of_shards": 1
              }
            }
          },
          "cold": {
            "min_age": "30d",
            "actions": {
              "freeze": {}
            }
          },
          "delete": {
            "min_age": "90d",
            "actions": {
              "delete": {}
            }
          }
        }
      }
    }

Data Retention Policy

yaml
Retention policy:
  - Hot data: 7 days (SSD storage)
  - Warm data: 30 days (HDD storage)
  - Cold data: 90 days (archive storage)
  - Delete: after 90 days
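
The tiering rule above is easy to check mechanically, e.g. when auditing whether indices have migrated to the expected tier. A small sketch of the policy table (thresholds copied from the list above):

```python
def retention_tier(age_days: int) -> str:
    """Map an index's age to the storage tier defined by the retention policy."""
    if age_days > 90:
        return "delete"
    if age_days > 30:
        return "cold"   # archive storage
    if age_days > 7:
        return "warm"   # HDD storage
    return "hot"        # SSD storage

for age in (1, 10, 45, 120):
    print(age, retention_tier(age))
```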

4. Log Query Best Practices

Loki Query Examples

promql
# Lines containing "error"
{app="nginx"} |= "error"

# Regex match on error or warn
{namespace="default"} |~ "error|warn"

# Parse JSON and filter on an extracted field
{app="nginx"} | json | status >= 500

# Hourly counts grouped by status code
sum by (status) (count_over_time({app="nginx"} | json [1h]))

Elasticsearch Query Example

json
{
  "query": {
    "bool": {
      "must": [
        {"match": {"level": "ERROR"}},
        {"range": {"timestamp": {"gte": "now-1h"}}}
      ]
    }
  }
}
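
The same query can be built and submitted from a script, e.g. for periodic checks. A sketch using only the standard library (the host and index names are placeholders, not values from this chapter's manifests):

```python
import json
import urllib.request

def build_error_query(level="ERROR", since="now-1h"):
    """Bool query: match a log level AND restrict to a recent time window."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"level": level}},
                    {"range": {"timestamp": {"gte": since}}},
                ]
            }
        }
    }

def search(host, index, query):
    """POST the query to Elasticsearch's _search endpoint."""
    req = urllib.request.Request(
        f"http://{host}:9200/{index}/_search",
        data=json.dumps(query).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires a reachable cluster):
# hits = search("elasticsearch", "k8s-logs-*", build_error_query())
```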

5. Security Best Practices

Log Redaction

yaml
<filter kubernetes.**>
  @type record_transformer
  enable_ruby true
  <record>
    message ${record["message"].gsub(/password=\S+/, 'password=***')}
  </record>
</filter>
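
Redaction rules are worth unit-testing outside the collector; Ruby's gsub in the filter above corresponds to re.sub in Python. A sketch of the same pattern, plus one extra pattern for bearer tokens (the token pattern is an addition, not part of the Fluentd config above):

```python
import re

REDACTIONS = [
    (re.compile(r"password=\S+"), "password=***"),  # same rule as the Fluentd gsub
    (re.compile(r"Bearer \S+"), "Bearer ***"),      # extra: mask auth tokens
]

def redact(message: str) -> str:
    """Apply every redaction pattern to a log message."""
    for pattern, replacement in REDACTIONS:
        message = pattern.sub(replacement, message)
    return message

print(redact("login user=bob password=hunter2 ok"))
# login user=bob password=*** ok
```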

Access Control

yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: log-reader
rules:
- apiGroups: [""]
  resources: ["pods/log"]
  verbs: ["get"]

Network Policy

yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: logging-network-policy
  namespace: logging
spec:
  podSelector:
    matchLabels:
      app: elasticsearch
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: logging
    ports:
    - protocol: TCP
      port: 9200

Log Analysis Tools

Loki Query Syntax

Basic Queries

promql
# All logs with the label app="nginx"
{app="nginx"}

# Multiple label matchers
{namespace="default", app="nginx"}

# Regex label matcher
{app=~"nginx-.*"}

Log Filtering

promql
# Lines containing "error"
{app="nginx"} |= "error"

# Lines NOT containing "debug"
{app="nginx"} != "debug"

# Regex line filter
{app="nginx"} |~ "error|warn"

JSON Parsing

promql
# Parse each line as JSON into labels
{app="nginx"} | json

# Filter on an extracted field
{app="nginx"} | json | level="error"

# Numeric comparison on an extracted field
{app="nginx"} | json | status >= 500

Aggregation Queries

promql
# Number of log lines over the last hour
count_over_time({app="nginx"}[1h])

# Per-second log rate over 5 minutes
rate({app="nginx"}[5m])

# Counts grouped by status code
sum by (status) (count_over_time({app="nginx"} | json [1h]))
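
To build intuition for what count_over_time and rate compute, here is a toy re-implementation over a list of (timestamp, line) pairs. This is a simplification, not Loki's actual engine: it evaluates a single window ending at `now` rather than a sliding range:

```python
def count_over_time(entries, now, window_s):
    """Number of log lines whose timestamp falls inside [now - window, now]."""
    return sum(1 for ts, _ in entries if now - window_s <= ts <= now)

def rate(entries, now, window_s):
    """Per-second rate of log lines over the same window."""
    return count_over_time(entries, now, window_s) / window_s

logs = [(0, "a"), (100, "b"), (250, "c"), (290, "d")]
print(count_over_time(logs, now=300, window_s=300))  # 4
print(rate(logs, now=300, window_s=60))              # 2/60 ≈ 0.033
```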

Kibana Visualization

Creating an Index Pattern

1. Kibana -> Management -> Index Patterns (called Data Views in Kibana 8+)
2. Create index pattern: k8s-logs-*
3. Select time field: @timestamp

Creating a Dashboard

1. Kibana -> Dashboard -> Create dashboard
2. Add a visualization
3. Select the index pattern
4. Configure the aggregation
5. Save the dashboard

Summary

This chapter covered the core concepts and practices of Kubernetes log management:

  1. Logging architecture: the Kubernetes log types and overall architecture
  2. Log collection: the two mainstream agents, Fluentd and Promtail
  3. Log storage: deploying and configuring Elasticsearch and Loki
  4. Log analysis: methods for querying and analyzing logs
  5. Practical application: complete, worked logging setups
  6. Troubleshooting: diagnosing and resolving common problems

Log management is a key part of Kubernetes operations, underpinning both problem diagnosis and system optimization.

Next Steps