
Backup and Disaster Recovery

Overview

Backup and disaster recovery is a critical capability for production Kubernetes environments. This chapter takes an in-depth look at building a reliable backup and disaster recovery system, covering backup strategies, disaster recovery procedures, and multi-cluster management.

Core Concepts

Disaster Recovery Objectives

  • RPO (Recovery Point Objective): the maximum acceptable window of data loss
  • RTO (Recovery Time Objective): the maximum acceptable time to restore service
  • Data consistency: integrity and completeness of backed-up data
  • Business continuity: minimizing the duration of service interruption
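
To make the RPO concrete: a monitoring script can flag a violated recovery point objective simply by checking the age of the newest backup file. A minimal sketch (the snapshot path and 30-minute threshold are examples; `stat -c %Y` is GNU coreutils):

```shell
# check_rpo RPO_SECONDS FILE: succeed only if FILE was modified within the RPO window.
check_rpo() {
  rpo_seconds=$1
  file=$2
  [ -f "$file" ] || return 1                      # no backup at all violates the RPO
  age=$(( $(date +%s) - $(stat -c %Y "$file") ))  # seconds since last modification
  [ "$age" -le "$rpo_seconds" ]
}

# Example: flag a violation if the newest etcd snapshot is older than 30 minutes.
# latest=$(ls -t /backup/etcd-snapshot-*.db 2>/dev/null | head -n 1)
# check_rpo 1800 "$latest" || echo "RPO violated: newest backup is too old"
```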

Backup Types

  • Full backup: a complete copy of all data
  • Incremental backup: only data changed since the previous backup
  • Differential backup: all changes relative to a baseline full backup
  • Snapshot backup: fast, storage-level point-in-time copies
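
These types are usually combined in a rotation policy, for example a full backup on Sundays and incrementals the rest of the week. A minimal sketch of that decision (hypothetical helper; the weekday is passed in for testability):

```shell
# backup_type ISO_WEEKDAY (1=Mon .. 7=Sun): full on Sunday, incremental otherwise.
backup_type() {
  if [ "$1" -eq 7 ]; then
    echo full
  else
    echo incremental
  fi
}

# Example: decide today's backup type from the current weekday.
# backup_type "$(date +%u)"
```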

Disaster Recovery Strategies

  • Cold standby: periodic backups, manual recovery
  • Warm standby: periodic backups, semi-automated recovery
  • Hot standby: real-time replication, automatic failover
  • Active-active: multiple clusters serving traffic simultaneously

Backup Strategies

etcd Backup

yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "*/30 * * * *"
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
            effect: NoSchedule
          containers:
          - name: etcd-backup
            image: bitnami/etcd:latest
            command:
            - /bin/sh
            - -c
            - |
              ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db \
                --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                --cert=/etc/kubernetes/pki/etcd/server.crt \
                --key=/etc/kubernetes/pki/etcd/server.key \
                --endpoints=https://127.0.0.1:2379
              
              find /backup -name "etcd-snapshot-*.db" -mtime +7 -delete
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
              type: Directory
          - name: backup
            persistentVolumeClaim:
              claimName: etcd-backup-pvc
          restartPolicy: OnFailure
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: etcd-backup-pvc
  namespace: kube-system
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: fast-ssd

Velero Backup

yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: aws-backup
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: my-k8s-backups
    prefix: velero
  config:
    region: us-west-2
---
apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  name: aws-snapshots
  namespace: velero
spec:
  provider: aws
  config:
    region: us-west-2
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
    - production
    - staging
    excludedResources:
    - events
    - pods
    storageLocation: aws-backup
    volumeSnapshotLocations:
    - aws-snapshots
    ttl: 720h
    hooks:
      resources:
      - name: pre-backup-hook
        includedNamespaces:
        - production
        labelSelector:
          matchLabels:
            app: mysql
        pre:
        - exec:
            container: mysql
            command:
            - /bin/sh
            - -c
            - "mysqldump --all-databases > /backup/dump.sql"
            onError: Continue
            timeout: 300s
---
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: production-backup
  namespace: velero
spec:
  includedNamespaces:
  - production
  excludedResources:
  - events
  storageLocation: aws-backup
  volumeSnapshotLocations:
  - aws-snapshots
  ttl: 720h
  hooks:
    resources:
    - name: pre-backup-hook
      includedNamespaces:
      - production
      labelSelector:
        matchLabels:
          backup: enabled
      pre:
      - exec:
          container: app
          command:
          - /bin/sh
          - -c
          - "sync && echo 3 > /proc/sys/vm/drop_caches"
          onError: Continue
          timeout: 60s
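
Velero backups run asynchronously; automation that depends on a backup usually polls the Backup object's `status.phase` until it reaches a terminal state. A minimal polling sketch (the `kubectl` invocation in the comment assumes Velero's CRDs live in the `velero` namespace):

```shell
# wait_for_phase CMD [TRIES]: run CMD repeatedly until it prints a terminal phase.
# Succeeds on "Completed", fails on "Failed"/"PartiallyFailed" or after TRIES polls.
wait_for_phase() {
  cmd=$1
  tries=${2:-60}
  i=0
  while [ "$i" -lt "$tries" ]; do
    phase=$($cmd)
    case $phase in
      Completed) return 0 ;;
      Failed|PartiallyFailed) return 1 ;;
    esac
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Real use (assumed resource names):
# wait_for_phase "kubectl get backup production-backup -n velero -o jsonpath={.status.phase}"
```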

Database Backups

yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mysql-backup
  namespace: production
spec:
  schedule: "0 */6 * * *"
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: mysql-backup
            image: mysql:8.0
            command:
            - /bin/sh
            - -c
            - |
              TS=$(date +%Y%m%d-%H%M%S)
              mysqldump -h mysql -u root -p"$MYSQL_ROOT_PASSWORD" --all-databases --single-transaction --routines --triggers --events > /backup/mysql-backup-$TS.sql
              gzip /backup/mysql-backup-$TS.sql
              find /backup -name "mysql-backup-*.sql.gz" -mtime +30 -delete
            env:
            - name: MYSQL_ROOT_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: mysql-secret
                  key: root-password
            volumeMounts:
            - name: backup
              mountPath: /backup
          volumes:
          - name: backup
            persistentVolumeClaim:
              claimName: mysql-backup-pvc
          restartPolicy: OnFailure
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mongodb-backup
  namespace: production
spec:
  schedule: "0 */6 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: mongodb-backup
            image: mongo:latest
            command:
            - /bin/sh
            - -c
            - |
              TS=$(date +%Y%m%d-%H%M%S)
              mongodump --uri="mongodb://mongodb:27017" --out=/backup/mongodb-backup-$TS
              tar -czf /backup/mongodb-backup-$TS.tar.gz -C /backup mongodb-backup-$TS
              rm -rf /backup/mongodb-backup-$TS
              find /backup -name "mongodb-backup-*.tar.gz" -mtime +30 -delete
            volumeMounts:
            - name: backup
              mountPath: /backup
          volumes:
          - name: backup
            persistentVolumeClaim:
              claimName: mongodb-backup-pvc
          restartPolicy: OnFailure
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: redis-backup
  namespace: production
spec:
  schedule: "0 */1 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: redis-backup
            image: redis:latest
            command:
            - /bin/sh
            - -c
            - |
              # BGSAVE is asynchronous: wait for LASTSAVE to advance instead of sleeping a fixed time
              LAST=$(redis-cli -h redis LASTSAVE)
              redis-cli -h redis BGSAVE
              until [ "$(redis-cli -h redis LASTSAVE)" != "$LAST" ]; do sleep 1; done
              cp /data/dump.rdb /backup/redis-backup-$(date +%Y%m%d-%H%M%S).rdb
              find /backup -name "redis-backup-*.rdb" -mtime +7 -delete
            volumeMounts:
            - name: redis-data
              mountPath: /data
              readOnly: true
            - name: backup
              mountPath: /backup
          volumes:
          - name: redis-data
            persistentVolumeClaim:
              claimName: redis-pvc
          - name: backup
            persistentVolumeClaim:
              claimName: redis-backup-pvc
          restartPolicy: OnFailure
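
All three CronJobs above enforce retention with `find -mtime +N -delete`. Wrapped as a reusable helper, that one line is easy to test in isolation before trusting it with real backups:

```shell
# prune_old DIR PATTERN DAYS: delete files in DIR matching PATTERN older than DAYS days.
prune_old() {
  dir=$1
  pattern=$2
  days=$3
  find "$dir" -name "$pattern" -mtime +"$days" -delete
}

# Example matching the MySQL job above:
# prune_old /backup 'mysql-backup-*.sql.gz' 30
```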

Disaster Recovery

Recovery Plan Configuration

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: disaster-recovery-plan
  namespace: production
data:
  recovery-plan.md: |
    # Disaster Recovery Plan

    ## Recovery Priorities
    1. Core services (API gateway, authentication service)
    2. Databases (MySQL, MongoDB, Redis)
    3. Business services (user service, order service)
    4. Monitoring and alerting (Prometheus, Grafana)

    ## Recovery Steps

    ### 1. Cluster recovery
    ```bash
    # Restore etcd
    ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
      --data-dir=/var/lib/etcd

    # Restart etcd
    systemctl restart etcd
    ```

    ### 2. Database recovery
    ```bash
    # Restore MySQL
    gunzip < /backup/mysql-backup.sql.gz | mysql -u root -p

    # Restore MongoDB
    tar -xzf /backup/mongodb-backup.tar.gz
    mongorestore /backup/mongodb-backup

    # Restore Redis
    cp /backup/redis-backup.rdb /data/dump.rdb
    ```

    ### 3. Application recovery
    ```bash
    # Restore with Velero
    velero restore create --from-backup production-backup

    # Or apply saved manifests manually
    kubectl apply -f /backup/production/
    ```

    ## Verification Steps
    1. Check that all Pods are running
    2. Verify service connectivity
    3. Run smoke tests
    4. Check data integrity
---
apiVersion: batch/v1
kind: Job
metadata:
  name: disaster-recovery-test
  namespace: production
spec:
  template:
    spec:
      containers:
      - name: recovery-test
        image: bitnami/kubectl:latest
        command:
        - /bin/sh
        - -c
        - |
          echo "Starting disaster recovery test..."
          
          # Check Pod status
          kubectl get pods -n production
          
          # Verify service connectivity
          kubectl run test-curl --image=curlimages/curl -n production --rm -i --restart=Never -- curl http://user-service:8080/actuator/health
          
          # Run a smoke test
          kubectl exec deployment/api-gateway -n production -- curl http://user-service:8080/api/health
          
          # Check data integrity (quote the command so the password expands inside the MySQL pod)
          kubectl exec deployment/mysql -n production -- sh -c 'mysql -u root -p"$MYSQL_ROOT_PASSWORD" -e "SHOW DATABASES;"'
          
          echo "Disaster recovery test completed successfully!"
      restartPolicy: OnFailure

Velero Restore

yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: production-restore
  namespace: velero
spec:
  backupName: production-backup
  includedNamespaces:
  - production
  excludedResources:
  - events
  - pods
  restorePVs: true
  preserveNodePorts: true
  hooks:
    resources:
    - name: post-restore-hook
      includedNamespaces:
      - production
      labelSelector:
        matchLabels:
          app: mysql
      post:
      - exec:
          container: mysql
          command:
          - /bin/sh
          - -c
          - "mysql < /backup/dump.sql"
          onError: Continue
          timeout: 300s

Multi-Cluster Failover

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-failover-config
  namespace: production
data:
  clusters.yaml: |
    clusters:
      - name: primary
        endpoint: https://cluster1.example.com
        region: us-west-1
        priority: 1
      - name: secondary
        endpoint: https://cluster2.example.com
        region: us-east-1
        priority: 2
      - name: tertiary
        endpoint: https://cluster3.example.com
        region: eu-west-1
        priority: 3
    
    failover:
      enabled: true
      healthCheckInterval: 30s
      failureThreshold: 3
      recoveryThreshold: 2
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cluster-health-check
  namespace: production
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: health-check
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              # Check the health of the primary cluster
              if ! kubectl get nodes; then
                echo "Primary cluster is unhealthy, initiating failover..."
                
                # Point DNS at the standby cluster
                # Update the load balancer configuration
                # Notify the operations team
                
                # Trigger the failover
                kubectl create configmap failover-trigger --from-literal=triggered=true -n production
              fi
          restartPolicy: OnFailure
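
The `failureThreshold: 3` in the ConfigMap above means failover should only fire after three consecutive failed probes, so a single transient error does not flip traffic. The core of that logic, sketched as a testable function (probe results are passed in as a list):

```shell
# should_failover "R1 R2 ..." THRESHOLD: each result is 0 (healthy) or 1 (failed probe).
# Succeeds once THRESHOLD consecutive failures have been observed.
should_failover() {
  streak=0
  for r in $1; do
    if [ "$r" -eq 1 ]; then
      streak=$((streak + 1))
    else
      streak=0   # any healthy probe resets the counter
    fi
    if [ "$streak" -ge "$2" ]; then
      return 0
    fi
  done
  return 1
}
```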

Multi-Cluster Management

Cluster Federation Configuration

yaml
apiVersion: core.kubefed.io/v1beta1
kind: KubeFedCluster
metadata:
  name: cluster1
  namespace: kube-federation-system
spec:
  apiEndpoint: https://cluster1.example.com
  caBundle: <base64-encoded-ca-bundle>
  secretRef:
    name: cluster1-secret
---
apiVersion: core.kubefed.io/v1beta1
kind: KubeFedCluster
metadata:
  name: cluster2
  namespace: kube-federation-system
spec:
  apiEndpoint: https://cluster2.example.com
  caBundle: <base64-encoded-ca-bundle>
  secretRef:
    name: cluster2-secret
---
apiVersion: types.kubefed.io/v1beta1
kind: FederatedDeployment
metadata:
  name: my-app
  namespace: production
spec:
  template:
    metadata:
      labels:
        app: my-app
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          containers:
          - name: my-app
            image: registry.example.com/my-app:v1.0.0
            ports:
            - containerPort: 8080
  placement:
    clusters:
    - name: cluster1
    - name: cluster2
  overrides:
  - clusterName: cluster1
    clusterOverrides:
    - path: "/spec/replicas"
      value: 5
  - clusterName: cluster2
    clusterOverrides:
    - path: "/spec/replicas"
      value: 3

Cross-Cluster Service Discovery

yaml
apiVersion: types.kubefed.io/v1beta1
kind: FederatedService
metadata:
  name: my-app
  namespace: production
spec:
  template:
    metadata:
      labels:
        app: my-app
    spec:
      type: LoadBalancer
      selector:
        app: my-app
      ports:
      - port: 80
        targetPort: 8080
  placement:
    clusters:
    - name: cluster1
    - name: cluster2
---
apiVersion: types.kubefed.io/v1beta1
kind: FederatedIngress
metadata:
  name: my-app-ingress
  namespace: production
spec:
  template:
    spec:
      ingressClassName: nginx
      rules:
      - host: www.example.com
        http:
          paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80
  placement:
    clusters:
    - name: cluster1
    - name: cluster2

kubectl Commands

Backup Management

bash
# Take an etcd snapshot
ETCDCTL_API=3 etcdctl snapshot save snapshot.db \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Check the integrity of an etcd snapshot
ETCDCTL_API=3 etcdctl snapshot status snapshot.db

# Create a Velero backup
velero backup create production-backup --include-namespaces production

# List Velero backups
velero backup get

# Show details of a Velero backup
velero backup describe production-backup

# Show logs of a Velero backup
velero backup logs production-backup

# Create a scheduled backup
velero schedule create daily-backup --schedule="0 2 * * *"

# List backup schedules
velero schedule get

# Trigger a backup from a schedule manually
velero backup create --from-schedule daily-backup

Restore Management

bash
# Restore etcd from a snapshot
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --data-dir=/var/lib/etcd

# Create a Velero restore
velero restore create --from-backup production-backup

# List Velero restores
velero restore get

# Show details of a Velero restore
velero restore describe production-restore

# Show logs of a Velero restore
velero restore logs production-restore

# Restore only specific resource types
velero restore create --from-backup production-backup --include-resources deployments,services

# Restore only a specific namespace
velero restore create --from-backup production-backup --include-namespaces production

Multi-Cluster Management

bash
# List federated clusters
kubectl get kubefedclusters -n kube-federation-system

# List federated deployments
kubectl get federateddeployments -n production

# List federated services
kubectl get federatedservices -n production

# Trigger a failover manually
kubectl create configmap failover-trigger --from-literal=triggered=true -n production

# Check cluster health status
kubectl get clusters -n production

# Propagate a resource to all member clusters
kubectl apply -f federated-deployment.yaml

Database Backup and Restore

bash
# MySQL backup (quote the command so MYSQL_ROOT_PASSWORD expands inside the pod, not on your local shell)
kubectl exec deployment/mysql -n production -- sh -c 'mysqldump -u root -p"$MYSQL_ROOT_PASSWORD" --all-databases' > mysql-backup.sql

# MySQL restore
kubectl exec -i deployment/mysql -n production -- sh -c 'mysql -u root -p"$MYSQL_ROOT_PASSWORD"' < mysql-backup.sql

# MongoDB backup
kubectl exec deployment/mongodb -n production -- mongodump --out=/backup

# MongoDB restore
kubectl exec deployment/mongodb -n production -- mongorestore /backup

# Redis backup
kubectl exec deployment/redis -n production -- redis-cli BGSAVE

# Redis restore: copy the RDB file in, then shut down without saving so Redis loads it on restart
kubectl cp redis-backup.rdb production/redis-pod:/data/dump.rdb
kubectl exec deployment/redis -n production -- redis-cli SHUTDOWN NOSAVE
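
A dump that was cut off mid-stream restores silently but incompletely. By default mysqldump appends a `-- Dump completed on ...` marker at the end of the file, which gives a cheap integrity check before any restore:

```shell
# dump_complete FILE: succeed if the dump ends with mysqldump's completion marker.
# (Checks the last two lines, since some dumps end with a trailing blank line.)
dump_complete() {
  tail -n 2 "$1" | grep -q 'Dump completed'
}

# Example:
# dump_complete mysql-backup.sql || echo "dump looks truncated, do not restore it"
```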

Practical Examples

Example 1: A Complete DR Solution

Scenario

Build a complete disaster recovery solution covering scheduled backups, automated recovery, and failover.

Configuration

yaml
# Backup policy configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: backup-policy
  namespace: production
data:
  policy.yaml: |
    backup:
      etcd:
        enabled: true
        schedule: "*/30 * * * *"
        retention: 7d
      applications:
        enabled: true
        schedule: "0 2 * * *"
        retention: 30d
      databases:
        mysql:
          enabled: true
          schedule: "0 */6 * * *"
          retention: 30d
        mongodb:
          enabled: true
          schedule: "0 */6 * * *"
          retention: 30d
        redis:
          enabled: true
          schedule: "0 */1 * * *"
          retention: 7d
    
    recovery:
      autoRecovery: true
      healthCheckInterval: 30s
      failureThreshold: 3
    
    failover:
      enabled: true
      clusters:
        - name: primary
          endpoint: https://cluster1.example.com
        - name: secondary
          endpoint: https://cluster2.example.com
---
# Automated backup job
apiVersion: batch/v1
kind: CronJob
metadata:
  name: automated-backup
  namespace: production
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              echo "Starting automated backup..."
              
              # Back up etcd (note: bitnami/kubectl does not ship etcdctl or the AWS CLI;
              # in practice use a custom backup image that bundles these tools)
              ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db
              
              # Back up application manifests
              kubectl get all -n production -o yaml > /backup/production-$(date +%Y%m%d-%H%M%S).yaml
              
              # Back up the database (the password expands inside the MySQL pod)
              kubectl exec deployment/mysql -n production -- sh -c 'mysqldump -u root -p"$MYSQL_ROOT_PASSWORD" --all-databases' > /backup/mysql-$(date +%Y%m%d-%H%M%S).sql
              
              # Upload to object storage
              aws s3 sync /backup s3://my-k8s-backups/$(date +%Y%m%d)/
              
              # Prune old backups
              find /backup -mtime +30 -delete
              
              echo "Backup completed successfully!"
            volumeMounts:
            - name: backup
              mountPath: /backup
          volumes:
          - name: backup
            persistentVolumeClaim:
              claimName: backup-pvc
          restartPolicy: OnFailure
---
# Health check and automated recovery
apiVersion: batch/v1
kind: CronJob
metadata:
  name: health-check-and-recovery
  namespace: production
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: health-check
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              echo "Performing health check..."
              
              # Check the critical services
              if ! kubectl get pods -n production -l app=critical | grep -q Running; then
                echo "Critical services are down, initiating recovery..."
                
                # Try restarting the service
                kubectl rollout restart deployment/critical-service -n production
                
                # Wait for the rollout to finish
                kubectl rollout status deployment/critical-service -n production --timeout=60s
                
                # If it is still down, raise an alert
                if ! kubectl get pods -n production -l app=critical | grep -q Running; then
                  echo "Recovery failed, triggering alert..."
                  # Send an alert notification here
                fi
              fi
              
              echo "Health check completed."
          restartPolicy: OnFailure
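
The restart-wait-recheck pattern in the recovery job generalizes to a small retry helper, which avoids sprinkling fixed `sleep`s through recovery scripts. A minimal sketch:

```shell
# retry TRIES DELAY CMD [ARGS...]: run CMD until it succeeds, at most TRIES times,
# sleeping DELAY seconds between attempts.
retry() {
  tries=$1
  delay=$2
  shift 2
  n=1
  while :; do
    "$@" && return 0
    if [ "$n" -ge "$tries" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example (assumed deployment name):
# retry 5 10 kubectl rollout status deployment/critical-service -n production --timeout=30s
```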

Example 2: Cross-Region DR

Scenario

Build a cross-region disaster recovery setup with off-site backups and failover between regions.

Configuration

yaml
# Cross-region replication configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: cross-region-replication
  namespace: production
data:
  replication.yaml: |
    regions:
      - name: us-west-2
        primary: true
        endpoint: https://cluster-west.example.com
        backupLocation: s3://backup-west
      - name: us-east-1
        primary: false
        endpoint: https://cluster-east.example.com
        backupLocation: s3://backup-east
    
    replication:
      enabled: true
      syncInterval: 5m
      dataTypes:
        - etcd
        - databases
        - applications
---
# Data synchronization job
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cross-region-sync
  namespace: production
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: sync
            image: amazon/aws-cli:latest
            command:
            - /bin/sh
            - -c
            - |
              echo "Starting cross-region sync..."
              
              # Replicate backups to the secondary region
              # (aws s3 sync takes --source-region and --region; there is no --dest-region flag)
              aws s3 sync s3://backup-west s3://backup-east --source-region us-west-2 --region us-east-1
              
              # Sanity check: count the replicated objects
              aws s3 ls s3://backup-east --recursive | wc -l
              
              echo "Cross-region sync completed."
          restartPolicy: OnFailure
---
# Failover configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: failover-config
  namespace: production
data:
  failover.yaml: |
    triggers:
      - type: cluster-down
        threshold: 3
        action: switch-to-secondary
      - type: data-loss
        threshold: 1
        action: restore-from-backup
      - type: performance-degradation
        threshold: 5
        action: scale-out
    
    notifications:
      - type: email
        recipients:
          - ops@example.com
      - type: slack
        channel: "#alerts"

Example 3: Blue-Green DR Switchover

Scenario

Use a blue-green deployment model for disaster recovery switchover with zero downtime.

Configuration

yaml
# Blue-green DR configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: blue-green-dr
  namespace: production
data:
  blue-green.yaml: |
    environments:
      blue:
        cluster: cluster-blue
        endpoint: https://blue.example.com
        active: true
      green:
        cluster: cluster-green
        endpoint: https://green.example.com
        active: false
    
    switchStrategy: rolling
    healthCheckTimeout: 300s
    rollbackEnabled: true
---
# DNS switchover job
apiVersion: batch/v1
kind: Job
metadata:
  name: dns-switch
  namespace: production
spec:
  template:
    spec:
      containers:
      - name: dns-switch
        image: amazon/aws-cli:latest
        command:
        - /bin/sh
        - -c
        - |
          echo "Switching DNS to green environment..."
          
          # Update the Route53 record
          aws route53 change-resource-record-sets \
            --hosted-zone-id $HOSTED_ZONE_ID \
            --change-batch '{
              "Changes": [
                {
                  "Action": "UPSERT",
                  "ResourceRecordSet": {
                    "Name": "www.example.com",
                    "Type": "CNAME",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "green.example.com"}]
                  }
                }
              ]
            }'
          
          echo "DNS switch completed."
        env:
        - name: HOSTED_ZONE_ID
          valueFrom:
            configMapKeyRef:
              name: dns-config
              key: hosted-zone-id
      restartPolicy: OnFailure
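
The inline change-batch JSON above is easy to get wrong when values are substituted by hand; generating it from a small function keeps the switchover script readable. A sketch (hypothetical helper, same record shape as the Job above):

```shell
# route53_upsert_cname NAME TARGET TTL: print a Route53 change-batch that UPSERTs a CNAME.
route53_upsert_cname() {
  cat <<EOF
{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"$1","Type":"CNAME","TTL":$3,"ResourceRecords":[{"Value":"$2"}]}}]}
EOF
}

# Example:
# aws route53 change-resource-record-sets --hosted-zone-id "$HOSTED_ZONE_ID" \
#   --change-batch "$(route53_upsert_cname www.example.com green.example.com 60)"
```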

Troubleshooting Guide

Problem 1: Backup Failures

Symptoms

  • Backup jobs fail
  • Backup files are corrupted

Diagnosis

bash
# Check backup job logs
kubectl logs job/etcd-backup -n kube-system

# Check Velero backup status
velero backup describe production-backup

# Check the backup storage location
velero backup-location get

# Verify access to the backup storage
aws s3 ls s3://my-k8s-backups/

# Verify snapshot integrity
ETCDCTL_API=3 etcdctl snapshot status snapshot.db

Resolution

yaml
# Increase the backup timeout (assumes the backup job's script honors BACKUP_TIMEOUT)
spec:
  template:
    spec:
      containers:
      - name: backup
        env:
        - name: BACKUP_TIMEOUT
          value: "600"

Problem 2: Restore Failures

Symptoms

  • The restore hangs partway through
  • Services fail to start after the restore

Diagnosis

bash
# Check restore logs
velero restore logs production-restore

# Check Pod status
kubectl get pods -n production

# Check Pod events
kubectl describe pod <pod-name> -n production

# Check PV status
kubectl get pv

# Check PVC status
kubectl get pvc -n production

Resolution

bash
# 手动恢复PV
kubectl apply -f pv-backup.yaml

# 手动恢复PVC
kubectl apply -f pvc-backup.yaml

# 重启恢复任务
velero restore create --from-backup production-backup --restore-volumes=true

Problem 3: Failover Failures

Symptoms

  • Automatic failover does not trigger
  • Services are unavailable after failover

Diagnosis

bash
# Check cluster health status
kubectl get clusters -n production

# Check the failover configuration
kubectl get configmap failover-config -n production -o yaml

# Check failover job logs
kubectl logs job/failover -n production

# Check DNS resolution
nslookup www.example.com

# Check load balancer and service configuration
kubectl get services -n production

Resolution

yaml
# Trigger a failover manually
apiVersion: batch/v1
kind: Job
metadata:
  name: manual-failover
  namespace: production
spec:
  template:
    spec:
      containers:
      - name: failover
        image: bitnami/kubectl:latest
        command:
        - /bin/sh
        - -c
        - |
          # Update DNS
          kubectl apply -f dns-green.yaml
          
          # Update the load balancer
          kubectl apply -f lb-green.yaml
          
          # Verify the services
          kubectl get pods -n production
      restartPolicy: OnFailure

Best Practices

1. Backup Strategy

The 3-2-1 Backup Rule

  • Keep 3 copies of your data
  • On 2 different types of storage media
  • With 1 copy stored off-site

Backup Frequency

  • Critical data: hourly
  • Important data: daily
  • Routine data: weekly

Backup Verification

  • Run restore tests regularly
  • Verify data integrity
  • Monitor backup jobs and alert on failures

2. Recovery Strategy

Recovery Priorities

  1. Core infrastructure
  2. Database services
  3. Core business services
  4. Supporting services

Recovery Testing

  • Run regular recovery drills
  • Automate recovery tests
  • Validate recovery times against the RTO

3. Multi-Cluster Management

Cluster Planning

  • Primary cluster: carries the main workload
  • Standby cluster: synchronized in real time, ready to take over
  • Test cluster: used to validate backups and restores

Data Synchronization

  • Real-time sync: critical data
  • Scheduled sync: routine data
  • Manual sync: historical data

4. Monitoring and Alerting

Metrics to Monitor

  • Backup success rate
  • Backup duration
  • Storage space usage
  • Restore success rate
  • RTO/RPO compliance rate

Alerts to Configure

  • Backup failure
  • Low storage space
  • Restore failure
  • Cluster outage

Summary

Backup and disaster recovery is a critical capability for production Kubernetes environments. In this chapter we covered:

  1. Backup strategies: etcd backup, Velero backup, database backups
  2. Disaster recovery: recovery procedures, Velero restore, multi-cluster failover
  3. Multi-cluster management: cluster federation, cross-cluster service discovery
  4. Practical examples: a complete DR solution, cross-region DR, blue-green DR switchover
  5. Troubleshooting: diagnosing and resolving common problems
  6. Best practices: guidance drawn from production disaster recovery experience

With this knowledge you should be able to build a reliable backup and disaster recovery system that protects business continuity and data safety.
