# Backup and Disaster Recovery

## Overview

Disaster recovery is a critical capability for any production Kubernetes environment. This chapter covers how to build a reliable backup and disaster recovery system, including backup strategies, disaster recovery procedures, and multi-cluster management.

## Core Concepts

### Recovery Objectives

- RPO (Recovery Point Objective): the maximum acceptable window of data loss, expressed as a time span
- RTO (Recovery Time Objective): the maximum acceptable time to restore service
- Data consistency: the integrity of backed-up data
- Business continuity: minimizing service interruption
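Whether a backup cadence still satisfies a given RPO can be checked mechanically: the newest backup must be no older than the RPO window. A minimal sketch, with an illustrative function name and example values (not taken from any specific tool):

```shell
#!/bin/sh
# Return success when the newest backup is no older than the RPO window.
# Arguments: RPO in seconds, age of the newest backup in seconds.
rpo_ok() {
  rpo_seconds=$1
  backup_age_seconds=$2
  [ "$backup_age_seconds" -le "$rpo_seconds" ]
}

# A 30-minute snapshot schedule comfortably meets a 1-hour RPO...
if rpo_ok 3600 1800; then echo "RPO satisfied"; fi
# ...but a 2-hour-old backup violates it.
if ! rpo_ok 3600 7200; then echo "RPO violated"; fi
```

In practice the backup age would come from the mtime of the newest snapshot file; the check itself stays this simple.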
### Backup Types

- Full backup: a complete copy of all data
- Incremental backup: only data changed since the last backup
- Differential backup: changes relative to a baseline full backup
- Snapshot backup: fast, storage-level point-in-time copies
### Disaster Recovery Strategies

- Cold standby: periodic backups, manual recovery
- Warm standby: periodic backups, semi-automated recovery
- Hot standby: real-time replication, automatic failover
- Active-active: multiple clusters serving traffic simultaneously
## Backup Strategies

### etcd Backup
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "*/30 * * * *"
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          nodeSelector:
            node-role.kubernetes.io/master: ""
          # Control-plane nodes are usually tainted; tolerate the taint so the Job can schedule there
          tolerations:
          - key: node-role.kubernetes.io/master
            operator: Exists
            effect: NoSchedule
          containers:
          - name: etcd-backup
            image: bitnami/etcd:latest
            command:
            - /bin/sh
            - -c
            - |
              ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db \
                --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                --cert=/etc/kubernetes/pki/etcd/server.crt \
                --key=/etc/kubernetes/pki/etcd/server.key \
                --endpoints=https://127.0.0.1:2379
              # Prune snapshots older than 7 days
              find /backup -name "etcd-snapshot-*.db" -mtime +7 -delete
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
              type: Directory
          - name: backup
            persistentVolumeClaim:
              claimName: etcd-backup-pvc
          restartPolicy: OnFailure
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: etcd-backup-pvc
  namespace: kube-system
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: fast-ssd
```

### Velero Backup
```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: aws-backup
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: my-k8s-backups
    prefix: velero
  config:
    region: us-west-2
---
apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  name: aws-snapshots
  namespace: velero
spec:
  provider: aws
  config:
    region: us-west-2
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
    - production
    - staging
    excludedResources:
    - events
    - pods
    storageLocation: aws-backup
    volumeSnapshotLocations:
    - aws-snapshots
    ttl: 720h
    hooks:
      resources:
      - name: pre-backup-hook
        includedNamespaces:
        - production
        labelSelector:
          matchLabels:
            app: mysql
        pre:
        - exec:
            container: mysql
            command:
            - /bin/sh
            - -c
            - "mysqldump --all-databases > /backup/dump.sql"
            onError: Continue
            timeout: 300s
---
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: production-backup
  namespace: velero
spec:
  includedNamespaces:
  - production
  excludedResources:
  - events
  storageLocation: aws-backup
  volumeSnapshotLocations:
  - aws-snapshots
  ttl: 720h
  hooks:
    resources:
    - name: pre-backup-hook
      includedNamespaces:
      - production
      labelSelector:
        matchLabels:
          backup: enabled
      pre:
      - exec:
          container: app
          command:
          - /bin/sh
          - -c
          - "sync && echo 3 > /proc/sys/vm/drop_caches"
          onError: Continue
          timeout: 60s
```

### Database Backups
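One subtlety in the CronJobs below: the backup file name must be computed once and reused, otherwise the dump, compression, and cleanup steps can each evaluate `date` at a different second and end up referring to different file names. A tiny sketch of the naming convention (function and file names are illustrative):

```shell
#!/bin/sh
# Build a timestamped backup file name from a prefix, a stamp, and a suffix.
# Capturing the stamp once keeps every later step pointing at the same file.
stamped_name() {
  printf '%s-%s%s\n' "$1" "$2" "$3"
}

STAMP=$(date +%Y%m%d-%H%M%S)   # evaluated exactly once per run
stamped_name mysql-backup "$STAMP" .sql
```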
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mysql-backup
  namespace: production
spec:
  schedule: "0 */6 * * *"
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: mysql-backup
            image: mysql:8.0
            command:
            - /bin/sh
            - -c
            - |
              # Capture the timestamp once so dump, gzip, and cleanup agree on the file name
              STAMP=$(date +%Y%m%d-%H%M%S)
              mysqldump -h mysql -u root -p$MYSQL_ROOT_PASSWORD --all-databases \
                --single-transaction --routines --triggers --events \
                > /backup/mysql-backup-$STAMP.sql
              gzip /backup/mysql-backup-$STAMP.sql
              find /backup -name "mysql-backup-*.sql.gz" -mtime +30 -delete
            env:
            - name: MYSQL_ROOT_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: mysql-secret
                  key: root-password
            volumeMounts:
            - name: backup
              mountPath: /backup
          volumes:
          - name: backup
            persistentVolumeClaim:
              claimName: mysql-backup-pvc
          restartPolicy: OnFailure
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mongodb-backup
  namespace: production
spec:
  schedule: "0 */6 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: mongodb-backup
            image: mongo:latest
            command:
            - /bin/sh
            - -c
            - |
              # Capture the timestamp once so every step refers to the same directory
              STAMP=$(date +%Y%m%d-%H%M%S)
              mongodump --uri="mongodb://mongodb:27017" --out=/backup/mongodb-backup-$STAMP
              tar -czf /backup/mongodb-backup-$STAMP.tar.gz -C /backup mongodb-backup-$STAMP
              rm -rf /backup/mongodb-backup-$STAMP
              find /backup -name "mongodb-backup-*.tar.gz" -mtime +30 -delete
            volumeMounts:
            - name: backup
              mountPath: /backup
          volumes:
          - name: backup
            persistentVolumeClaim:
              claimName: mongodb-backup-pvc
          restartPolicy: OnFailure
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: redis-backup
  namespace: production
spec:
  schedule: "0 */1 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: redis-backup
            image: redis:latest
            command:
            - /bin/sh
            - -c
            - |
              redis-cli -h redis BGSAVE
              # Poll until the background save completes instead of relying on a fixed sleep
              while [ "$(redis-cli -h redis INFO persistence | tr -d '\r' | awk -F: '/rdb_bgsave_in_progress/ {print $2}')" = "1" ]; do
                sleep 1
              done
              cp /data/dump.rdb /backup/redis-backup-$(date +%Y%m%d-%H%M%S).rdb
              find /backup -name "redis-backup-*.rdb" -mtime +7 -delete
            volumeMounts:
            # Mounting the Redis data PVC here assumes the volume allows a second, read-only attachment
            - name: redis-data
              mountPath: /data
              readOnly: true
            - name: backup
              mountPath: /backup
          volumes:
          - name: redis-data
            persistentVolumeClaim:
              claimName: redis-pvc
          - name: backup
            persistentVolumeClaim:
              claimName: redis-backup-pvc
          restartPolicy: OnFailure
```

## Disaster Recovery
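The verification steps in the recovery plan below begin with checking Pod status; that check can be scripted by counting Pods whose STATUS column is neither `Running` nor `Completed` in `kubectl get pods` output. A sketch as pure text processing, so it can be exercised without a cluster:

```shell
#!/bin/sh
# Count unhealthy pods from `kubectl get pods` output supplied on stdin.
unhealthy_count() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { n++ } END { print n + 0 }'
}

# Example against captured output; in a cluster you would pipe in:
#   kubectl get pods -n production | unhealthy_count
cat <<'EOF' | unhealthy_count
NAME          READY   STATUS             RESTARTS   AGE
api-gateway   1/1     Running            0          5d
mysql         0/1     CrashLoopBackOff   12         5d
migrate-job   0/1     Completed          0          5d
EOF
# prints: 1
```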
### Recovery Procedure Configuration

````yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: disaster-recovery-plan
  namespace: production
data:
  recovery-plan.md: |
    # Disaster Recovery Plan

    ## Recovery Priority
    1. Core services (API gateway, authentication service)
    2. Databases (MySQL, MongoDB, Redis)
    3. Business services (user service, order service)
    4. Monitoring and alerting (Prometheus, Grafana)

    ## Recovery Steps

    ### 1. Cluster Recovery
    ```bash
    # Restore etcd from a snapshot
    ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
      --data-dir=/var/lib/etcd
    # Restart etcd
    systemctl restart etcd
    ```

    ### 2. Database Recovery
    ```bash
    # Restore MySQL
    gunzip < /backup/mysql-backup.sql.gz | mysql -u root -p
    # Restore MongoDB
    tar -xzf /backup/mongodb-backup.tar.gz
    mongorestore /backup/mongodb-backup
    # Restore Redis (copy the RDB file in place, then restart Redis)
    cp /backup/redis-backup.rdb /data/dump.rdb
    ```

    ### 3. Application Recovery
    ```bash
    # Restore with Velero
    velero restore create --from-backup production-backup
    # Or re-apply manifests manually
    kubectl apply -f /backup/production/
    ```

    ## Verification Steps
    1. Check that all Pods are healthy
    2. Verify service connectivity
    3. Run smoke tests
    4. Check data integrity
---
apiVersion: batch/v1
kind: Job
metadata:
  name: disaster-recovery-test
  namespace: production
spec:
  template:
    spec:
      containers:
      - name: recovery-test
        image: bitnami/kubectl:latest
        command:
        - /bin/sh
        - -c
        - |
          echo "Starting disaster recovery test..."
          # Check Pod status
          kubectl get pods -n production
          # Verify service connectivity (non-interactive, so no TTY flags)
          kubectl run test-curl --image=curlimages/curl -n production --restart=Never --rm -i -- curl -sf http://user-service:8080/actuator/health
          # Run a smoke test
          kubectl exec deployment/api-gateway -n production -- curl -sf http://user-service:8080/api/health
          # Check data integrity (assumes MYSQL_ROOT_PASSWORD is injected into this Job)
          kubectl exec deployment/mysql -n production -- mysql -u root -p$MYSQL_ROOT_PASSWORD -e "SHOW DATABASES;"
          echo "Disaster recovery test completed successfully!"
      restartPolicy: OnFailure
````

### Velero Restore
```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: production-restore
  namespace: velero
spec:
  backupName: production-backup
  includedNamespaces:
  - production
  excludedResources:
  - events
  - pods
  restorePVs: true
  preserveNodePorts: true
  hooks:
    resources:
    - name: post-restore-hook
      includedNamespaces:
      - production
      labelSelector:
        matchLabels:
          app: mysql
      # Restore hooks use postHooks/execTimeout (unlike Backup hooks, which use pre/post)
      postHooks:
      - exec:
          container: mysql
          command:
          - /bin/sh
          - -c
          - "mysql < /backup/dump.sql"
          onError: Continue
          execTimeout: 300s
```

### Multi-Cluster Failover
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-failover-config
  namespace: production
data:
  clusters.yaml: |
    clusters:
    - name: primary
      endpoint: https://cluster1.example.com
      region: us-west-1
      priority: 1
    - name: secondary
      endpoint: https://cluster2.example.com
      region: us-east-1
      priority: 2
    - name: tertiary
      endpoint: https://cluster3.example.com
      region: eu-west-1
      priority: 3
    failover:
      enabled: true
      healthCheckInterval: 30s
      failureThreshold: 3
      recoveryThreshold: 2
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cluster-health-check
  namespace: production
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: health-check
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              # Check the health of the primary cluster
              if ! kubectl get nodes; then
                echo "Primary cluster is unhealthy, initiating failover..."
                # Point DNS at the standby cluster
                # Update the load balancer configuration
                # Notify the on-call team
                # Record the failover trigger
                kubectl create configmap failover-trigger --from-literal=triggered=true -n production
              fi
          restartPolicy: OnFailure
```

## Multi-Cluster Management
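The failover configuration in the previous section ranks clusters by priority, so selecting a failover target amounts to picking the healthy cluster with the lowest priority number. A sketch of that selection as a text filter (the input format is made up for illustration):

```shell
#!/bin/sh
# Pick the healthy cluster with the lowest priority number.
# Input lines: "<name> <priority> <healthy: 0|1>", one cluster per line.
pick_cluster() {
  awk '$3 == 1 && (best == "" || $2 < bestp) { best = $1; bestp = $2 } END { print best }'
}

# primary (priority 1) is down, so the next-best healthy cluster is chosen.
printf '%s\n' "primary 1 0" "secondary 2 1" "tertiary 3 1" | pick_cluster
# prints: secondary
```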
### Cluster Federation Configuration
```yaml
apiVersion: core.kubefed.io/v1beta1
kind: KubeFedCluster
metadata:
  name: cluster1
  namespace: kube-federation-system
spec:
  apiEndpoint: https://cluster1.example.com
  caBundle: <base64-encoded-ca-bundle>
  # The client credentials (cert and key) live in the referenced Secret
  secretRef:
    name: cluster1-secret
---
apiVersion: core.kubefed.io/v1beta1
kind: KubeFedCluster
metadata:
  name: cluster2
  namespace: kube-federation-system
spec:
  apiEndpoint: https://cluster2.example.com
  caBundle: <base64-encoded-ca-bundle>
  secretRef:
    name: cluster2-secret
---
apiVersion: types.kubefed.io/v1beta1
kind: FederatedDeployment
metadata:
  name: my-app
  namespace: production
spec:
  template:
    metadata:
      labels:
        app: my-app
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          containers:
          - name: my-app
            image: registry.example.com/my-app:v1.0.0
            ports:
            - containerPort: 8080
  placement:
    clusters:
    - name: cluster1
    - name: cluster2
  overrides:
  - clusterName: cluster1
    clusterOverrides:
    - path: "/spec/replicas"
      value: 5
  - clusterName: cluster2
    clusterOverrides:
    - path: "/spec/replicas"
      value: 3
```

### Cross-Cluster Service Discovery
```yaml
apiVersion: types.kubefed.io/v1beta1
kind: FederatedService
metadata:
  name: my-app
  namespace: production
spec:
  template:
    metadata:
      labels:
        app: my-app
    spec:
      type: LoadBalancer
      selector:
        app: my-app
      ports:
      - port: 80
        targetPort: 8080
  placement:
    clusters:
    - name: cluster1
    - name: cluster2
---
# FederatedIngress requires the Ingress type to be enabled for federation
# (for example with `kubefedctl enable ingresses`)
apiVersion: types.kubefed.io/v1beta1
kind: FederatedIngress
metadata:
  name: my-app-ingress
  namespace: production
spec:
  template:
    metadata:
      annotations:
        kubernetes.io/ingress.class: nginx
    spec:
      rules:
      - host: www.example.com
        http:
          paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80
  placement:
    clusters:
    - name: cluster1
    - name: cluster2
```

## Operational Commands
### Backup Management
```bash
# Take an etcd snapshot
ETCDCTL_API=3 etcdctl snapshot save snapshot.db \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Inspect an etcd snapshot
ETCDCTL_API=3 etcdctl snapshot status snapshot.db

# Create a Velero backup
velero backup create production-backup --include-namespaces production

# List Velero backups
velero backup get

# Show backup details
velero backup describe production-backup

# Show backup logs
velero backup logs production-backup

# Create a scheduled backup
velero schedule create daily-backup --schedule="0 2 * * *"

# List schedules
velero schedule get

# Trigger a backup manually from a schedule
velero backup create --from-schedule daily-backup
```

### Restore Management
```bash
# Restore etcd from a snapshot
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --data-dir=/var/lib/etcd

# Restore from a Velero backup
velero restore create --from-backup production-backup

# List Velero restores
velero restore get

# Show restore details
velero restore describe production-restore

# Show restore logs
velero restore logs production-restore

# Restore only specific resource types
velero restore create --from-backup production-backup --include-resources deployments,services

# Restore a specific namespace
velero restore create --from-backup production-backup --include-namespaces production
```

### Multi-Cluster Management
```bash
# List federated clusters
kubectl get kubefedclusters -n kube-federation-system

# List federated resources
kubectl get federateddeployments -n production

# List cross-cluster services
kubectl get federatedservices -n production

# Trigger a failover manually
kubectl create configmap failover-trigger --from-literal=triggered=true -n production

# Check cluster health (requires a Cluster CRD, e.g. from Cluster API)
kubectl get clusters -n production

# Apply a federated resource (propagated to all placed clusters)
kubectl apply -f federated-deployment.yaml
```

### Database Backup and Restore
```bash
# MySQL backup
kubectl exec deployment/mysql -n production -- mysqldump -u root -p$MYSQL_ROOT_PASSWORD --all-databases > mysql-backup.sql

# MySQL restore
kubectl exec -i deployment/mysql -n production -- mysql -u root -p$MYSQL_ROOT_PASSWORD < mysql-backup.sql

# MongoDB backup
kubectl exec deployment/mongodb -n production -- mongodump --out=/backup

# MongoDB restore
kubectl exec deployment/mongodb -n production -- mongorestore /backup

# Redis backup
kubectl exec deployment/redis -n production -- redis-cli BGSAVE

# Redis restore: copy the RDB file in, then stop Redis without saving
# so it reloads the file on restart
kubectl cp redis-backup.rdb production/redis-pod:/data/dump.rdb
kubectl exec deployment/redis -n production -- redis-cli SHUTDOWN NOSAVE
```

## Practical Examples
### Example 1: A Complete DR Solution

#### Scenario

Build a complete disaster recovery solution covering scheduled backups, automated recovery, and cluster failover.

#### Configuration
```yaml
# Backup policy
apiVersion: v1
kind: ConfigMap
metadata:
  name: backup-policy
  namespace: production
data:
  policy.yaml: |
    backup:
      etcd:
        enabled: true
        schedule: "*/30 * * * *"
        retention: 7d
      applications:
        enabled: true
        schedule: "0 2 * * *"
        retention: 30d
      databases:
        mysql:
          enabled: true
          schedule: "0 */6 * * *"
          retention: 30d
        mongodb:
          enabled: true
          schedule: "0 */6 * * *"
          retention: 30d
        redis:
          enabled: true
          schedule: "0 */1 * * *"
          retention: 7d
    recovery:
      autoRecovery: true
      healthCheckInterval: 30s
      failureThreshold: 3
    failover:
      enabled: true
      clusters:
      - name: primary
        endpoint: https://cluster1.example.com
      - name: secondary
        endpoint: https://cluster2.example.com
---
# Automated backup job
# Note: this script needs etcdctl and the AWS CLI in addition to kubectl,
# so a real deployment should use a custom image that bundles all three.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: automated-backup
  namespace: production
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              echo "Starting automated backup..."
              # Back up etcd
              ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db
              # Back up application manifests
              kubectl get all -n production -o yaml > /backup/production-$(date +%Y%m%d-%H%M%S).yaml
              # Back up the database
              kubectl exec deployment/mysql -n production -- mysqldump --all-databases > /backup/mysql-$(date +%Y%m%d-%H%M%S).sql
              # Upload to object storage
              aws s3 sync /backup s3://my-k8s-backups/$(date +%Y%m%d)/
              # Prune old backups
              find /backup -mtime +30 -delete
              echo "Backup completed successfully!"
            volumeMounts:
            - name: backup
              mountPath: /backup
          volumes:
          - name: backup
            persistentVolumeClaim:
              claimName: backup-pvc
          restartPolicy: OnFailure
---
# Health check and automated recovery
apiVersion: batch/v1
kind: CronJob
metadata:
  name: health-check-and-recovery
  namespace: production
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: health-check
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              echo "Performing health check..."
              # Check the critical services
              if ! kubectl get pods -n production -l app=critical | grep Running; then
                echo "Critical services are down, initiating recovery..."
                # Try restarting the service
                kubectl rollout restart deployment/critical-service -n production
                # Wait for it to come back
                sleep 60
                # If it is still down, raise an alert
                if ! kubectl get pods -n production -l app=critical | grep Running; then
                  echo "Recovery failed, triggering alert..."
                  # Send an alert notification here
                fi
              fi
              echo "Health check completed."
          restartPolicy: OnFailure
```

### Example 2: Cross-Region DR
#### Scenario

Build a cross-region disaster recovery solution with off-site backup replication and regional failover.

#### Configuration
```yaml
# Cross-region replication settings
apiVersion: v1
kind: ConfigMap
metadata:
  name: cross-region-replication
  namespace: production
data:
  replication.yaml: |
    regions:
    - name: us-west-2
      primary: true
      endpoint: https://cluster-west.example.com
      backupLocation: s3://backup-west
    - name: us-east-1
      primary: false
      endpoint: https://cluster-east.example.com
      backupLocation: s3://backup-east
    replication:
      enabled: true
      syncInterval: 5m
      dataTypes:
      - etcd
      - databases
      - applications
---
# Data synchronization job
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cross-region-sync
  namespace: production
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: sync
            image: amazon/aws-cli:latest
            command:
            - /bin/sh
            - -c
            - |
              echo "Starting cross-region sync..."
              # Replicate backups to the standby region
              # (the destination region is set with --region; --dest-region is not an AWS CLI flag)
              aws s3 sync s3://backup-west s3://backup-east --source-region us-west-2 --region us-east-1
              # Sanity-check the replica by counting objects
              aws s3 ls s3://backup-east --recursive | wc -l
              echo "Cross-region sync completed."
          restartPolicy: OnFailure
---
# Failover policy
apiVersion: v1
kind: ConfigMap
metadata:
  name: failover-config
  namespace: production
data:
  failover.yaml: |
    triggers:
    - type: cluster-down
      threshold: 3
      action: switch-to-secondary
    - type: data-loss
      threshold: 1
      action: restore-from-backup
    - type: performance-degradation
      threshold: 5
      action: scale-out
    notifications:
    - type: email
      recipients:
      - ops@example.com
    - type: slack
      channel: "#alerts"
```

### Example 3: Blue-Green DR Switchover
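At its core, a blue-green switchover just flips which environment is active; everything else (DNS, load balancers) follows from that decision. A tiny sketch of the flip (names are illustrative):

```shell
#!/bin/sh
# Given the currently active environment, return the switchover target.
switch_target() {
  case "$1" in
    blue)  echo green ;;
    green) echo blue ;;
    *)     echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

switch_target blue   # prints: green
```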
#### Scenario

Use a blue-green deployment model for disaster recovery switchover with zero downtime.

#### Configuration
```yaml
# Blue-green DR settings
apiVersion: v1
kind: ConfigMap
metadata:
  name: blue-green-dr
  namespace: production
data:
  blue-green.yaml: |
    environments:
      blue:
        cluster: cluster-blue
        endpoint: https://blue.example.com
        active: true
      green:
        cluster: cluster-green
        endpoint: https://green.example.com
        active: false
    switchStrategy: rolling
    healthCheckTimeout: 300s
    rollbackEnabled: true
---
# DNS switch job
apiVersion: batch/v1
kind: Job
metadata:
  name: dns-switch
  namespace: production
spec:
  template:
    spec:
      containers:
      - name: dns-switch
        image: amazon/aws-cli:latest
        command:
        - /bin/sh
        - -c
        - |
          echo "Switching DNS to the green environment..."
          # Update the Route 53 record
          aws route53 change-resource-record-sets \
            --hosted-zone-id $HOSTED_ZONE_ID \
            --change-batch '{
              "Changes": [
                {
                  "Action": "UPSERT",
                  "ResourceRecordSet": {
                    "Name": "www.example.com",
                    "Type": "CNAME",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "green.example.com"}]
                  }
                }
              ]
            }'
          echo "DNS switch completed."
        env:
        - name: HOSTED_ZONE_ID
          valueFrom:
            configMapKeyRef:
              name: dns-config
              key: hosted-zone-id
      restartPolicy: OnFailure
```

## Troubleshooting Guide
### Issue 1: Backup Failures

#### Symptoms
- Backup jobs fail
- Backup files are corrupted

#### Diagnosis
```bash
# Check backup job logs
kubectl logs job/etcd-backup -n kube-system

# Check Velero backup status
velero backup describe production-backup

# Check the backup storage location
velero backup-location get

# Verify storage permissions
aws s3 ls s3://my-k8s-backups/

# Verify snapshot integrity
ETCDCTL_API=3 etcdctl snapshot status snapshot.db
```

#### Resolution
```yaml
# Increase the backup timeout (a fragment to merge into the backup Job spec;
# BACKUP_TIMEOUT is an illustrative variable that the backup script must read)
spec:
  template:
    spec:
      containers:
      - name: backup
        env:
        - name: BACKUP_TIMEOUT
          value: "600"
```

### Issue 2: Restore Failures
#### Symptoms
- The restore hangs
- Services fail to start after the restore

#### Diagnosis
```bash
# Check restore logs
velero restore logs production-restore

# Check Pod status
kubectl get pods -n production

# Check Pod events
kubectl describe pod <pod-name> -n production

# Check PV status
kubectl get pv

# Check PVC status
kubectl get pvc -n production
```

#### Resolution
```bash
# Recreate PVs manually
kubectl apply -f pv-backup.yaml

# Recreate PVCs manually
kubectl apply -f pvc-backup.yaml

# Re-run the restore, including volumes
velero restore create --from-backup production-backup --restore-volumes=true
```

### Issue 3: Failover Failures
#### Symptoms
- Automatic failover does not trigger
- Services are unavailable after failover

#### Diagnosis
```bash
# Check cluster health (requires a Cluster CRD, e.g. from Cluster API)
kubectl get clusters -n production

# Check the failover configuration
kubectl get configmap failover-config -n production -o yaml

# Check failover job logs
kubectl logs job/failover -n production

# Check DNS resolution
nslookup www.example.com

# Check load balancer services
kubectl get services -n production
```

#### Resolution
```yaml
# Trigger the failover manually
apiVersion: batch/v1
kind: Job
metadata:
  name: manual-failover
  namespace: production
spec:
  template:
    spec:
      containers:
      - name: failover
        image: bitnami/kubectl:latest
        command:
        - /bin/sh
        - -c
        - |
          # Point DNS at the standby environment
          kubectl apply -f dns-green.yaml
          # Update the load balancer
          kubectl apply -f lb-green.yaml
          # Verify the services
          kubectl get pods -n production
      restartPolicy: OnFailure
```

## Best Practices
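The 3-2-1 rule discussed below can be encoded as a quick compliance check: at least 3 copies of the data, on at least 2 different storage media, with at least 1 copy off-site. A minimal sketch:

```shell
#!/bin/sh
# 3-2-1 rule: >=3 copies, >=2 storage media, >=1 off-site copy.
rule_321_ok() {
  copies=$1; media=$2; offsite=$3
  [ "$copies" -ge 3 ] && [ "$media" -ge 2 ] && [ "$offsite" -ge 1 ]
}

if rule_321_ok 3 2 1; then echo "compliant"; fi
if ! rule_321_ok 2 2 1; then echo "not enough copies"; fi
```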
### 1. Backup Strategy

#### The 3-2-1 Rule
- Keep 3 copies of your data
- on 2 different storage media
- with 1 copy off-site

#### Backup Frequency
- Critical data: hourly
- Important data: daily
- Routine data: weekly

#### Backup Validation
- Run restore tests regularly
- Verify data integrity
- Monitor and alert on backup jobs

### 2. Recovery Strategy

#### Recovery Priority
1. Core infrastructure
2. Database services
3. Core business services
4. Auxiliary services

#### Recovery Testing
- Run recovery drills regularly
- Automate recovery tests
- Verify recovery times against the RTO

### 3. Multi-Cluster Management

#### Cluster Planning
- Primary cluster: serves production traffic
- Standby cluster: synchronized in real time, ready to take over
- Test cluster: used to validate backups and restores

#### Data Synchronization
- Real-time sync: critical data
- Scheduled sync: routine data
- Manual sync: historical data

### 4. Monitoring and Alerting
#### Metrics
- Backup success rate
- Backup duration
- Storage consumption
- Restore success rate
- RTO/RPO compliance rate

#### Alerts
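Several of these alerts compare a success-rate metric against a threshold. The rate itself is simple to derive from job history counts; a sketch (function name and example figures are illustrative):

```shell
#!/bin/sh
# Integer percentage of successful backups out of total attempts.
success_rate() {
  awk -v ok="$1" -v total="$2" 'BEGIN {
    if (total == 0) print 0; else print int(ok * 100 / total)
  }'
}

success_rate 27 30   # prints: 90
```

An alert rule would then fire when, say, the rate over the last 30 runs drops below 100%.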
- Backup failure
- Storage space running low
- Restore failure
- Cluster failure

## Summary
Disaster recovery is a critical capability for production Kubernetes environments. In this chapter we covered:

- Backup strategies: etcd backups, Velero backups, and database backups
- Disaster recovery: recovery procedures, Velero restores, and multi-cluster failover
- Multi-cluster management: cluster federation and cross-cluster service discovery
- Practical examples: a complete DR solution, cross-region DR, and blue-green DR switchover
- Troubleshooting: diagnosing and resolving common failures
- Best practices: recommendations drawn from production DR experience

You should now be able to build a reliable backup and disaster recovery system that safeguards business continuity and data safety.

## Next Steps

- Learning Path Summary: plan the rest of your Kubernetes learning journey