Disaster Recovery and Business Continuity
Version: 1.0
Last Updated: 2025-11-10
Status: Production Ready
RTO Target: 1-4 hours (tier-dependent)
RPO Target: 5 minutes - 24 hours (tier-dependent)
Table of Contents
- Introduction
- Backup Strategies
- Recovery Procedures
- RTO and RPO Targets
- Disaster Scenarios
- Backup Automation
- Testing and Validation
- Compliance and Audit
- Incident Response
- Multi-Region Deployment
Introduction
Importance of Disaster Recovery
A comprehensive disaster recovery (DR) strategy is critical for OctoLLM's operational resilience and business continuity. Without proper DR capabilities:
Business Impact:
- Service disruption leads to revenue loss
- Customer trust and reputation damage
- SLA violations and contractual penalties
- Competitive disadvantage
Data Loss Consequences:
- Loss of critical task history and knowledge
- User data and preferences unrecoverable
- Training data for model improvements lost
- Audit trails and compliance evidence missing
Security Implications:
- Inability to recover from ransomware attacks
- No rollback capability after security breaches
- Forensic evidence may be destroyed
- Compliance violations (GDPR, SOC 2)
Operational Costs:
- Emergency recovery efforts are expensive
- Extended downtime multiplies costs
- Manual recovery is error-prone and slow
- Loss of productivity across organization
RTO and RPO Targets
Recovery Time Objective (RTO) and Recovery Point Objective (RPO) define acceptable downtime and data loss:
| Service Tier | RTO | RPO | Backup Frequency | Use Case |
|---|---|---|---|---|
| Critical | 1 hour | 5 minutes | Continuous + Hourly | Orchestrator, PostgreSQL |
| Important | 4 hours | 1 hour | Every 6 hours | Arms, Redis, Qdrant |
| Standard | 24 hours | 24 hours | Daily | Logs, Metrics, Analytics |
| Archive | 7 days | 7 days | Weekly | Historical data, Compliance |
RTO (Recovery Time Objective):
- Maximum acceptable downtime
- Time to restore service functionality
- Includes detection, decision-making, and recovery
RPO (Recovery Point Objective):
- Maximum acceptable data loss
- Time between last backup and failure
- Determines backup frequency
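As a worked example, the Critical tier's 5-minute RPO is what drives the continuous WAL archiving configured later in this document: worst-case loss is bounded by how often a (possibly partial) WAL segment is shipped to the archive, plus upload lag.
# For a 5-minute RPO, the corresponding PostgreSQL setting used later in this document:
archive_timeout = 300   # seconds; force a WAL segment switch + archive at least every 5 minutes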
Disaster Scenarios
OctoLLM DR planning covers these disaster categories:
Infrastructure Failures
- Hardware failures (disk, network, compute)
- Complete cluster failure
- Data center outage
- Network partition
Data Disasters
- Database corruption
- Accidental deletion
- Data inconsistency
- Storage system failure
Security Incidents
- Ransomware attack
- Data breach with compromise
- Unauthorized access
- Malicious insider actions
Operational Errors
- Failed deployment
- Configuration errors
- Software bugs causing data corruption
- Accidental infrastructure deletion
Natural Disasters
- Regional power outage
- Natural disasters (earthquake, flood, fire)
- Catastrophic facility failure
DR Strategy Overview
OctoLLM implements a multi-layered DR strategy:
graph TB
subgraph "Layer 1: High Availability"
HA[Pod Replication]
LB[Load Balancing]
HK[Health Checks]
end
subgraph "Layer 2: Continuous Backup"
WAL[WAL Archiving]
SNAP[Snapshots]
REPL[Replication]
end
subgraph "Layer 3: Offsite Backup"
S3[S3 Storage]
GEO[Geographic Redundancy]
ENC[Encryption]
end
subgraph "Layer 4: DR Automation"
AUTO[Automated Recovery]
TEST[Regular Testing]
MON[Monitoring]
end
HA --> WAL
LB --> SNAP
HK --> REPL
WAL --> S3
SNAP --> GEO
REPL --> ENC
S3 --> AUTO
GEO --> TEST
ENC --> MON
style HA fill:#9f9,stroke:#333
style WAL fill:#ff9,stroke:#333
style S3 fill:#f99,stroke:#333
style AUTO fill:#99f,stroke:#333
Defense in Depth Approach:
- Prevention: Redundancy, health checks, validation
- Protection: Continuous backups, replication, versioning
- Detection: Monitoring, alerting, anomaly detection
- Response: Automated failover, manual procedures
- Recovery: Point-in-time restore, full restoration
- Learning: Post-incident reviews, process improvement
Backup Strategies
PostgreSQL Backups
PostgreSQL is the authoritative source of truth for structured data and requires a comprehensive backup strategy.
Continuous Archiving with WAL
Write-Ahead Logging (WAL) provides continuous backup capability:
---
# PostgreSQL ConfigMap with WAL archiving
apiVersion: v1
kind: ConfigMap
metadata:
name: postgresql-config
namespace: octollm
data:
postgresql.conf: |
# WAL Configuration
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://octollm-backups/wal/%f --region us-east-1'
archive_timeout = 300
# Checkpoint Configuration
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
max_wal_size = 2GB
min_wal_size = 1GB
# Replication
max_wal_senders = 10
wal_keep_size = 1GB
hot_standby = on
# Performance
shared_buffers = 2GB
effective_cache_size = 6GB
maintenance_work_mem = 512MB
work_mem = 16MB
# Logging
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 0
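After applying this configuration, confirm that WAL segments are actually reaching the archive. A quick check, assuming the postgresql-0 pod name used elsewhere in this document:
# Check archiver progress and failures (failed_count should stay at 0)
kubectl exec -n octollm postgresql-0 -- psql -U octollm -d octollm -c \
"SELECT archived_count, failed_count, last_archived_wal, last_failed_wal FROM pg_stat_archiver;"
# Spot-check that segments are landing in S3
aws s3 ls s3://octollm-backups/wal/ | tail -n 5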
Automated Full Backups
Daily full backups using pg_dump with compression:
---
apiVersion: batch/v1
kind: CronJob
metadata:
name: postgresql-backup
namespace: octollm
labels:
app: postgresql-backup
component: backup
spec:
schedule: "0 2 * * *" # Daily at 2 AM UTC
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 7
failedJobsHistoryLimit: 3
jobTemplate:
spec:
backoffLimit: 3
activeDeadlineSeconds: 3600 # 1 hour timeout
template:
metadata:
labels:
app: postgresql-backup
spec:
restartPolicy: OnFailure
serviceAccountName: backup-service-account
# Security context
securityContext:
runAsUser: 999
runAsGroup: 999
fsGroup: 999
containers:
- name: backup
image: postgres:15-alpine
imagePullPolicy: IfNotPresent
env:
# PostgreSQL connection
- name: PGHOST
value: postgresql
- name: PGPORT
value: "5432"
- name: PGDATABASE
value: octollm
- name: PGUSER
valueFrom:
secretKeyRef:
name: octollm-postgres-secret
key: username
- name: PGPASSWORD
valueFrom:
secretKeyRef:
name: octollm-postgres-secret
key: password
# AWS credentials
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: aws-credentials
key: access-key-id
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
name: aws-credentials
key: secret-access-key
- name: AWS_DEFAULT_REGION
value: us-east-1
# Backup configuration
- name: BACKUP_BUCKET
value: s3://octollm-backups
- name: RETENTION_DAYS
value: "30"
command:
- /bin/sh
- -c
- |
set -e
# Generate timestamp
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
BACKUP_FILE="octollm-${TIMESTAMP}.sql.gz"
BACKUP_PATH="/backups/${BACKUP_FILE}"
echo "==================================="
echo "PostgreSQL Backup Starting"
echo "Timestamp: $(date)"
echo "Database: ${PGDATABASE}"
echo "==================================="
# Create backup directory
mkdir -p /backups
# Full database dump with compression
echo "Creating database dump..."
pg_dump -Fc \
--verbose \
--no-owner \
--no-acl \
--clean \
--if-exists \
${PGDATABASE} | gzip -9 > "${BACKUP_PATH}"
# Verify backup file exists
if [ ! -f "${BACKUP_PATH}" ]; then
echo "ERROR: Backup file not created"
exit 1
fi
# Check backup size
BACKUP_SIZE=$(stat -c%s "${BACKUP_PATH}" 2>/dev/null || stat -f%z "${BACKUP_PATH}")
BACKUP_SIZE_MB=$((BACKUP_SIZE / 1024 / 1024))
echo "Backup size: ${BACKUP_SIZE_MB} MB"
# Minimum size check (should be at least 1MB)
if [ ${BACKUP_SIZE_MB} -lt 1 ]; then
echo "ERROR: Backup size too small (${BACKUP_SIZE_MB} MB)"
exit 1
fi
# Upload to S3
echo "Uploading to S3..."
aws s3 cp "${BACKUP_PATH}" \
"${BACKUP_BUCKET}/postgresql/${BACKUP_FILE}" \
--storage-class STANDARD_IA \
--server-side-encryption AES256
# Verify S3 upload
if ! aws s3 ls "${BACKUP_BUCKET}/postgresql/${BACKUP_FILE}"; then
echo "ERROR: S3 upload verification failed"
exit 1
fi
echo "Backup uploaded successfully"
# Create metadata file
cat > /backups/metadata.json <<EOF
{
"timestamp": "${TIMESTAMP}",
"database": "${PGDATABASE}",
"backup_file": "${BACKUP_FILE}",
"size_bytes": ${BACKUP_SIZE},
"size_mb": ${BACKUP_SIZE_MB},
"s3_path": "${BACKUP_BUCKET}/postgresql/${BACKUP_FILE}",
"pg_version": "$(pg_dump --version | head -n1)",
"completed_at": "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}
EOF
# Upload metadata
aws s3 cp /backups/metadata.json \
"${BACKUP_BUCKET}/postgresql/metadata-${TIMESTAMP}.json"
# Clean up local files older than retention period
echo "Cleaning up old local backups..."
find /backups -name "octollm-*.sql.gz" -mtime +${RETENTION_DAYS} -delete
# Test backup integrity (if small enough)
if [ ${BACKUP_SIZE_MB} -lt 100 ]; then
echo "Testing backup integrity..."
gunzip -c "${BACKUP_PATH}" | pg_restore --list > /dev/null
if [ $? -eq 0 ]; then
echo "Backup integrity test passed"
else
echo "WARNING: Backup integrity test failed"
fi
fi
echo "==================================="
echo "Backup completed successfully"
echo "File: ${BACKUP_FILE}"
echo "Size: ${BACKUP_SIZE_MB} MB"
echo "==================================="
resources:
requests:
memory: "512Mi"
cpu: "500m"
limits:
memory: "2Gi"
cpu: "2000m"
volumeMounts:
- name: backup-storage
mountPath: /backups
volumes:
- name: backup-storage
persistentVolumeClaim:
claimName: backup-pvc
Backup Storage PVC
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: backup-pvc
namespace: octollm
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 100Gi
storageClassName: fast-ssd
S3 Lifecycle Policy
Automate backup retention and cost optimization:
{
"Rules": [
{
"Id": "PostgreSQL-Backup-Lifecycle",
"Status": "Enabled",
"Filter": {
"Prefix": "postgresql/"
},
"Transitions": [
{
"Days": 7,
"StorageClass": "STANDARD_IA"
},
{
"Days": 30,
"StorageClass": "GLACIER_IR"
},
{
"Days": 90,
"StorageClass": "DEEP_ARCHIVE"
}
],
"Expiration": {
"Days": 365
}
}
]
}
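This policy is applied to the bucket with the AWS CLI; the local file name below is illustrative:
# Apply the lifecycle rules to the backup bucket
aws s3api put-bucket-lifecycle-configuration \
--bucket octollm-backups \
--lifecycle-configuration file://postgresql-lifecycle.json
# Confirm the rules took effect
aws s3api get-bucket-lifecycle-configuration --bucket octollm-backups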
Backup Monitoring
Monitor backup success and failures:
import boto3
import json
from datetime import datetime, timedelta
import structlog
logger = structlog.get_logger()
class BackupMonitor:
"""Monitor PostgreSQL backup health."""
def __init__(self, s3_bucket: str):
self.s3_client = boto3.client('s3')
self.s3_bucket = s3_bucket
def check_backup_health(self) -> dict:
"""Check if recent backup exists and is valid."""
# List recent backups
response = self.s3_client.list_objects_v2(
Bucket=self.s3_bucket,
Prefix='postgresql/',
MaxKeys=10
)
if 'Contents' not in response:
return {
"status": "critical",
"message": "No backups found",
"last_backup": None
}
# Sort by last modified
backups = sorted(
response['Contents'],
key=lambda x: x['LastModified'],
reverse=True
)
latest_backup = backups[0]
backup_age = datetime.now(latest_backup['LastModified'].tzinfo) - latest_backup['LastModified']
# Check backup age
if backup_age > timedelta(days=2):
status = "critical"
message = f"Last backup is {backup_age.days} days old"
elif backup_age > timedelta(hours=25):
status = "warning"
message = f"Last backup is {backup_age.total_seconds() / 3600:.1f} hours old"
else:
status = "healthy"
message = "Backups are current"
# Check backup size
size_mb = latest_backup['Size'] / (1024 * 1024)
if size_mb < 1:
status = "critical"
message = f"Latest backup suspiciously small: {size_mb:.2f} MB"
return {
"status": status,
"message": message,
"last_backup": latest_backup['LastModified'].isoformat(),
"backup_age_hours": backup_age.total_seconds() / 3600,
"backup_size_mb": size_mb,
"backup_key": latest_backup['Key']
}
def verify_backup_integrity(self, backup_key: str) -> bool:
"""Download and verify backup integrity."""
try:
# Download metadata (backups are written as postgresql/octollm-<ts>.sql.gz,
# metadata as postgresql/metadata-<ts>.json)
metadata_key = backup_key.replace('octollm-', 'metadata-').replace('.sql.gz', '.json')
response = self.s3_client.get_object(
Bucket=self.s3_bucket,
Key=metadata_key
)
metadata = json.loads(response['Body'].read())
# Verify size matches
backup_obj = self.s3_client.head_object(
Bucket=self.s3_bucket,
Key=backup_key
)
if backup_obj['ContentLength'] != metadata['size_bytes']:
logger.error(
"backup_size_mismatch",
expected=metadata['size_bytes'],
actual=backup_obj['ContentLength']
)
return False
return True
except Exception as e:
logger.error("backup_verification_failed", error=str(e))
return False
# Prometheus metrics
from prometheus_client import Gauge, Counter
backup_age_hours = Gauge(
'octollm_postgresql_backup_age_hours',
'Hours since last successful backup'
)
backup_size_mb = Gauge(
'octollm_postgresql_backup_size_mb',
'Size of latest backup in MB'
)
backup_failures = Counter(
'octollm_postgresql_backup_failures_total',
'Total number of backup failures'
)
# Monitor backup health
monitor = BackupMonitor(s3_bucket='octollm-backups')
health = monitor.check_backup_health()
backup_age_hours.set(health['backup_age_hours'])
backup_size_mb.set(health['backup_size_mb'])
if health['status'] in ['critical', 'warning']:
backup_failures.inc()
logger.warning("backup_health_issue", **health)
Qdrant Vector Store Backups
Vector embeddings require specialized backup procedures.
Snapshot-Based Backups
from qdrant_client import QdrantClient
from qdrant_client.models import SnapshotDescription
import boto3
from datetime import datetime
from typing import Dict, List, Optional
import structlog
logger = structlog.get_logger()
class QdrantBackupManager:
"""Manage Qdrant vector store backups."""
def __init__(self, qdrant_url: str, s3_bucket: str):
self.client = QdrantClient(url=qdrant_url)
self.s3_client = boto3.client('s3')
self.s3_bucket = s3_bucket
async def backup_all_collections(self) -> Dict[str, str]:
"""Create snapshots of all collections and upload to S3."""
timestamp = datetime.utcnow().strftime("%Y%m%d-%H%M%S")
results = {}
# Get all collections
collections = self.client.get_collections().collections
logger.info(
"qdrant_backup_started",
timestamp=timestamp,
collections=[c.name for c in collections]
)
for collection in collections:
try:
# Create snapshot
snapshot_info = self.client.create_snapshot(
collection_name=collection.name
)
logger.info(
"snapshot_created",
collection=collection.name,
snapshot=snapshot_info.name
)
# Download snapshot
# (NOTE: snapshot download/upload helpers vary across qdrant-client
# versions; some releases expose only the raw REST snapshot endpoints)
snapshot_data = self.client.download_snapshot(
collection_name=collection.name,
snapshot_name=snapshot_info.name
)
# Upload to S3
s3_key = f"qdrant/{collection.name}/{timestamp}-{snapshot_info.name}"
self.s3_client.put_object(
Bucket=self.s3_bucket,
Key=s3_key,
Body=snapshot_data,
ServerSideEncryption='AES256',
StorageClass='STANDARD_IA'
)
logger.info(
"snapshot_uploaded",
collection=collection.name,
s3_key=s3_key
)
results[collection.name] = s3_key
# Delete local snapshot (save space)
self.client.delete_snapshot(
collection_name=collection.name,
snapshot_name=snapshot_info.name
)
except Exception as e:
logger.error(
"snapshot_backup_failed",
collection=collection.name,
error=str(e)
)
results[collection.name] = f"ERROR: {str(e)}"
logger.info("qdrant_backup_completed", results=results)
return results
async def restore_collection(
self,
collection_name: str,
snapshot_s3_key: str,
overwrite: bool = False
) -> bool:
"""Restore collection from S3 snapshot."""
try:
# Download from S3
response = self.s3_client.get_object(
Bucket=self.s3_bucket,
Key=snapshot_s3_key
)
snapshot_data = response['Body'].read()
# Write to temp file
import tempfile
with tempfile.NamedTemporaryFile(delete=False, suffix='.snapshot') as f:
f.write(snapshot_data)
snapshot_path = f.name
# Delete existing collection if overwrite
if overwrite:
try:
self.client.delete_collection(collection_name)
logger.info("collection_deleted_for_restore", collection=collection_name)
except Exception:
pass # Collection might not exist
# Upload snapshot to Qdrant
self.client.upload_snapshot(
collection_name=collection_name,
snapshot_path=snapshot_path
)
# Recover from snapshot
self.client.recover_snapshot(
collection_name=collection_name,
snapshot_name=snapshot_path.split('/')[-1]
)
logger.info("collection_restored", collection=collection_name)
return True
except Exception as e:
logger.error(
"collection_restore_failed",
collection=collection_name,
error=str(e)
)
return False
def list_available_backups(self, collection_name: Optional[str] = None) -> List[Dict]:
"""List available backups from S3."""
prefix = f"qdrant/{collection_name}/" if collection_name else "qdrant/"
response = self.s3_client.list_objects_v2(
Bucket=self.s3_bucket,
Prefix=prefix
)
if 'Contents' not in response:
return []
backups = []
for obj in response['Contents']:
# Parse key to extract info
# Format: qdrant/{collection}/{timestamp}-{snapshot_name}
parts = obj['Key'].split('/')
if len(parts) >= 3:
collection = parts[1]
filename = parts[2]
backups.append({
'collection': collection,
# timestamps are %Y%m%d-%H%M%S, so they span the first two dash-separated fields
'timestamp': '-'.join(filename.split('-')[:2]) if filename.count('-') >= 2 else 'unknown',
's3_key': obj['Key'],
'size_mb': obj['Size'] / (1024 * 1024),
'last_modified': obj['LastModified'].isoformat()
})
return sorted(backups, key=lambda x: x['last_modified'], reverse=True)
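A short usage sketch tying the three methods together; the collection name task_embeddings is illustrative, not a name defined by this document:
import asyncio

async def main():
    manager = QdrantBackupManager(
        qdrant_url="http://qdrant:6333",
        s3_bucket="octollm-backups"
    )
    # Back up everything, then restore the newest snapshot of one collection
    await manager.backup_all_collections()
    backups = manager.list_available_backups("task_embeddings")  # hypothetical collection
    if backups:
        await manager.restore_collection(
            collection_name="task_embeddings",
            snapshot_s3_key=backups[0]["s3_key"],  # list is sorted newest-first
            overwrite=True
        )

asyncio.run(main())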
Automated Qdrant Backup CronJob
---
apiVersion: batch/v1
kind: CronJob
metadata:
name: qdrant-backup
namespace: octollm
spec:
schedule: "0 */6 * * *" # Every 6 hours
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 7
failedJobsHistoryLimit: 3
jobTemplate:
spec:
template:
metadata:
labels:
app: qdrant-backup
spec:
restartPolicy: OnFailure
serviceAccountName: backup-service-account
containers:
- name: backup
image: octollm/qdrant-backup:1.0
env:
- name: QDRANT_URL
value: "http://qdrant:6333"
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: aws-credentials
key: access-key-id
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
name: aws-credentials
key: secret-access-key
- name: S3_BUCKET
value: "octollm-backups"
command:
- python
- -c
- |
import asyncio
import os
from qdrant_backup import QdrantBackupManager
async def main():
manager = QdrantBackupManager(
qdrant_url=os.environ['QDRANT_URL'],
s3_bucket=os.environ['S3_BUCKET']
)
await manager.backup_all_collections()
asyncio.run(main())
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "1Gi"
cpu: "1000m"
Redis Persistence
Redis stores ephemeral cache data but still requires backup for fast recovery.
Redis Configuration
---
apiVersion: v1
kind: ConfigMap
metadata:
name: redis-config
namespace: octollm
data:
redis.conf: |
# RDB Persistence (redis.conf does not allow trailing comments on
# directive lines, so the thresholds are documented here instead):
# snapshot after 900s if >=1 key changed, 300s if >=10, 60s if >=10000
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb
dir /data
# AOF Persistence
appendonly yes
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
aof-use-rdb-preamble yes
# Memory management
maxmemory 2gb
maxmemory-policy allkeys-lru
# Security
# NOTE: ConfigMap contents are not environment-expanded; substitute the
# password at startup, e.g. redis-server /etc/redis/redis.conf --requirepass "$REDIS_PASSWORD"
requirepass ${REDIS_PASSWORD}
# Logging
loglevel notice
logfile /var/log/redis/redis-server.log
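Before trusting the backups below, confirm persistence is actually active; Redis reports its RDB and AOF state via INFO:
# rdb_last_bgsave_status and aof_last_write_status should both be "ok"
redis-cli -h redis -p 6379 -a "${REDIS_PASSWORD}" INFO persistence | \
grep -E 'rdb_last_bgsave_status|aof_enabled|aof_last_write_status'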
Redis Backup Script
#!/bin/bash
# redis-backup.sh
set -e
REDIS_HOST="${REDIS_HOST:-redis}"
REDIS_PORT="${REDIS_PORT:-6379}"
REDIS_PASSWORD="${REDIS_PASSWORD}"
S3_BUCKET="${S3_BUCKET:-s3://octollm-backups}"
BACKUP_DIR="/backups"
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
BACKUP_FILE="redis-${TIMESTAMP}.rdb"
echo "==================================="
echo "Redis Backup Starting"
echo "Timestamp: $(date)"
echo "==================================="
# Create backup directory
mkdir -p ${BACKUP_DIR}
# Record the last save time, then trigger BGSAVE
# (capturing LASTSAVE first avoids a race if BGSAVE finishes quickly)
LASTSAVE=$(redis-cli -h ${REDIS_HOST} -p ${REDIS_PORT} -a "${REDIS_PASSWORD}" LASTSAVE)
redis-cli -h ${REDIS_HOST} -p ${REDIS_PORT} -a "${REDIS_PASSWORD}" BGSAVE
# Wait for BGSAVE to complete (LASTSAVE advances when the snapshot is written)
while true; do
sleep 5
NEWSAVE=$(redis-cli -h ${REDIS_HOST} -p ${REDIS_PORT} -a "${REDIS_PASSWORD}" LASTSAVE)
if [ "${NEWSAVE}" != "${LASTSAVE}" ]; then
break
fi
done
echo "BGSAVE completed"
# Copy RDB file
kubectl exec -n octollm redis-0 -- cat /data/dump.rdb > ${BACKUP_DIR}/${BACKUP_FILE}
# Compress
gzip ${BACKUP_DIR}/${BACKUP_FILE}
# Upload to S3
aws s3 cp ${BACKUP_DIR}/${BACKUP_FILE}.gz \
${S3_BUCKET}/redis/${BACKUP_FILE}.gz \
--storage-class STANDARD_IA
echo "Backup uploaded successfully"
# Clean up
rm ${BACKUP_DIR}/${BACKUP_FILE}.gz
# Verify
if aws s3 ls ${S3_BUCKET}/redis/${BACKUP_FILE}.gz; then
echo "Backup verified in S3"
else
echo "ERROR: Backup verification failed"
exit 1
fi
echo "==================================="
echo "Backup completed successfully"
echo "==================================="
Kubernetes Cluster Backups
Use Velero for comprehensive cluster-level backups.
Velero Installation
# Install Velero CLI
wget https://github.com/vmware-tanzu/velero/releases/download/v1.12.0/velero-v1.12.0-linux-amd64.tar.gz
tar -xvf velero-v1.12.0-linux-amd64.tar.gz
sudo mv velero-v1.12.0-linux-amd64/velero /usr/local/bin/
# Install Velero in cluster
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.8.0 \
--bucket octollm-velero-backups \
--backup-location-config region=us-east-1 \
--snapshot-location-config region=us-east-1 \
--secret-file ./credentials-velero
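Before relying on the schedules below, verify that Velero can reach the object store and complete a backup; the test backup name is illustrative:
# Confirm the backup storage location is Available
velero backup-location get
# Take a one-off test backup of the octollm namespace and inspect it
velero backup create octollm-install-test --include-namespaces octollm --wait
velero backup describe octollm-install-test --details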
Scheduled Backups
---
# Daily full cluster backup
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: octollm-daily-backup
namespace: velero
spec:
schedule: "0 1 * * *" # Daily at 1 AM
template:
includedNamespaces:
- octollm
excludedNamespaces: []
includedResources:
- '*'
excludedResources:
- events
- events.events.k8s.io
includeClusterResources: true
snapshotVolumes: true
ttl: 720h # 30 days
storageLocation: default
volumeSnapshotLocations:
- default
labelSelector:
matchLabels:
backup: "true"
---
# Hourly backup of critical resources
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: octollm-hourly-critical
namespace: velero
spec:
schedule: "0 * * * *" # Every hour
template:
includedNamespaces:
- octollm
includedResources:
- configmaps
- secrets
- persistentvolumeclaims
- deployments
- statefulsets
excludedResources:
- events
snapshotVolumes: true
ttl: 168h # 7 days
storageLocation: default
labelSelector:
matchLabels:
tier: critical
Configuration and Secrets Backups
Backup Kubernetes configurations and secrets securely.
Backup Script
#!/bin/bash
# backup-k8s-configs.sh
set -e
NAMESPACE="octollm"
BACKUP_DIR="/backups/k8s-configs"
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
S3_BUCKET="s3://octollm-backups"
echo "Backing up Kubernetes configurations..."
mkdir -p ${BACKUP_DIR}/${TIMESTAMP}
# Backup ConfigMaps
kubectl get configmaps -n ${NAMESPACE} -o yaml > ${BACKUP_DIR}/${TIMESTAMP}/configmaps.yaml
# Backup Secrets (encrypted)
kubectl get secrets -n ${NAMESPACE} -o yaml > ${BACKUP_DIR}/${TIMESTAMP}/secrets.yaml
# Backup Deployments
kubectl get deployments -n ${NAMESPACE} -o yaml > ${BACKUP_DIR}/${TIMESTAMP}/deployments.yaml
# Backup StatefulSets
kubectl get statefulsets -n ${NAMESPACE} -o yaml > ${BACKUP_DIR}/${TIMESTAMP}/statefulsets.yaml
# Backup Services
kubectl get services -n ${NAMESPACE} -o yaml > ${BACKUP_DIR}/${TIMESTAMP}/services.yaml
# Backup PVCs
kubectl get pvc -n ${NAMESPACE} -o yaml > ${BACKUP_DIR}/${TIMESTAMP}/pvcs.yaml
# Create tarball
tar -czf ${BACKUP_DIR}/k8s-config-${TIMESTAMP}.tar.gz -C ${BACKUP_DIR} ${TIMESTAMP}
# Encrypt with GPG
gpg --encrypt \
--recipient backup@octollm.example.com \
${BACKUP_DIR}/k8s-config-${TIMESTAMP}.tar.gz
# Upload to S3
aws s3 cp ${BACKUP_DIR}/k8s-config-${TIMESTAMP}.tar.gz.gpg \
${S3_BUCKET}/k8s-configs/k8s-config-${TIMESTAMP}.tar.gz.gpg
# Clean up
rm -rf ${BACKUP_DIR}/${TIMESTAMP}
rm ${BACKUP_DIR}/k8s-config-${TIMESTAMP}.tar.gz*
echo "Kubernetes configurations backed up successfully"
Recovery Procedures
Point-in-Time Recovery (PITR)
Restore PostgreSQL to a specific point in time using WAL archives.
PITR Script
#!/bin/bash
# restore-postgres-pitr.sh
set -e
# Configuration
TARGET_TIME="${1:-$(date -u +"%Y-%m-%d %H:%M:%S UTC")}"
POSTGRES_NAMESPACE="octollm"
POSTGRES_STATEFULSET="postgresql"
BACKUP_BUCKET="s3://octollm-backups"
RESTORE_DIR="/restore"
echo "==================================="
echo "PostgreSQL Point-in-Time Recovery"
echo "Target Time: ${TARGET_TIME}"
echo "==================================="
# Step 1: Stop PostgreSQL
echo "Stopping PostgreSQL..."
kubectl scale statefulset ${POSTGRES_STATEFULSET} -n ${POSTGRES_NAMESPACE} --replicas=0
# Wait for pods to terminate
kubectl wait --for=delete pod -l app=postgresql -n ${POSTGRES_NAMESPACE} --timeout=300s
# Step 2: Download latest physical base backup
# NOTE: PITR replays archived WAL on top of a *physical* base backup
# (pg_basebackup output); a logical pg_dump cannot be used for WAL replay.
# The basebackup/ prefix is an assumption about the archive layout.
echo "Downloading base backup..."
LATEST_BACKUP=$(aws s3 ls ${BACKUP_BUCKET}/basebackup/ | sort | tail -n 1 | awk '{print $4}')
aws s3 cp ${BACKUP_BUCKET}/basebackup/${LATEST_BACKUP} ${RESTORE_DIR}/basebackup.tar.gz
# Step 3: Restore base backup into the data directory
# The StatefulSet is scaled to zero at this point, so run these commands from
# a maintenance pod that mounts the PostgreSQL data PVC at /var/lib/postgresql/data.
echo "Restoring base backup..."
rm -rf /var/lib/postgresql/data/*
tar -xzf ${RESTORE_DIR}/basebackup.tar.gz -C /var/lib/postgresql/data
# Step 4: Configure recovery (PostgreSQL 12+ reads recovery settings from
# postgresql.auto.conf plus a recovery.signal file; recovery.conf is only
# valid on PostgreSQL 11 and earlier)
echo "Configuring point-in-time recovery..."
cat >> /var/lib/postgresql/data/postgresql.auto.conf <<EOF
restore_command = 'aws s3 cp ${BACKUP_BUCKET}/wal/%f %p'
recovery_target_time = '${TARGET_TIME}'
recovery_target_action = 'promote'
EOF
touch /var/lib/postgresql/data/recovery.signal
# Step 5: Start PostgreSQL in recovery mode
echo "Starting PostgreSQL in recovery mode..."
kubectl scale statefulset ${POSTGRES_STATEFULSET} -n ${POSTGRES_NAMESPACE} --replicas=1
# Wait for recovery to complete
echo "Waiting for recovery to complete..."
sleep 30
# Step 6: Verify recovery
echo "Verifying recovery..."
kubectl exec -n ${POSTGRES_NAMESPACE} postgresql-0 -- psql -U octollm -d octollm -c "\
SELECT pg_is_in_recovery(), \
pg_last_wal_replay_lsn(), \
now() - pg_last_xact_replay_timestamp() AS replication_lag;"
echo "==================================="
echo "Recovery completed successfully"
echo "==================================="
Recovery Configuration
# recovery.conf (PostgreSQL 11 and earlier only; '#' is the comment character)
restore_command = 'aws s3 cp s3://octollm-backups/wal/%f %p'
recovery_target_time = '2025-11-10 14:30:00 UTC'
recovery_target_action = 'promote'
# For PostgreSQL 12+, put the same settings in postgresql.conf (or
# postgresql.auto.conf) and create a signal file instead:
# touch /var/lib/postgresql/data/recovery.signal
Full Database Restoration
Complete database restoration from backup.
Restoration Script
#!/bin/bash
# restore-postgres-full.sh
set -e
BACKUP_FILE="${1}"
POSTGRES_NAMESPACE="octollm"
POSTGRES_STATEFULSET="postgresql"
BACKUP_BUCKET="s3://octollm-backups"
if [ -z "${BACKUP_FILE}" ]; then
echo "Usage: $0 <backup_file>"
echo "Available backups:"
aws s3 ls ${BACKUP_BUCKET}/postgresql/
exit 1
fi
echo "==================================="
echo "PostgreSQL Full Restoration"
echo "Backup: ${BACKUP_FILE}"
echo "==================================="
# Confirmation prompt
read -p "This will DELETE all current data. Continue? (yes/no): " CONFIRM
if [ "${CONFIRM}" != "yes" ]; then
echo "Restoration cancelled"
exit 0
fi
# Step 1: Scale down PostgreSQL
echo "Scaling down PostgreSQL..."
kubectl scale statefulset ${POSTGRES_STATEFULSET} -n ${POSTGRES_NAMESPACE} --replicas=0
kubectl wait --for=delete pod -l app=postgresql -n ${POSTGRES_NAMESPACE} --timeout=300s
# Step 2: Download backup
echo "Downloading backup..."
aws s3 cp ${BACKUP_BUCKET}/postgresql/${BACKUP_FILE} /tmp/restore.sql.gz
# Step 3: Verify backup integrity
echo "Verifying backup integrity..."
if ! gunzip -t /tmp/restore.sql.gz; then
echo "ERROR: Backup file is corrupted"
exit 1
fi
# Step 4: Scale up PostgreSQL
echo "Starting PostgreSQL..."
kubectl scale statefulset ${POSTGRES_STATEFULSET} -n ${POSTGRES_NAMESPACE} --replicas=1
kubectl wait --for=condition=ready pod -l app=postgresql -n ${POSTGRES_NAMESPACE} --timeout=300s
# Step 5: Drop existing database
echo "Dropping existing database..."
kubectl exec -n ${POSTGRES_NAMESPACE} postgresql-0 -- psql -U postgres -c "DROP DATABASE IF EXISTS octollm;"
kubectl exec -n ${POSTGRES_NAMESPACE} postgresql-0 -- psql -U postgres -c "CREATE DATABASE octollm OWNER octollm;"
# Step 6: Restore backup
echo "Restoring backup..."
gunzip -c /tmp/restore.sql.gz | kubectl exec -i -n ${POSTGRES_NAMESPACE} postgresql-0 -- \
pg_restore \
--verbose \
--no-owner \
--no-acl \
--clean \
--if-exists \
-U octollm \
-d octollm
# Step 7: Verify restoration
echo "Verifying restoration..."
TABLES=$(kubectl exec -n ${POSTGRES_NAMESPACE} postgresql-0 -- psql -U octollm -d octollm -t -A -c "\
SELECT COUNT(*) FROM information_schema.tables WHERE table_schema = 'public';" | tr -d '[:space:]')
echo "Tables restored: ${TABLES}"
if [ "${TABLES}" -eq 0 ]; then
echo "ERROR: No tables found after restoration"
exit 1
fi
# Step 8: Run ANALYZE
echo "Running ANALYZE..."
kubectl exec -n ${POSTGRES_NAMESPACE} postgresql-0 -- psql -U octollm -d octollm -c "ANALYZE;"
# Step 9: Verify data integrity
echo "Verifying data integrity..."
kubectl exec -n ${POSTGRES_NAMESPACE} postgresql-0 -- psql -U octollm -d octollm -c "\
SELECT 'entities' AS table_name, COUNT(*) FROM entities
UNION ALL
SELECT 'task_history', COUNT(*) FROM task_history
UNION ALL
SELECT 'action_log', COUNT(*) FROM action_log;"
# Clean up
rm /tmp/restore.sql.gz
echo "==================================="
echo "Restoration completed successfully"
echo "==================================="
Partial Recovery
Restore specific tables or data without full restoration.
#!/bin/bash
# restore-postgres-partial.sh
set -e
BACKUP_FILE="${1}"
TABLE_NAME="${2}"
POSTGRES_NAMESPACE="octollm"
if [ -z "${BACKUP_FILE}" ] || [ -z "${TABLE_NAME}" ]; then
echo "Usage: $0 <backup_file> <table_name>"
exit 1
fi
echo "Partial restoration: ${TABLE_NAME} from ${BACKUP_FILE}"
# Download backup
aws s3 cp s3://octollm-backups/postgresql/${BACKUP_FILE} /tmp/backup.sql.gz
# Extract and restore specific table
# (assumes PGHOST/PGUSER/PGPASSWORD point at the target database,
# e.g. via kubectl port-forward svc/postgresql 5432)
gunzip -c /tmp/backup.sql.gz | pg_restore \
--verbose \
--no-owner \
--no-acl \
--table=${TABLE_NAME} \
-U octollm \
-d octollm
rm /tmp/backup.sql.gz
echo "Partial restoration completed"
Cluster Recovery
Restore entire Kubernetes cluster using Velero.
#!/bin/bash
# velero-restore.sh
set -e
BACKUP_NAME="${1}"
if [ -z "${BACKUP_NAME}" ]; then
echo "Usage: $0 <backup_name>"
echo "Available backups:"
velero backup get
exit 1
fi
echo "==================================="
echo "Cluster Recovery with Velero"
echo "Backup: ${BACKUP_NAME}"
echo "==================================="
# Confirmation
read -p "Restore from backup ${BACKUP_NAME}? (yes/no): " CONFIRM
if [ "${CONFIRM}" != "yes" ]; then
echo "Restore cancelled"
exit 0
fi
# Create an explicitly named restore so later commands can reference it
# (otherwise Velero generates a timestamped restore name)
RESTORE_NAME="${BACKUP_NAME}-restore-$(date +%Y%m%d%H%M%S)"
velero restore create ${RESTORE_NAME} --from-backup ${BACKUP_NAME}
# Monitor restore progress
echo "Monitoring restore progress..."
velero restore describe ${RESTORE_NAME} --details
# Wait for completion
while true; do
STATUS=$(velero restore get | grep ${RESTORE_NAME} | awk '{print $3}')
if [ "${STATUS}" = "Completed" ]; then
echo "Restore completed successfully"
break
elif [ "${STATUS}" = "Failed" ] || [ "${STATUS}" = "PartiallyFailed" ]; then
echo "ERROR: Restore failed or partially failed"
velero restore logs ${RESTORE_NAME}
exit 1
fi
echo "Restore status: ${STATUS}"
sleep 10
done
# Verify pods are running
echo "Verifying pods..."
kubectl get pods -n octollm
echo "==================================="
echo "Cluster recovery completed"
echo "==================================="
Emergency Procedures
Critical Service Down
#!/bin/bash
# emergency-recovery.sh
set -e
SERVICE="${1}"
case ${SERVICE} in
"postgresql")
echo "Emergency PostgreSQL recovery..."
# Try restarting first
kubectl rollout restart statefulset/postgresql -n octollm
# If restart fails, restore from latest backup
if ! kubectl wait --for=condition=ready pod -l app=postgresql -n octollm --timeout=300s; then
echo "Restart failed, restoring from backup..."
LATEST_BACKUP=$(aws s3 ls s3://octollm-backups/postgresql/ | sort | tail -n 1 | awk '{print $4}')
./restore-postgres-full.sh ${LATEST_BACKUP}
fi
;;
"qdrant")
echo "Emergency Qdrant recovery..."
kubectl rollout restart statefulset/qdrant -n octollm
;;
"orchestrator")
echo "Emergency Orchestrator recovery..."
kubectl rollout restart deployment/orchestrator -n octollm
;;
*)
echo "Unknown service: ${SERVICE}"
echo "Supported services: postgresql, qdrant, orchestrator"
exit 1
;;
esac
echo "Emergency recovery initiated for ${SERVICE}"
RTO and RPO Targets
Service Tier Definitions
| Tier | Services | Description |
|---|---|---|
| Critical | Orchestrator, PostgreSQL, API Gateway | Core services required for operation |
| Important | Arms (all), Qdrant, Redis | Specialist services and data stores |
| Standard | Monitoring, Logging, Metrics | Observability and support services |
| Archive | Historical data, Audit logs | Long-term storage and compliance |
Recovery Time Objectives
| Tier | RTO | Justification | Recovery Procedure |
|---|---|---|---|
| Critical | 1 hour | Service disruption impacts all users | Automated failover + hot standby |
| Important | 4 hours | Graceful degradation possible | Restore from backup + warm standby |
| Standard | 24 hours | Non-essential for core operation | Manual restore from daily backup |
| Archive | 7 days | Historical data, rarely accessed | Restore from cold storage |
Recovery Point Objectives
| Tier | RPO | Backup Frequency | Acceptable Data Loss |
|---|---|---|---|
| Critical | 5 minutes | Continuous (WAL) + Hourly | <5 minutes of transactions |
| Important | 1 hour | Every 6 hours | <1 hour of task history |
| Standard | 24 hours | Daily | <24 hours of logs |
| Archive | 7 days | Weekly | <7 days of historical data |
Testing Schedule
| Test Type | Frequency | Scope | Duration | Success Criteria |
|---|---|---|---|---|
| Backup Verification | Daily | All backups | 15 min | Backup exists, correct size |
| Partial Restore | Weekly | Single table | 30 min | Data restored correctly |
| Full Database Restore | Monthly | PostgreSQL | 2 hours | Complete restoration + validation |
| Cluster Failover | Quarterly | Full cluster | 4 hours | All services operational |
| DR Drill | Annually | Complete DR | 8 hours | Full recovery from zero |
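The weekly partial-restore test in this schedule can be automated along these lines; a minimal sketch, assuming the dump format used above and that the connection environment (PGHOST, PGUSER, PGPASSWORD) points at a scratch server:
#!/bin/bash
# weekly-restore-test.sh (sketch)
set -e
LATEST=$(aws s3 ls s3://octollm-backups/postgresql/ | sort | tail -n 1 | awk '{print $4}')
aws s3 cp s3://octollm-backups/postgresql/${LATEST} /tmp/restore-test.sql.gz
createdb octollm_restore_test
gunzip -c /tmp/restore-test.sql.gz | pg_restore --no-owner --no-acl -d octollm_restore_test
TABLES=$(psql -d octollm_restore_test -t -A -c \
"SELECT COUNT(*) FROM information_schema.tables WHERE table_schema = 'public';")
dropdb octollm_restore_test
[ "${TABLES}" -gt 0 ] || { echo "FAIL: restore test produced no tables"; exit 1; }
echo "PASS: restored ${TABLES} tables from ${LATEST}"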
Disaster Scenarios
Complete Cluster Failure
Scenario: Entire Kubernetes cluster becomes unavailable due to catastrophic failure.
Detection:
- All health checks failing
- No pods responding
- kubectl commands timeout
- Monitoring shows complete outage
Response Procedure:
- Assess Damage (5 minutes)
# Check cluster status
kubectl cluster-info
kubectl get nodes
kubectl get pods --all-namespaces
- Activate DR Plan (10 minutes)
# Notify stakeholders
./notify-incident.sh "Cluster failure detected"
# Provision new cluster if needed
eksctl create cluster \
--name octollm-dr \
--region us-west-2 \
--nodegroup-name standard-workers \
--node-type m5.xlarge \
--nodes 5
- Restore Infrastructure (30 minutes)
# Install Velero
velero install --provider aws ...
# Restore latest cluster backup
LATEST_BACKUP=$(velero backup get | tail -n 1 | awk '{print $1}')
velero restore create --from-backup ${LATEST_BACKUP}
# Monitor restoration
velero restore describe ${LATEST_BACKUP}
- Restore Data Stores (2 hours)
# Restore PostgreSQL
./restore-postgres-full.sh $(latest_postgres_backup)
# Restore Qdrant
./restore-qdrant.sh --all-collections
# Redis will rebuild cache automatically
- Validate Services (30 minutes)
# Run smoke tests
./smoke-tests.sh
# Verify data integrity
./verify-data-integrity.sh
- Resume Operations (15 minutes)
# Update DNS to point to new cluster
./update-dns.sh
# Notify stakeholders of recovery
./notify-incident.sh "Services restored"
Total RTO: ~4 hours
Database Corruption
Scenario: PostgreSQL database becomes corrupted, queries failing.
Detection:
- PostgreSQL errors in logs
- Data integrity check failures
- Query timeouts
- Inconsistent data returned
Response Procedure:
- Isolate Problem (5 minutes)
# Stop writes to database
kubectl scale deployment/orchestrator -n octollm --replicas=0
# Check corruption extent
kubectl exec -n octollm postgresql-0 -- psql -U octollm -c "\
SELECT datname, pg_database_size(datname) \
FROM pg_database WHERE datname = 'octollm';"
- Assess Damage (10 minutes)
# Run integrity checks
kubectl exec -n octollm postgresql-0 -- psql -U octollm -d octollm -c "\
SELECT schemaname, tablename, \
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) \
FROM pg_tables WHERE schemaname = 'public';"
# Check for corrupt tables
kubectl exec -n octollm postgresql-0 -- vacuumdb --analyze-only -U octollm octollm
- Determine Recovery Strategy (5 minutes)
  - Minor corruption: Repair in place
  - Major corruption: Restore from backup
- Execute Recovery (1-2 hours)
Option A: Repair in place (if minor)
# Reindex database
kubectl exec -n octollm postgresql-0 -- psql -U octollm -d octollm -c "REINDEX DATABASE octollm;"
# Run vacuum
kubectl exec -n octollm postgresql-0 -- vacuumdb --full -U octollm octollm
Option B: Restore from backup (if major)
# Point-in-time recovery to before corruption
CORRUPTION_TIME="2025-11-10 10:00:00 UTC"
./restore-postgres-pitr.sh "${CORRUPTION_TIME}"
- Validate Restoration (15 minutes)
# Run data integrity tests
./test-database-integrity.sh
# Verify row counts
kubectl exec -n octollm postgresql-0 -- psql -U octollm -d octollm -c "\
SELECT 'entities', COUNT(*) FROM entities UNION ALL SELECT 'task_history', COUNT(*) FROM task_history;"
- Resume Operations (10 minutes)
# Restart services
kubectl scale deployment/orchestrator -n octollm --replicas=3
# Monitor for issues
kubectl logs -f -l app=orchestrator -n octollm
Total RTO: 2-4 hours (depending on corruption extent)
Accidental Deletion
Scenario: Critical data accidentally deleted by user or system error.
Detection:
- User reports missing data
- Monitoring shows sudden drop in row counts
- Application errors due to missing records
Response Procedure:
- Identify Scope (5 minutes)
-- Check recent deletions in audit log
SELECT * FROM action_log
WHERE action_type = 'DELETE'
AND timestamp > NOW() - INTERVAL '1 hour'
ORDER BY timestamp DESC;
- Stop Further Damage (5 minutes)
# Disable write access temporarily
kubectl scale deployment/orchestrator -n octollm --replicas=0
# Backup current state
pg_dump -U octollm octollm > /tmp/current-state-$(date +%s).sql
- Restore Deleted Data (30 minutes)
Option A: Restore from audit trail (if tracked)
-- Find deleted records in audit
SELECT action_details FROM action_log
WHERE action_type = 'DELETE' AND timestamp > '2025-11-10 10:00:00';
-- Restore records
INSERT INTO entities (id, entity_type, name, properties)
SELECT ... FROM action_log WHERE ...;
Option B: Point-in-time recovery
# Determine deletion time
DELETION_TIME="2025-11-10 10:15:00 UTC"
# Restore to just before deletion
RESTORE_TIME=$(date -d "${DELETION_TIME} -5 minutes" +"%Y-%m-%d %H:%M:%S UTC")
./restore-postgres-pitr.sh "${RESTORE_TIME}"
Option C: Partial restore from backup
# Restore specific tables
./restore-postgres-partial.sh latest-backup.sql.gz entities
- Validate Recovery (10 minutes)
# Verify restored data
./verify-restored-data.sh
# Check for consistency
kubectl exec -n octollm postgresql-0 -- psql -U octollm -d octollm -c "\
SELECT COUNT(*) FROM entities WHERE deleted_at IS NOT NULL;"
- Resume Operations (5 minutes)
kubectl scale deployment/orchestrator -n octollm --replicas=3
Total RTO: 1 hour
Total RPO: 5 minutes (if using PITR)
Security Breach
Scenario: Unauthorized access detected, potential data compromise.
Detection:
- Intrusion detection alerts
- Unusual activity patterns
- Unauthorized API calls
- Data exfiltration detected
Response Procedure:
- Contain Breach (IMMEDIATE)
# Isolate compromised systems
kubectl cordon <compromised-node>
# Block external access
kubectl patch service api-gateway -n octollm -p '{"spec":{"type":"ClusterIP"}}'
# Revoke credentials
./revoke-all-tokens.sh
- Assess Damage (30 minutes)
# Check audit logs
kubectl exec -n octollm postgresql-0 -- psql -U octollm -d octollm -c "\
SELECT * FROM audit_logs WHERE timestamp > NOW() - INTERVAL '24 hours' ORDER BY timestamp DESC;"
# Identify compromised data
./identify-compromised-data.sh
- Preserve Evidence (15 minutes)
# Snapshot all volumes
./snapshot-all-volumes.sh
# Export logs from every pod (kubectl logs requires a pod name or selector)
kubectl get pods -n octollm -o name | \
xargs -I{} kubectl logs {} --all-containers=true -n octollm > /evidence/logs-$(date +%s).txt
# Backup current state
./backup-forensic-evidence.sh
- Rebuild from Clean State (4 hours)
# Create new cluster
eksctl create cluster --name octollm-secure --config secure-cluster.yaml
# Deploy with new credentials
./deploy-octollm.sh --new-credentials
# Restore data from pre-breach backup
LAST_GOOD_BACKUP=$(find_backup_before_breach)
./restore-postgres-full.sh ${LAST_GOOD_BACKUP}
- Strengthen Security (2 hours)
# Rotate all secrets
./rotate-all-secrets.sh
# Update security policies
kubectl apply -f network-policies-strict.yaml
# Enable additional monitoring
./enable-enhanced-monitoring.sh
- Resume Operations (30 minutes)
# Gradual rollout
./gradual-rollout.sh --canary
# Monitor for suspicious activity
./monitor-security-metrics.sh
Total RTO: 8 hours (security takes priority over speed)
Total RPO: Varies based on breach timeline
Regional Outage
Scenario: Entire AWS region becomes unavailable.
Detection:
- AWS status page shows outage
- All services in region unreachable
- Multi-AZ redundancy failing
- Cross-region health checks failing
Response Procedure:
- Confirm Outage (5 minutes)
# Check AWS status
aws health describe-events --region us-east-1
# Verify cross-region connectivity
curl https://health-check.octollm.example.com/us-west-2
- Activate DR Region (15 minutes)
# Switch to DR cluster (us-west-2)
export KUBECONFIG=~/.kube/config-us-west-2
kubectl cluster-info
# Verify DR cluster status
kubectl get pods -n octollm
- Sync Data (1 hour)
# Promote read replica to primary
kubectl exec -n octollm postgresql-0 -- psql -U postgres -c "SELECT pg_promote();"
# Verify data currency
./verify-data-freshness.sh
# If data is stale, restore from S3 (cross-region replicated)
./restore-postgres-full.sh latest-cross-region-backup.sql.gz
- Update DNS (15 minutes)
# Update Route53 to point to DR region
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234567890ABC \
--change-batch file://update-dns-to-dr.json
# Verify DNS propagation
dig api.octollm.example.com
- Monitor Performance (30 minutes)
# Ensure DR region can handle load
kubectl top nodes
kubectl top pods -n octollm
# Scale if necessary
kubectl scale deployment orchestrator -n octollm --replicas=5
- Communicate Status (15 minutes)
# Notify users of region switch
./notify-users.sh "Service restored in alternate region"
# Update status page
./update-status-page.sh "Operational (DR region)"
Total RTO: 2 hours
Total RPO: Depends on replication lag (typically <5 minutes)
Ransomware Attack
Scenario: Ransomware encrypts data, demands payment.
Detection:
- Sudden inability to read data
- Ransom note files appearing
- Unusual file modifications
- Encryption processes detected
Response Procedure:
- Isolate Immediately (IMMEDIATE - 5 minutes)
# Disconnect from network
kubectl patch service api-gateway -n octollm -p '{"spec":{"type":"ClusterIP"}}'
# Stop all pods
kubectl scale deployment --all -n octollm --replicas=0
kubectl scale statefulset --all -n octollm --replicas=0
# Quarantine infected nodes (cordon has no --all flag)
kubectl get nodes -o name | xargs kubectl cordon
- Assess Damage (15 minutes)
# Check which files are encrypted
./identify-encrypted-files.sh
# Determine infection vector
./analyze-attack-vector.sh
# Preserve forensic evidence
./snapshot-compromised-volumes.sh
- DO NOT PAY RANSOM (policy decision)
  - Document the ransom demand
  - Report to law enforcement
  - Proceed with restoration from backups
- Rebuild Infrastructure (2 hours)
# Create completely new cluster
eksctl create cluster --name octollm-clean --config cluster.yaml
# Deploy fresh OctoLLM installation
helm install octollm ./charts/octollm \
--namespace octollm \
--create-namespace \
--values values-production.yaml
- Restore from Clean Backups (2 hours)
# Identify last known good backup (before infection)
LAST_CLEAN_BACKUP=$(identify_clean_backup)
# Verify backup not encrypted
aws s3 cp s3://octollm-backups/postgresql/${LAST_CLEAN_BACKUP} /tmp/test.sql.gz
gunzip -t /tmp/test.sql.gz  # Test integrity
# Restore database
./restore-postgres-full.sh ${LAST_CLEAN_BACKUP}
# Restore vector stores
./restore-qdrant.sh --all-collections --before-date "2025-11-09"
- Security Hardening (2 hours)
# Rotate ALL credentials
./rotate-all-secrets.sh --force
# Update to latest security patches
kubectl set image deployment/orchestrator orchestrator=octollm/orchestrator:latest-patched
# Enable enhanced security
kubectl apply -f network-policies-lockdown.yaml
kubectl apply -f pod-security-policies-strict.yaml
- Validation (1 hour)
# Run security scans
./run-security-scan.sh
# Verify no malware
./malware-scan.sh
# Test all functionality
./integration-tests.sh
- Resume Operations (30 minutes)
# Gradual rollout with monitoring
./gradual-rollout.sh --extra-monitoring
# Notify stakeholders
./notify-stakeholders.sh "Systems restored, enhanced security enabled"
Total RTO: 8 hours
Total RPO: Depends on when infection started (data loss possible)
Configuration Error
Scenario: Incorrect configuration causes service disruption.
Detection:
- Services failing after configuration change
- Validation errors in logs
- Pods in CrashLoopBackOff
- Connectivity issues
Response Procedure:
- Identify Change (5 minutes)
# Check recent changes
kubectl rollout history deployment/orchestrator -n octollm
# View recent configmap changes
kubectl describe configmap octollm-config -n octollm
# Check audit logs
kubectl get events -n octollm --sort-by='.lastTimestamp'
- Rollback Configuration (5 minutes)
# Rollback to previous version
kubectl rollout undo deployment/orchestrator -n octollm
# Or restore from configuration backup
kubectl apply -f /backups/k8s-configs/latest/configmaps.yaml
- Verify Service Restoration (10 minutes)
# Check pod status
kubectl get pods -n octollm
# Verify services responding
curl https://api.octollm.example.com/health
# Run smoke tests
./smoke-tests.sh
- Root Cause Analysis (30 minutes)
# Compare configurations
diff /backups/k8s-configs/latest/configmaps.yaml \
/backups/k8s-configs/previous/configmaps.yaml
# Document issue
./document-incident.sh "Configuration error in orchestrator"
- Fix and Redeploy (1 hour)
# Fix configuration
vim configs/orchestrator-config.yaml
# Validate configuration
./validate-config.sh configs/orchestrator-config.yaml
# Deploy with canary
kubectl apply -f configs/orchestrator-config.yaml
./canary-deploy.sh orchestrator
Total RTO: 1 hour
Total RPO: 0 (no data loss)
Failed Deployment
Scenario: New deployment breaks production services.
Detection:
- Deployment fails validation
- Pods in Error state
- Increased error rates
- User reports of issues
Response Procedure:
- Halt Deployment (IMMEDIATE - 2 minutes)
# Pause rollout
kubectl rollout pause deployment/orchestrator -n octollm
# Scale down new version
kubectl scale deployment/orchestrator -n octollm --replicas=0
- Assess Impact (5 minutes)
# Check error rates
kubectl logs -l app=orchestrator,version=new -n octollm | grep ERROR | wc -l
# Check user impact
./check-user-impact.sh
- Rollback (5 minutes)
# Resume the paused rollout first (undo is refused while paused), then roll back
kubectl rollout resume deployment/orchestrator -n octollm
kubectl rollout undo deployment/orchestrator -n octollm
# Restore replica count and wait for rollback to complete
kubectl scale deployment/orchestrator -n octollm --replicas=3
kubectl rollout status deployment/orchestrator -n octollm
- Verify Services (10 minutes)
# Run health checks
./health-check.sh
# Monitor metrics
kubectl top pods -n octollm
# Check user-facing functionality
./smoke-tests.sh
- Investigate Failure (1 hour)
# Collect logs
kubectl logs -l version=failed -n octollm > /tmp/failed-deployment.log
# Analyze errors
./analyze-deployment-failure.sh /tmp/failed-deployment.log
# Identify root cause
./root-cause-analysis.sh
- Fix and Retry (2 hours)
# Fix issues
git commit -m "Fix deployment issue: ..."
# Build new version
docker build -t octollm/orchestrator:v1.2.1-fixed .
docker push octollm/orchestrator:v1.2.1-fixed
# Deploy with canary
./canary-deploy.sh orchestrator v1.2.1-fixed
Total RTO: 30 minutes
Total RPO: 0 (no data loss)
Network Partition
Scenario: Network failure causes cluster split-brain.
Detection:
- Nodes reporting as Not Ready
- Services unreachable from some nodes
- Inconsistent data reads
- Replication lag increasing
Response Procedure:
- Identify Partition (10 minutes)
# Check node connectivity
kubectl get nodes
# Check pod distribution
kubectl get pods -n octollm -o wide
# Test inter-node connectivity
./test-network-connectivity.sh
- Determine Primary Partition (5 minutes)
# Identify partition with majority of nodes (--no-headers excludes the header row)
TOTAL_NODES=$(kubectl get nodes --no-headers | wc -l)
HEALTHY_NODES=$(kubectl get nodes --no-headers | grep -c " Ready ")
# Primary partition should have >50% of nodes
if [ ${HEALTHY_NODES} -gt $((TOTAL_NODES / 2)) ]; then
echo "Primary partition identified"
fi
- Cordon Unreachable Nodes (5 minutes)
# Prevent scheduling on partitioned nodes
kubectl cordon <node-name>
# Drain workloads from partitioned nodes
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
- Force Reschedule (10 minutes)
# Delete pods on partitioned nodes
kubectl delete pods -n octollm --field-selector spec.nodeName=<partitioned-node>
# Wait for rescheduling on healthy nodes
kubectl wait --for=condition=ready pod -l app=orchestrator -n octollm --timeout=300s
- Verify Data Consistency (15 minutes)
# Check PostgreSQL replication status
kubectl exec -n octollm postgresql-0 -- psql -U postgres -c "\
SELECT client_addr, state, sync_state, replay_lag FROM pg_stat_replication;"
# Run consistency checks
./verify-data-consistency.sh
- Restore Network (varies)
# Work with infrastructure team to restore connectivity
# Once restored, uncordon nodes
kubectl uncordon <node-name>
# Verify cluster health
kubectl get nodes
kubectl get pods -n octollm
Total RTO: 1 hour (depending on network restoration)
Total RPO: 5 minutes (replication lag)
Data Center Failure
Scenario: Entire data center becomes unavailable.
Detection:
- All services in availability zone down
- Physical infrastructure alerts
- Cloud provider notifications
- Complete loss of connectivity to AZ
Response Procedure:
- Confirm Scope (5 minutes)
# Check affected availability zones
kubectl get nodes -o wide
# Identify pods in affected AZ
kubectl get pods -n octollm -o wide | grep <affected-az>
- Failover to Other AZs (15 minutes)
# Cordon nodes in affected AZ
kubectl cordon -l topology.kubernetes.io/zone=<affected-az>
# Delete pods in affected AZ (force reschedule)
kubectl delete pods -n octollm --field-selector spec.nodeName=<node-in-affected-az>
# Scale up in healthy AZs
kubectl scale deployment orchestrator -n octollm --replicas=5
- Verify Redundancy (10 minutes)
# Check pod distribution
kubectl get pods -n octollm -o wide | awk '{print $7}' | sort | uniq -c
# Ensure no single point of failure
./verify-multi-az-distribution.sh
- Monitor Performance (30 minutes)
# Check resource usage in remaining AZs
kubectl top nodes
# Monitor queue depths
./monitor-queue-depths.sh
# Scale if necessary
./autoscale-if-needed.sh
- Data Store Failover (1 hour)
# Promote PostgreSQL replica in healthy AZ
kubectl exec -n octollm postgresql-1 -- psql -U postgres -c "SELECT pg_promote();"
# Update connection strings
./update-postgres-connection.sh postgresql-1
# Verify data integrity
./verify-data-integrity.sh
- Long-term Mitigation (varies)
# Wait for data center restoration, or permanently shift capacity to other AZs
./rebalance-cluster.sh
Total RTO: 2 hours
Total RPO: 5 minutes (if replication was working)
Backup Automation
Automated Backup Jobs
All backup jobs run automatically on schedules:
| Component | Schedule | Retention | Storage Class |
|---|---|---|---|
| PostgreSQL Full | Daily (2 AM) | 30 days | STANDARD_IA → GLACIER |
| PostgreSQL WAL | Continuous | 7 days | STANDARD |
| Qdrant Snapshots | Every 6 hours | 14 days | STANDARD_IA |
| Redis RDB | Daily (3 AM) | 7 days | STANDARD_IA |
| Kubernetes Configs | Daily (1 AM) | 30 days | STANDARD_IA |
| Velero Cluster | Daily (1 AM) | 30 days | STANDARD |
Backup Verification
Automated verification ensures backups are restorable:
import boto3
from datetime import datetime, timedelta
import structlog
logger = structlog.get_logger()
class BackupVerifier:
"""Verify backup integrity and completeness."""
def __init__(self, s3_bucket: str):
self.s3_client = boto3.client('s3')
self.s3_bucket = s3_bucket
def verify_all_backups(self) -> dict:
"""Run verification checks on all backup types."""
results = {
"timestamp": datetime.utcnow().isoformat(),
"postgresql": self.verify_postgresql_backups(),
"qdrant": self.verify_qdrant_backups(),
"redis": self.verify_redis_backups(),
"k8s_configs": self.verify_k8s_config_backups(),
"overall_status": "unknown"
}
# Determine overall status
statuses = [v["status"] for v in results.values() if isinstance(v, dict) and "status" in v]
if all(s == "healthy" for s in statuses):
results["overall_status"] = "healthy"
elif any(s == "critical" for s in statuses):
results["overall_status"] = "critical"
else:
results["overall_status"] = "warning"
return results
def verify_postgresql_backups(self) -> dict:
"""Verify PostgreSQL backup health."""
try:
# List recent backups
response = self.s3_client.list_objects_v2(
Bucket=self.s3_bucket,
Prefix='postgresql/',
MaxKeys=10
)
if 'Contents' not in response or len(response['Contents']) == 0:
return {
"status": "critical",
"message": "No PostgreSQL backups found",
"last_backup": None
}
# Get latest backup
latest = sorted(response['Contents'], key=lambda x: x['LastModified'], reverse=True)[0]
backup_age = datetime.now(latest['LastModified'].tzinfo) - latest['LastModified']
size_mb = latest['Size'] / (1024 * 1024)
# Check if backup is recent (within 25 hours for daily backup)
if backup_age > timedelta(hours=25):
status = "critical"
message = f"Latest backup is {backup_age.days} days old"
elif size_mb < 1:
status = "critical"
message = f"Latest backup is too small: {size_mb:.2f} MB"
else:
status = "healthy"
message = "PostgreSQL backups are current"
# Check WAL archives
wal_response = self.s3_client.list_objects_v2(
Bucket=self.s3_bucket,
Prefix='wal/',
MaxKeys=10
)
wal_status = "healthy" if 'Contents' in wal_response else "warning"
return {
"status": status,
"message": message,
"last_backup": latest['LastModified'].isoformat(),
"backup_age_hours": backup_age.total_seconds() / 3600,
"backup_size_mb": size_mb,
"wal_status": wal_status,
"backup_count": len(response['Contents'])
}
except Exception as e:
logger.error("postgresql_backup_verification_failed", error=str(e))
return {
"status": "critical",
"message": f"Verification failed: {str(e)}"
}
def verify_qdrant_backups(self) -> dict:
"""Verify Qdrant snapshot backups."""
try:
response = self.s3_client.list_objects_v2(
Bucket=self.s3_bucket,
Prefix='qdrant/',
MaxKeys=50
)
if 'Contents' not in response:
return {
"status": "critical",
"message": "No Qdrant backups found"
}
# Group by collection
collections = {}
for obj in response['Contents']:
parts = obj['Key'].split('/')
if len(parts) >= 2:
collection = parts[1]
if collection not in collections:
collections[collection] = []
collections[collection].append(obj)
# Check each collection
issues = []
for collection, backups in collections.items():
latest = max(backups, key=lambda x: x['LastModified'])
backup_age = datetime.now(latest['LastModified'].tzinfo) - latest['LastModified']
if backup_age > timedelta(hours=7): # 6-hour schedule + 1 hour buffer
issues.append(f"{collection}: {backup_age.total_seconds() / 3600:.1f}h old")
if issues:
return {
"status": "warning",
"message": "Some collections have stale backups",
"issues": issues,
"collections": len(collections)
}
else:
return {
"status": "healthy",
"message": "All Qdrant collections backed up",
"collections": len(collections)
}
except Exception as e:
logger.error("qdrant_backup_verification_failed", error=str(e))
return {
"status": "critical",
"message": f"Verification failed: {str(e)}"
}
def verify_redis_backups(self) -> dict:
"""Verify Redis backup health."""
try:
response = self.s3_client.list_objects_v2(
Bucket=self.s3_bucket,
Prefix='redis/',
MaxKeys=10
)
if 'Contents' not in response:
return {
"status": "warning",
"message": "No Redis backups found (cache is ephemeral)"
}
latest = sorted(response['Contents'], key=lambda x: x['LastModified'], reverse=True)[0]
backup_age = datetime.now(latest['LastModified'].tzinfo) - latest['LastModified']
if backup_age > timedelta(hours=25):
status = "warning"
message = f"Redis backup is {backup_age.days} days old"
else:
status = "healthy"
message = "Redis backups are current"
return {
"status": status,
"message": message,
"last_backup": latest['LastModified'].isoformat()
}
except Exception as e:
logger.error("redis_backup_verification_failed", error=str(e))
return {
"status": "warning",
"message": f"Verification failed: {str(e)}"
}
def verify_k8s_config_backups(self) -> dict:
"""Verify Kubernetes configuration backups."""
try:
response = self.s3_client.list_objects_v2(
Bucket=self.s3_bucket,
Prefix='k8s-configs/',
MaxKeys=10
)
if 'Contents' not in response:
return {
"status": "critical",
"message": "No K8s config backups found"
}
latest = sorted(response['Contents'], key=lambda x: x['LastModified'], reverse=True)[0]
backup_age = datetime.now(latest['LastModified'].tzinfo) - latest['LastModified']
if backup_age > timedelta(hours=25):
status = "warning"
message = f"Config backup is {backup_age.days} days old"
else:
status = "healthy"
message = "K8s config backups are current"
return {
"status": status,
"message": message,
"last_backup": latest['LastModified'].isoformat()
}
except Exception as e:
logger.error("k8s_backup_verification_failed", error=str(e))
return {
"status": "critical",
"message": f"Verification failed: {str(e)}"
}
# Run daily verification
# verifier = BackupVerifier(s3_bucket='octollm-backups')
# results = verifier.verify_all_backups()
#
# if results['overall_status'] == 'critical':
# send_alert("CRITICAL: Backup verification failed", results)
# elif results['overall_status'] == 'warning':
# send_alert("WARNING: Backup issues detected", results)
Retention Policies
Automated retention management with lifecycle policies:
{
"Rules": [
{
"Id": "PostgreSQL-Full-Backup-Lifecycle",
"Status": "Enabled",
"Filter": {
"Prefix": "postgresql/"
},
"Transitions": [
{
"Days": 7,
"StorageClass": "STANDARD_IA"
},
{
"Days": 30,
"StorageClass": "GLACIER_IR"
},
{
"Days": 90,
"StorageClass": "DEEP_ARCHIVE"
}
],
"Expiration": {
"Days": 365
},
"NoncurrentVersionExpiration": {
"NoncurrentDays": 30
}
},
{
"Id": "WAL-Archive-Lifecycle",
"Status": "Enabled",
"Filter": {
"Prefix": "wal/"
},
"Expiration": {
"Days": 7
}
},
{
"Id": "Qdrant-Snapshot-Lifecycle",
"Status": "Enabled",
"Filter": {
"Prefix": "qdrant/"
},
"Transitions": [
{
"Days": 7,
"StorageClass": "STANDARD_IA"
}
],
"Expiration": {
"Days": 14
}
},
{
"Id": "Redis-Backup-Lifecycle",
"Status": "Enabled",
"Filter": {
"Prefix": "redis/"
},
"Transitions": [
{
"Days": 3,
"StorageClass": "STANDARD_IA"
}
],
"Expiration": {
"Days": 7
}
}
]
}
Monitoring and Alerting
Comprehensive monitoring of backup health:
# Prometheus alerting rules (evaluated by Prometheus; notifications route through Alertmanager)
---
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-backup-alerts
namespace: monitoring
data:
backup-alerts.yml: |
groups:
- name: backup_alerts
interval: 5m
rules:
# PostgreSQL backup age
- alert: PostgreSQLBackupStale
expr: octollm_postgresql_backup_age_hours > 25
for: 1h
labels:
severity: critical
component: postgresql
annotations:
summary: "PostgreSQL backup is stale"
description: "Last PostgreSQL backup is {{ $value }} hours old (threshold: 25h)"
# PostgreSQL backup size
- alert: PostgreSQLBackupTooSmall
expr: octollm_postgresql_backup_size_mb < 1
for: 5m
labels:
severity: critical
component: postgresql
annotations:
summary: "PostgreSQL backup suspiciously small"
description: "Latest backup is only {{ $value }} MB"
# Backup failures
- alert: BackupFailureRate
expr: rate(octollm_postgresql_backup_failures_total[1h]) > 0.1
for: 5m
labels:
severity: warning
component: backup
annotations:
summary: "High backup failure rate"
description: "Backup failure rate is {{ $value }}/hour"
# Qdrant backup missing
- alert: QdrantBackupMissing
expr: time() - octollm_qdrant_last_backup_timestamp > 25200 # 7 hours
for: 1h
labels:
severity: warning
component: qdrant
annotations:
summary: "Qdrant backup is missing"
description: "No Qdrant backup in last 7 hours"
# Velero backup failures
- alert: VeleroBackupFailed
expr: velero_backup_failure_total > 0
for: 5m
labels:
severity: critical
component: velero
annotations:
summary: "Velero backup failed"
description: "Velero backup has failed {{ $value }} times"