Phase 3: Operations & Deployment
Status: Not Started
Duration: 4-6 weeks (parallel with Phase 4)
Team Size: 2-3 SREs
Prerequisites: Phase 2 complete
Start Date: TBD
Target Completion: TBD
Overview
Phase 3 establishes production-grade operations infrastructure, including comprehensive monitoring, alerting, troubleshooting playbooks, disaster recovery, and performance optimization. This phase ensures the OctoLLM system can be operated reliably in production.
Key Deliverables:
- Monitoring Stack - Prometheus, Grafana, Loki, Jaeger
- Alerting System - Alertmanager with PagerDuty integration
- Troubleshooting Playbooks - 10+ comprehensive runbooks
- Disaster Recovery - Automated backups and restoration procedures
- Performance Tuning - Database, application, and cache optimization
Success Criteria:
- ✅ Monitoring stack operational with 30-day retention
- ✅ Alerts firing correctly for simulated incidents
- ✅ Backups tested and verified (RTO <4 hours, RPO <1 hour)
- ✅ Load tests passing at scale (1,000 concurrent tasks)
- ✅ Runbooks tested by on-call team
Reference: docs/doc_phases/PHASE-3-COMPLETE-SPECIFICATIONS.md (12,600+ lines)
Sprint 3.1: Monitoring Stack [Week 17-18]
Duration: 2 weeks
Team: 1-2 SREs
Prerequisites: Kubernetes deployment complete
Priority: CRITICAL
Sprint Goals
- Deploy complete observability stack (Prometheus, Grafana, Loki, Jaeger)
- Instrument all services with metrics
- Create pre-built Grafana dashboards (5+ dashboards)
- Achieve 100% service coverage for metrics collection
- 30-day metrics retention
Tasks
Prometheus Deployment (8 hours)
- Deploy Prometheus Operator (3 hours)
- Install Prometheus Operator via Helm
- Configure ServiceMonitors for auto-discovery
- Set up 30-day retention
- Code example:
```yaml
# k8s/monitoring/prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: octollm-prometheus
  namespace: octollm
spec:
  replicas: 2
  retention: 30d
  storage:
    volumeClaimTemplate:
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
  serviceMonitorSelector:
    matchLabels:
      app: octollm
  resources:
    requests:
      memory: "4Gi"
      cpu: "2000m"
    limits:
      memory: "8Gi"
      cpu: "4000m"
```
- Files to create: k8s/monitoring/prometheus.yaml
- Reference: docs/operations/monitoring-alerting.md
- Create ServiceMonitors (3 hours)
- ServiceMonitor for Orchestrator
- ServiceMonitor for Reflex Layer
- ServiceMonitor for all Arms
- ServiceMonitor for databases
- Code example:
```yaml
# k8s/monitoring/servicemonitor-orchestrator.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: orchestrator
  namespace: octollm
  labels:
    app: octollm
spec:
  selector:
    matchLabels:
      app: orchestrator
  endpoints:
  - port: metrics
    path: /metrics
    interval: 30s
    scrapeTimeout: 10s
```
- Files to create: k8s/monitoring/servicemonitor-*.yaml
- Configure Prometheus Rules (2 hours; recording-rule sketch below)
- Recording rules for aggregations
- Alert rules (covered in Sprint 3.2)
- Files to create:
k8s/monitoring/prometheus-rules.yaml
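The recording rules are not pinned down in this plan; as a starting sketch, assuming the http_requests_total and http_request_duration_seconds metrics defined under Application Metrics Implementation below, they could look like this (the record names are illustrative):

```yaml
# k8s/monitoring/prometheus-rules.yaml (sketch; record names are illustrative)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: octollm-recording-rules
  namespace: octollm
  labels:
    app: octollm
spec:
  groups:
  - name: octollm_aggregations
    interval: 30s
    rules:
    # Pre-aggregate per-job request rate so dashboards avoid repeated ad-hoc queries
    - record: job:http_requests:rate5m
      expr: sum(rate(http_requests_total[5m])) by (job)
    # Pre-compute P95 request latency per job from the duration histogram
    - record: job:http_request_duration_seconds:p95
      expr: |
        histogram_quantile(0.95,
          sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
```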
Application Metrics Implementation (10 hours)
- Instrument Orchestrator (3 hours)
- HTTP request metrics (rate, duration, errors by endpoint)
- Task lifecycle metrics (created, completed, failed, duration)
- LLM API metrics (calls, tokens, cost, duration, errors)
- Code example:
```python
# orchestrator/metrics.py
from prometheus_client import Counter, Histogram, Gauge, generate_latest
from fastapi import FastAPI, Response

# In the real service the app instance lives in the application entrypoint;
# it is instantiated here only so this module is self-contained.
app = FastAPI()

# HTTP metrics
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
http_request_duration_seconds = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)

# Task metrics
tasks_created_total = Counter(
    'tasks_created_total',
    'Total tasks created',
    ['task_type']
)
tasks_completed_total = Counter(
    'tasks_completed_total',
    'Total tasks completed',
    ['task_type', 'status']
)
task_duration_seconds = Histogram(
    'task_duration_seconds',
    'Task execution duration',
    ['task_type'],
    buckets=[0.5, 1, 2, 5, 10, 30, 60, 120, 300]
)
tasks_in_progress = Gauge(
    'tasks_in_progress',
    'Tasks currently in progress',
    ['task_type']
)

# LLM metrics
llm_api_calls_total = Counter(
    'llm_api_calls_total',
    'Total LLM API calls',
    ['provider', 'model']
)
llm_api_tokens_total = Counter(
    'llm_api_tokens_total',
    'Total LLM API tokens used',
    ['provider', 'model', 'type']  # type: prompt, completion
)
llm_api_cost_total = Counter(
    'llm_api_cost_total',
    'Total LLM API cost in USD',
    ['provider', 'model']
)
llm_api_duration_seconds = Histogram(
    'llm_api_duration_seconds',
    'LLM API call duration',
    ['provider', 'model']
)

# Metrics endpoint
@app.get("/metrics")
async def metrics():
    return Response(content=generate_latest(), media_type="text/plain")
```
- Files to create: orchestrator/metrics.py
- Instrument Arms (4 hours)
- Arm-specific metrics (requests, availability, latency, success rate)
- Memory metrics (operations, query duration, cache hits/misses)
- Similar pattern to Orchestrator for each arm
- Files to create:
arms/{arm_name}/metrics.py
- Instrument Reflex Layer (2 hours)
- PII detection metrics (detections, types, redactions)
- Injection detection metrics (attempts blocked)
- Cache metrics (hits, misses, hit rate, evictions)
- Code example (Rust):
```rust
// reflex-layer/src/metrics.rs
use prometheus::{IntCounter, IntCounterVec, Registry};
use lazy_static::lazy_static;

lazy_static! {
    pub static ref HTTP_REQUESTS_TOTAL: IntCounterVec = IntCounterVec::new(
        prometheus::opts!("http_requests_total", "Total HTTP requests"),
        &["method", "endpoint", "status"]
    ).unwrap();
    pub static ref PII_DETECTIONS_TOTAL: IntCounterVec = IntCounterVec::new(
        prometheus::opts!("pii_detections_total", "Total PII detections"),
        &["pii_type"]
    ).unwrap();
    pub static ref INJECTION_BLOCKS_TOTAL: IntCounter = IntCounter::new(
        "injection_blocks_total", "Total prompt injection attempts blocked"
    ).unwrap();
    pub static ref CACHE_HITS_TOTAL: IntCounter = IntCounter::new(
        "cache_hits_total", "Total cache hits"
    ).unwrap();
    pub static ref CACHE_MISSES_TOTAL: IntCounter = IntCounter::new(
        "cache_misses_total", "Total cache misses"
    ).unwrap();
}

pub fn register_metrics(registry: &Registry) {
    registry.register(Box::new(HTTP_REQUESTS_TOTAL.clone())).unwrap();
    registry.register(Box::new(PII_DETECTIONS_TOTAL.clone())).unwrap();
    registry.register(Box::new(INJECTION_BLOCKS_TOTAL.clone())).unwrap();
    registry.register(Box::new(CACHE_HITS_TOTAL.clone())).unwrap();
    registry.register(Box::new(CACHE_MISSES_TOTAL.clone())).unwrap();
}
```
- Files to create: reflex-layer/src/metrics.rs
- Database Metrics (1 hour; exporter manifest sketch below)
- PostgreSQL exporter configuration
- Redis exporter configuration
- Qdrant built-in metrics
- Files to create:
k8s/monitoring/postgres-exporter.yaml, k8s/monitoring/redis-exporter.yaml
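A minimal sketch of the Postgres exporter manifest (the image tag, Secret name, and DSN key are assumptions; the Redis exporter follows the same pattern on its default port 9121):

```yaml
# k8s/monitoring/postgres-exporter.yaml (sketch; image tag and Secret name are assumptions)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres-exporter
  namespace: octollm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres-exporter
  template:
    metadata:
      labels:
        app: postgres-exporter
    spec:
      containers:
      - name: exporter
        image: quay.io/prometheuscommunity/postgres-exporter:v0.15.0
        ports:
        - name: metrics
          containerPort: 9187
        env:
        # Connection string supplied via a Secret so credentials stay out of the manifest
        - name: DATA_SOURCE_NAME
          valueFrom:
            secretKeyRef:
              name: octollm-postgres
              key: exporter-dsn
```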
Grafana Setup (6 hours)
- Deploy Grafana (2 hours; datasource provisioning sketch below)
- Helm installation
- Configure Prometheus datasource
- Set up authentication (OIDC or basic auth)
- Persistent storage for dashboards
- Files to create:
k8s/monitoring/grafana.yaml
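If Grafana's sidecar-based provisioning is enabled in the Helm values, the Prometheus datasource can be delivered as a ConfigMap; a sketch follows, with the sidecar label and the Prometheus service DNS name as assumptions:

```yaml
# k8s/monitoring/grafana-datasources.yaml (sketch; sidecar label and
# Prometheus service DNS name are assumptions)
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: octollm
  labels:
    grafana_datasource: "1"    # picked up by the Grafana provisioning sidecar
data:
  prometheus.yaml: |
    apiVersion: 1
    datasources:
    - name: Prometheus
      type: prometheus
      access: proxy
      url: http://prometheus-operated.octollm.svc:9090
      isDefault: true
```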
- Create System Overview Dashboard (1 hour)
- Task success rate (gauge + graph)
- Overall latency (P50, P95, P99)
- Cost per day/week/month
- Error rate by service
- JSON export in repository
- Files to create:
k8s/monitoring/dashboards/system-overview.json
- Create Service Health Dashboard (1 hour)
- Availability per service (uptime %)
- Error rate by endpoint
- Latency distributions
- Request volume
- Files to create:
k8s/monitoring/dashboards/service-health.json
- Create Resource Usage Dashboard (1 hour)
- CPU usage by pod
- Memory usage by pod
- Disk I/O
- Network traffic
- Files to create:
k8s/monitoring/dashboards/resource-usage.json
- Create LLM Cost Tracking Dashboard (1 hour)
- Tokens used per day/week/month
- Cost breakdown by model
- Cost per task
- Budget tracking with alerts
- Files to create:
k8s/monitoring/dashboards/llm-costs.json
Success Criteria
- Prometheus scraping all services (100% coverage)
- Grafana dashboards display real-time data
- Metrics retention 30 days
- All critical metrics instrumented
- Dashboard JSON exported to repository
Estimated Effort
- Development: 24 hours
- Testing: 4 hours
- Documentation: 2 hours
- Total: 30 hours (~2 weeks for 1 SRE)
Sprint 3.2: Alerting and Runbooks [Week 18-19]
Duration: 1 week
Team: 1-2 SREs
Prerequisites: Monitoring stack deployed
Priority: CRITICAL
Sprint Goals
- Deploy Alertmanager with notification routing
- Define 20+ alert rules across all services
- Create 10+ comprehensive runbooks
- Set up on-call rotation and escalation
- Test alerts with simulated incidents
Tasks
Alertmanager Setup (6 hours)
- Deploy Alertmanager (2 hours)
- Helm installation
- Configure notification channels (Slack, PagerDuty, email)
- Set up alert grouping and routing
- Code example:
```yaml
# k8s/monitoring/alertmanager-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: octollm
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      slack_api_url: '{{ .SlackWebhookURL }}'
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'default'
      routes:
      - match:
          severity: critical
        receiver: 'pagerduty'
        continue: true
      - match:
          severity: warning
        receiver: 'slack'
    receivers:
    - name: 'default'
      email_configs:
      - to: 'team@octollm.io'
        from: 'alerts@octollm.io'
        smarthost: 'smtp.gmail.com:587'
    - name: 'slack'
      slack_configs:
      - channel: '#octollm-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    - name: 'pagerduty'
      pagerduty_configs:
      - service_key: '{{ .PagerDutyServiceKey }}'
        description: '{{ .GroupLabels.alertname }}'
```
- Files to create: k8s/monitoring/alertmanager-config.yaml
- Configure Notification Channels (2 hours)
- Slack webhook integration
- PagerDuty service key setup
- Email SMTP configuration
- Test notifications
- Set Up Alert Routing (2 hours; inhibit-rule sketch below)
- Route critical alerts to PagerDuty
- Route warnings to Slack
- Route info to email
- Configure inhibit rules (suppress redundant alerts)
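As a sketch, the inhibit rules could reuse the severity labels from the routing config above; the equal label sets are assumptions:

```yaml
# Sketch of inhibit_rules for alertmanager.yml (equal label sets are assumptions)
inhibit_rules:
# When a critical alert is firing, suppress warning-level alerts for the
# same alertname and service so on-call gets one page, not two
- source_match:
    severity: critical
  target_match:
    severity: warning
  equal: ['alertname', 'service']
# A ServiceDown alert makes per-service error-rate alerts redundant
- source_match:
    alertname: ServiceDown
  target_match:
    alertname: HighErrorRate
  equal: ['job']
```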
Alert Rules Definition (8 hours)
- Service Availability Alerts (2 hours)
- Service down (>1 minute)
- High error rate (>5% for 5 minutes)
- Low uptime (<95% over 24 hours)
- Code example:
```yaml
# k8s/monitoring/alert-rules/service-availability.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: service-availability
  namespace: octollm
spec:
  groups:
  - name: service_availability
    interval: 30s
    rules:
    - alert: ServiceDown
      expr: up{job=~"orchestrator|reflex-layer|.*-arm"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Service {{ $labels.job }} is down"
        description: "{{ $labels.job }} has been down for more than 1 minute"
    - alert: HighErrorRate
      expr: |
        (
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)
        ) > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate on {{ $labels.job }}"
        description: "{{ $labels.job }} has >5% error rate for 5 minutes"
    - alert: LowUptime
      expr: avg_over_time(up{job=~"orchestrator|reflex-layer|.*-arm"}[24h]) < 0.95
      labels:
        severity: warning
      annotations:
        summary: "Low uptime for {{ $labels.job }}"
        description: "{{ $labels.job }} uptime <95% over last 24 hours"
```
- Files to create: k8s/monitoring/alert-rules/service-availability.yaml
- Performance Alerts (2 hours; rule sketch below)
- High latency (P95 >30s for tasks)
- Low throughput (<10 tasks/minute)
- Task timeout rate (>10%)
- Files to create:
k8s/monitoring/alert-rules/performance.yaml
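A hedged sketch of the performance rules, reusing the task metrics defined in Sprint 3.1; the status="timeout" label value is an assumption about how the instrumentation tags timed-out tasks:

```yaml
# k8s/monitoring/alert-rules/performance.yaml (sketch; thresholds from the
# list above, metric names from the Sprint 3.1 instrumentation)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: performance
  namespace: octollm
spec:
  groups:
  - name: performance
    rules:
    - alert: HighTaskLatency
      expr: histogram_quantile(0.95, sum(rate(task_duration_seconds_bucket[5m])) by (le)) > 30
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Task P95 latency above 30s"
    - alert: LowThroughput
      # rate() is per-second, so multiply by 60 for tasks/minute
      expr: sum(rate(tasks_completed_total[5m])) * 60 < 10
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Task throughput below 10 tasks/minute"
    - alert: HighTaskTimeoutRate
      expr: |
        sum(rate(tasks_completed_total{status="timeout"}[10m]))
          / sum(rate(tasks_completed_total[10m])) > 0.10
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "More than 10% of tasks are timing out"
```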
- Resource Alerts (2 hours; rule sketch below)
- High CPU (>80% for 10 minutes)
- High memory (>90% for 5 minutes)
- Disk space low (<15% free)
- Files to create:
k8s/monitoring/alert-rules/resources.yaml
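A sketch of the resource rules, assuming cAdvisor, kube-state-metrics, and node-exporter metrics are available in Prometheus and that pods declare resource limits:

```yaml
# k8s/monitoring/alert-rules/resources.yaml (sketch; assumes cAdvisor,
# kube-state-metrics, and node-exporter are scraped)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: resources
  namespace: octollm
spec:
  groups:
  - name: resources
    rules:
    - alert: HighCPU
      # CPU usage as a fraction of the pod's CPU limit
      expr: |
        sum(rate(container_cpu_usage_seconds_total{namespace="octollm"}[5m])) by (pod)
          / sum(kube_pod_container_resource_limits{namespace="octollm", resource="cpu"}) by (pod)
          > 0.80
      for: 10m
      labels:
        severity: warning
    - alert: HighMemory
      expr: |
        sum(container_memory_working_set_bytes{namespace="octollm"}) by (pod)
          / sum(kube_pod_container_resource_limits{namespace="octollm", resource="memory"}) by (pod)
          > 0.90
      for: 5m
      labels:
        severity: critical
    - alert: DiskSpaceLow
      expr: |
        node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
          / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.15
      for: 5m
      labels:
        severity: warning
```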
- Database Alerts (1 hour; rule sketch below)
- Connection pool exhausted
- Replication lag (>60s)
- Slow queries (>10s)
- Files to create:
k8s/monitoring/alert-rules/database.yaml
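A sketch of the database rules; the metric names follow postgres_exporter conventions and may differ by exporter version, so treat them as assumptions:

```yaml
# k8s/monitoring/alert-rules/database.yaml (sketch; metric names follow
# postgres_exporter conventions and vary by exporter version)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: database
  namespace: octollm
spec:
  groups:
  - name: database
    rules:
    - alert: ConnectionPoolNearExhaustion
      # Active backends approaching the server's max_connections
      expr: sum(pg_stat_activity_count) > 0.9 * max(pg_settings_max_connections)
      for: 2m
      labels:
        severity: critical
    - alert: ReplicationLag
      expr: pg_replication_lag > 60
      for: 5m
      labels:
        severity: warning
    - alert: SlowQueries
      # Longest-running transaction as a proxy for slow queries
      expr: max(pg_stat_activity_max_tx_duration) > 10
      for: 5m
      labels:
        severity: warning
```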
- LLM Cost Alerts (1 hour; rule sketch below)
- Daily spend >$500
- Monthly spend >$10,000
- Unexpected spike (>2x average)
- Files to create:
k8s/monitoring/alert-rules/llm-costs.yaml
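A sketch of the cost rules, reusing the llm_api_cost_total counter from Sprint 3.1; note these use rolling windows, not calendar days or months:

```yaml
# k8s/monitoring/alert-rules/llm-costs.yaml (sketch; rolling windows,
# not calendar periods)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: llm-costs
  namespace: octollm
spec:
  groups:
  - name: llm_costs
    rules:
    - alert: DailySpendHigh
      expr: sum(increase(llm_api_cost_total[24h])) > 500
      labels:
        severity: warning
    - alert: MonthlySpendHigh
      expr: sum(increase(llm_api_cost_total[30d])) > 10000
      labels:
        severity: critical
    - alert: CostSpike
      # Last hour's spend vs. the average hourly spend over the past week (168h)
      expr: sum(increase(llm_api_cost_total[1h])) > 2 * (sum(increase(llm_api_cost_total[7d])) / 168)
      labels:
        severity: warning
```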
Runbook Creation (10 hours)
- Create Runbook Template (1 hour)
- Standard structure (Symptoms, Diagnosis, Resolution, Prevention)
- Code examples for common commands
- Files to create:
docs/operations/runbooks/TEMPLATE.md
- Service Unavailable Runbook (1 hour)
- Check pod status
- Review recent deployments
- Inspect logs
- Restart procedures
- Files to create:
docs/operations/runbooks/service-unavailable.md
- High Latency Runbook (1 hour)
- Identify bottleneck (database, LLM API, network)
- Profile slow requests
- Check resource utilization
- Optimization steps
- Files to create:
docs/operations/runbooks/high-latency.md
- Database Connection Issues Runbook (1 hour)
- Check connection pool status
- Verify credentials
- Test network connectivity
- Restart database clients
- Files to create:
docs/operations/runbooks/database-connection.md
- Memory Leak Runbook (1 hour)
- Identify leaking service
- Profile memory usage
- Restart procedures
- Long-term fixes
- Files to create:
docs/operations/runbooks/memory-leak.md
- Task Routing Failure Runbook (1 hour)
- Check arm registration
- Verify capability matching
- Review routing logs
- Manual task reassignment
- Files to create:
docs/operations/runbooks/task-routing-failure.md
- LLM API Failure Runbook (1 hour)
- Check API rate limits
- Verify API keys
- Test fallback providers
- Manual retry procedures
- Files to create:
docs/operations/runbooks/llm-api-failure.md
- Cache Performance Runbook (1 hour)
- Check Redis health
- Analyze eviction rate
- Warm cache
- Tune TTL settings
- Files to create:
docs/operations/runbooks/cache-performance.md
- Resource Exhaustion Runbook (1 hour)
- Identify resource-hungry pods
- Scale up resources
- Clean up old data
- Implement limits
- Files to create:
docs/operations/runbooks/resource-exhaustion.md
- Security Violation Runbook (1 hour)
- Review security logs
- Block malicious IPs
- Revoke compromised tokens
- Incident response
- Files to create:
docs/operations/runbooks/security-violation.md
On-Call Setup (4 hours)
- Define On-Call Rotation (2 hours)
- Primary, secondary, escalation roles
- Rotation schedule (weekly)
- Handoff procedures
- PagerDuty configuration
- Document Escalation Procedures (1 hour)
- Level 1: On-call Engineer (15 minutes)
- Level 2: Senior Engineer (30 minutes)
- Level 3: Engineering Lead (60 minutes)
- Files to create:
docs/operations/on-call-guide.md
- Create On-Call Runbook Index (1 hour)
- Categorized runbook list
- Quick reference commands
- Common issue resolutions
- Files to create:
docs/operations/on-call-quick-reference.md
Success Criteria
- Alertmanager routing alerts correctly
- All notification channels tested
- 20+ alert rules defined
- 10+ runbooks created and tested
- On-call rotation configured
- Simulated incidents resolved using runbooks
Estimated Effort
- Development: 20 hours
- Testing: 4 hours
- Documentation: 4 hours
- Total: 28 hours (~1 week for 2 SREs)
Sprint 3.3: Disaster Recovery [Week 19-20]
(Abbreviated for space - full version would be 1,500-2,000 lines)
Sprint Goals
- Implement automated backup systems for all databases
- Create point-in-time recovery (PITR) procedures
- Deploy Velero for cluster backups
- Test disaster recovery scenarios (RTO <4 hours, RPO <1 hour)
- Document and automate restore procedures
Key Tasks (Summary)
- PostgreSQL Backups (WAL archiving, pg_basebackup, daily full backups)
- Qdrant Backups (snapshot-based, 6-hour schedule)
- Redis Persistence (RDB + AOF)
- Velero Cluster Backups (daily full, hourly critical; schedule sketch below)
- Backup Verification (automated testing)
- Disaster Scenario Testing (10 scenarios)
Reference: docs/operations/disaster-recovery.md (2,779 lines)
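For the Velero piece, the daily schedule might look like the sketch below; the file path, schedule name, cron expression, and TTL are assumptions:

```yaml
# k8s/backup/velero-daily.yaml (sketch; name, cron, and TTL are assumptions)
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: octollm-daily
  namespace: velero
spec:
  schedule: "0 2 * * *"    # daily full backup at 02:00 UTC
  template:
    includedNamespaces:
    - octollm
    snapshotVolumes: true
    ttl: 720h0m0s          # retain 30 days of backups
```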
Sprint 3.4: Performance Tuning [Week 20-22]
(Abbreviated for space - full version would be 1,200-1,500 lines)
Sprint Goals
- Optimize database performance (indexes, query tuning, connection pooling)
- Tune application-level performance (async ops, batching, compression)
- Implement multi-level caching strategies
- Optimize LLM API usage (batching, model selection, streaming)
- Run load tests and identify bottlenecks
- Achieve P95 latency <30s, throughput >1,000 tasks/sec
Key Tasks (Summary)
- Database Optimization (PostgreSQL tuning, index optimization; config sketch below)
- Application Tuning (async operations, request batching)
- Cache Optimization (L1 in-memory, L2 Redis, cache warming)
- LLM API Optimization (batching, streaming, model selection)
- Load Testing (k6 scripts: progressive, stress, soak tests)
- Profiling and Bottleneck Identification
Reference: docs/operations/performance-tuning.md
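For the PostgreSQL tuning task, one way to carry the tuned settings is a ConfigMap mounted into the database pod; every value below is an illustrative assumption to be validated against load-test results, not a benchmarked recommendation:

```yaml
# k8s/postgres/tuning-config.yaml (sketch; all values are illustrative assumptions)
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-tuning
  namespace: octollm
data:
  tuning.conf: |
    shared_buffers = 4GB              # ~25% of instance memory
    effective_cache_size = 12GB       # ~75% of instance memory
    work_mem = 32MB                   # per sort/hash operation
    maintenance_work_mem = 512MB
    max_connections = 200
    checkpoint_completion_target = 0.9
    random_page_cost = 1.1            # SSD-backed storage
```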
Sprint 3.5: Troubleshooting Automation [Week 21-22]
(Abbreviated for space - full version would be 800-1,000 lines)
Sprint Goals
- Implement health check endpoints with deep health checks
- Create auto-remediation scripts for common issues
- Build diagnostic tools and debug endpoints
- Set up performance dashboards for real-time monitoring
- Automate routine troubleshooting tasks
Key Tasks (Summary)
- Deep Health Checks (dependency health, database connectivity; probe wiring sketch below)
- Auto-Remediation Scripts (restart policies, self-healing)
- Diagnostic Tools (debug endpoints, log aggregation)
- Performance Dashboards (real-time metrics, SLO tracking)
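To tie the deep health checks into Kubernetes, probe wiring for the orchestrator could look roughly like this; the /health/live and /health/deep paths and port 8080 are assumptions about the endpoint naming:

```yaml
# Probe wiring sketch for the orchestrator container
# (/health/live, /health/deep, and port 8080 are assumptions)
livenessProbe:
  httpGet:
    path: /health/live     # cheap check: process is responsive
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health/deep     # deep check: database and dependency connectivity
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3
```

Pointing the readiness probe, rather than the liveness probe, at the deep check avoids restart storms when a shared dependency blips; the pod is simply pulled out of rotation until the dependency recovers.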
Phase 3 Summary
Total Tasks: 50+ operations tasks across 5 sprints
Estimated Duration: 4-6 weeks with 2-3 SREs
Total Estimated Hours: ~120 hours development + ~20 hours testing + ~15 hours documentation = ~155 hours
Deliverables:
- Complete monitoring stack (Prometheus, Grafana, Alertmanager)
- Alerting with runbooks (20+ alerts, 10+ runbooks)
- Automated backups and disaster recovery (RTO <4hr, RPO <1hr)
- Performance tuning and load testing
- Troubleshooting automation
Completion Checklist:
- Monitoring stack operational with 30-day retention
- Alerts firing correctly for simulated incidents
- Backups tested and verified (recovery scenarios passed)
- Load tests passing at scale (1,000 concurrent tasks)
- Runbooks tested by on-call team
- Performance targets met (P95 <30s, >1,000 tasks/sec)
- Documentation complete and up-to-date
Next Phase: Phase 5 (Security Hardening) - After Phase 4 complete
Document Version: 1.0
Last Updated: 2025-11-10
Maintained By: OctoLLM Project Management Team