Alert Response Procedures
Document Version: 1.0.0 | Last Updated: 2025-11-12 | Owner: OctoLLM Operations Team | Status: Production
Table of Contents
- Overview
- Response Workflow
- Critical Alert Procedures
- Warning Alert Procedures
- Informational Alert Procedures
- Multi-Alert Scenarios
- Escalation Decision Trees
- Post-Incident Actions
Overview
This document provides step-by-step procedures for responding to alerts from the OctoLLM monitoring system. Each procedure includes:
- Detection: How the alert is triggered
- Impact: What this means for users and the system
- Investigation Steps: How to diagnose the issue
- Remediation Actions: How to fix the problem
- Escalation Criteria: When to involve senior engineers or management
Alert Severity Levels:
- Critical: Immediate action required, user-impacting, PagerDuty notification
- Warning: Action required within 1 hour, potential user impact, Slack notification
- Info: No immediate action required, informational only, logged to Slack
Response Time SLAs:
- Critical: Acknowledge within 5 minutes, resolve within 1 hour
- Warning: Acknowledge within 30 minutes, resolve within 4 hours
- Info: Review within 24 hours
Response Workflow
General Alert Response Process
1. ACKNOWLEDGE
└─> Acknowledge alert in PagerDuty/Slack
└─> Note start time in incident tracker
2. ASSESS
└─> Check alert details (service, namespace, severity)
└─> Review recent deployments or changes
└─> Check for related alerts
3. INVESTIGATE
└─> Follow specific alert procedure (see sections below)
└─> Gather logs, metrics, traces
└─> Identify root cause
4. REMEDIATE
└─> Apply fix (restart, scale, rollback, etc.)
└─> Verify fix with metrics/logs
└─> Monitor for 10-15 minutes
5. DOCUMENT
└─> Update incident tracker with resolution
└─> Create post-incident review if critical
└─> Update runbooks if new issue discovered
6. CLOSE
└─> Resolve alert in PagerDuty/Slack
└─> Confirm no related alerts remain
Tools Quick Reference
- Grafana: https://grafana.octollm.dev
- Prometheus: https://prometheus.octollm.dev
- Jaeger: https://jaeger.octollm.dev
- Alertmanager: https://alertmanager.octollm.dev
- kubectl: CLI access to Kubernetes cluster
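If an alert looks suspicious, it can help to confirm the monitoring stack itself is reachable before digging in. This is a minimal sketch assuming the default health paths for Grafana (/api/health), Prometheus (/-/healthy), and Alertmanager (/-/healthy) are exposed through the ingress; adjust if yours differ.
for url in \
  https://grafana.octollm.dev/api/health \
  https://prometheus.octollm.dev/-/healthy \
  https://alertmanager.octollm.dev/-/healthy; do
  # Print the HTTP status code for each endpoint (200 = healthy)
  printf '%s -> %s\n' "$url" "$(curl -s -o /dev/null -w '%{http_code}' "$url")"
done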
Critical Alert Procedures
1. PodCrashLoopBackOff
Alert Definition:
alert: PodCrashLoopBackOff
expr: rate(kube_pod_container_status_restarts_total{namespace=~"octollm.*"}[10m]) > 0.3
for: 5m
severity: critical
Impact: Service degradation or complete outage. Users may experience errors or timeouts.
Investigation Steps
Step 1: Identify the crashing pod
# List pods with high restart counts
kubectl get pods -n <namespace> --sort-by=.status.containerStatuses[0].restartCount
# Example output:
# NAME READY STATUS RESTARTS AGE
# orchestrator-7d9f8c-xk2p9 0/1 CrashLoopBackOff 12 30m
Step 2: Check pod logs
# Get recent logs from crashing container
kubectl logs -n <namespace> <pod-name> --tail=100
# Get logs from previous container instance
kubectl logs -n <namespace> <pod-name> --previous
# Common error patterns:
# - "Connection refused" → Dependency unavailable
# - "Out of memory" → Resource limits too low
# - "Panic: runtime error" → Code bug
# - "Permission denied" → RBAC or volume mount issue
Step 3: Check pod events
kubectl describe pod -n <namespace> <pod-name>
# Look for events like:
# - "Back-off restarting failed container"
# - "Error: ErrImagePull"
# - "FailedMount"
# - "OOMKilled"
Step 4: Check resource usage
# Check if pod is OOMKilled
kubectl get pod -n <namespace> <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Check resource requests/limits
kubectl get pod -n <namespace> <pod-name> -o jsonpath='{.spec.containers[0].resources}'
Step 5: Check configuration
# Verify environment variables
kubectl get pod -n <namespace> <pod-name> -o jsonpath='{.spec.containers[0].env}'
# Check ConfigMap/Secret mounts
kubectl describe configmap -n <namespace> <configmap-name>
kubectl describe secret -n <namespace> <secret-name>
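As an optional shortcut for Steps 1-4, a single jq pass over the pod list summarizes restart counts and last termination reasons for the whole namespace. The namespace value below is an example; this is a convenience sketch, not part of the standard procedure.
NAMESPACE=octollm-prod   # example namespace
kubectl get pods -n "$NAMESPACE" -o json | jq -r '
  .items[] |
  [ .metadata.name,
    (.status.containerStatuses[0].restartCount // 0),
    (.status.containerStatuses[0].lastState.terminated.reason // "-") ] | @tsv'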
Remediation Actions
If: Connection refused to dependency (DB, Redis, etc.)
# 1. Check if dependency service is healthy
kubectl get pods -n <namespace> -l app=<dependency>
# 2. Test connectivity from within cluster
kubectl run -it --rm debug --image=busybox --restart=Never -- sh
# Inside pod: nc -zv <service-name> <port>
# 3. Check service endpoints
kubectl get endpoints -n <namespace> <service-name>
# 4. If dependency is down, restart it first
kubectl rollout restart deployment/<dependency-name> -n <namespace>
# 5. Wait for dependency to be ready, then restart affected pod
kubectl delete pod -n <namespace> <pod-name>
If: Out of memory (OOMKilled)
# 1. Check current memory usage in Grafana
# Query: container_memory_usage_bytes{pod="<pod-name>"}
# 2. Increase memory limits
kubectl edit deployment -n <namespace> <deployment-name>
# Increase resources.limits.memory (e.g., from 512Mi to 1Gi)
# 3. Monitor memory usage after restart
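As an alternative to kubectl edit, the memory limit can be raised non-interactively with a JSON patch. This sketch assumes the container to change is the first one in the pod spec and that 1Gi is the desired new limit; adjust the index and value as needed.
kubectl patch deployment <deployment-name> -n <namespace> --type='json' -p='[
  {"op": "replace",
   "path": "/spec/template/spec/containers/0/resources/limits/memory",
   "value": "1Gi"}
]'
# (Use "op": "add" instead of "replace" if no memory limit is currently set)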
If: Image pull error
# 1. Check image name and tag
kubectl get pod -n <namespace> <pod-name> -o jsonpath='{.spec.containers[0].image}'
# 2. Verify image exists in registry
gcloud container images list --repository=gcr.io/<project-id>
# 3. Check image pull secrets
kubectl get secrets -n <namespace> | grep gcr
# 4. If image is wrong, update deployment
kubectl set image deployment/<deployment-name> <container-name>=<correct-image> -n <namespace>
If: Configuration error
# 1. Validate ConfigMap/Secret exists and has correct data
kubectl get configmap -n <namespace> <configmap-name> -o yaml
# 2. If config is wrong, update it
kubectl edit configmap -n <namespace> <configmap-name>
# 3. Restart pods to pick up new config
kubectl rollout restart deployment/<deployment-name> -n <namespace>
If: Code bug (panic, runtime error)
# 1. Check Jaeger for traces showing error
# Navigate to https://jaeger.octollm.dev
# Search for service: <service-name>, operation: <failing-operation>
# 2. Identify commit that introduced bug
kubectl get deployment -n <namespace> <deployment-name> -o jsonpath='{.spec.template.spec.containers[0].image}'
# 3. Rollback to previous version
kubectl rollout undo deployment/<deployment-name> -n <namespace>
# 4. Verify rollback
kubectl rollout status deployment/<deployment-name> -n <namespace>
# 5. Create incident ticket with logs/traces
# Subject: "CrashLoopBackOff in <service> due to <error>"
# Include: logs, traces, reproduction steps
If: Persistent volume mount failure
# 1. Check PVC status
kubectl get pvc -n <namespace>
# 2. Check PVC events
kubectl describe pvc -n <namespace> <pvc-name>
# 3. If PVC is pending, check storage class
kubectl get storageclass
# 4. If PVC is lost, restore from backup (see backup-restore.md)
Escalation Criteria
Escalate to Senior Engineer if:
- Root cause not identified within 15 minutes
- Multiple pods crashing across different services
- Rollback does not resolve the issue
- Data loss suspected
Escalate to Engineering Lead if:
- Critical service (orchestrator, reflex-layer) down for >30 minutes
- Root cause requires code fix (cannot be resolved via config/restart)
Escalate to VP Engineering if:
- Complete outage (all services down)
- Data corruption suspected
- Estimated resolution time >2 hours
2. NodeNotReady
Alert Definition:
alert: NodeNotReady
expr: kube_node_status_condition{condition="Ready",status="false"} == 1
for: 5m
severity: critical
Impact: Reduced cluster capacity. Pods on the node are evicted and rescheduled. Possible service degradation.
Investigation Steps
Step 1: Identify unhealthy node
# List all nodes with status
kubectl get nodes -o wide
# Example output:
# NAME STATUS ROLES AGE VERSION
# gke-cluster-pool-1-abc Ready <none> 10d v1.28.3
# gke-cluster-pool-1-def NotReady <none> 10d v1.28.3 ← Problem node
Step 2: Check node conditions
kubectl describe node <node-name>
# Look for conditions:
# - Ready: False
# - MemoryPressure: True
# - DiskPressure: True
# - PIDPressure: True
# - NetworkUnavailable: True
Step 3: Check node resource usage
# Check node metrics
kubectl top node <node-name>
# Query in Grafana:
# CPU: node_cpu_seconds_total{instance="<node-name>"}
# Memory: node_memory_MemAvailable_bytes{instance="<node-name>"}
# Disk: node_filesystem_avail_bytes{instance="<node-name>"}
Step 4: Check kubelet logs (if SSH access available)
# SSH to node (GKE nodes)
gcloud compute ssh <node-name> --zone=<zone>
# Check kubelet status
sudo systemctl status kubelet
# Check kubelet logs
sudo journalctl -u kubelet --since "30 minutes ago"
Step 5: Check pods on the node
# List pods running on the node
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>
# Check if critical pods are affected
kubectl get pods -n octollm-prod --field-selector spec.nodeName=<node-name>
Remediation Actions
If: Disk pressure (disk full)
# 1. Check disk usage on node
gcloud compute ssh <node-name> --zone=<zone> --command "df -h"
# 2. Identify large files/directories
gcloud compute ssh <node-name> --zone=<zone> --command "du -sh /var/lib/docker/containers/* | sort -rh | head -20"
# 3. Clean up old container logs
gcloud compute ssh <node-name> --zone=<zone> --command "sudo find /var/lib/docker/containers -name '*-json.log' -type f -mtime +7 -delete"
# 4. Clean up unused Docker images
gcloud compute ssh <node-name> --zone=<zone> --command "sudo docker system prune -a -f"
# 5. If still full, cordon and drain the node
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# 6. Delete and recreate node (GKE auto-repairs)
# Node will be automatically replaced by GKE
If: Memory pressure
# 1. Check memory usage
kubectl top node <node-name>
# 2. Identify memory-hungry pods
kubectl top pods --all-namespaces --field-selector spec.nodeName=<node-name> --sort-by=memory
# 3. Check if any pods have memory leaks
# Use Grafana to view memory trends over time
# Query: container_memory_usage_bytes{node="<node-name>"}
# 4. Evict non-critical pods to free memory
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force
# 5. Wait for pods to be rescheduled
kubectl get pods --all-namespaces -o wide | grep <node-name>
# 6. Uncordon node if memory stabilizes
kubectl uncordon <node-name>
# 7. If memory pressure persists, replace node
# Delete node and let GKE auto-repair create new one
If: Network unavailable
# 1. Check network connectivity from node
gcloud compute ssh <node-name> --zone=<zone> --command "ping -c 5 8.8.8.8"
# 2. Check CNI plugin status (GKE uses kubenet or Calico)
gcloud compute ssh <node-name> --zone=<zone> --command "sudo systemctl status kubenet"
# 3. Check for network plugin errors
gcloud compute ssh <node-name> --zone=<zone> --command "sudo journalctl -u kubenet --since '30 minutes ago'"
# 4. Restart network services (risky - only if node is already unusable)
gcloud compute ssh <node-name> --zone=<zone> --command "sudo systemctl restart kubenet"
# 5. If network issue persists, cordon and drain
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force
# 6. Delete node and let GKE replace it
gcloud compute instances delete <node-name> --zone=<zone>
If: Kubelet not responding
# 1. Check kubelet process
gcloud compute ssh <node-name> --zone=<zone> --command "sudo systemctl status kubelet"
# 2. Restart kubelet
gcloud compute ssh <node-name> --zone=<zone> --command "sudo systemctl restart kubelet"
# 3. Wait 2 minutes and check node status
kubectl get node <node-name>
# 4. If node returns to Ready, uncordon
kubectl uncordon <node-name>
# 5. If kubelet fails to start, check logs
gcloud compute ssh <node-name> --zone=<zone> --command "sudo journalctl -u kubelet -n 100"
# 6. If cannot resolve, cordon, drain, and delete node
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force
gcloud compute instances delete <node-name> --zone=<zone>
If: Hardware failure (rare in GKE)
# 1. Check for hardware errors in system logs
gcloud compute ssh <node-name> --zone=<zone> --command "dmesg | grep -i error"
# 2. Check for I/O errors
gcloud compute ssh <node-name> --zone=<zone> --command "dmesg | grep -i 'i/o error'"
# 3. Cordon and drain immediately
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force
# 4. Delete node - GKE will create replacement
gcloud compute instances delete <node-name> --zone=<zone>
# 5. Monitor new node creation
kubectl get nodes -w
Escalation Criteria
Escalate to Senior Engineer if:
- Multiple nodes NotReady simultaneously
- Node cannot be drained (pods stuck in terminating state)
- Network issues affecting entire node pool
Escalate to Engineering Lead if:
- >30% of nodes NotReady
- Node failure pattern suggests cluster-wide issue
- Auto-repair not creating replacement nodes
Escalate to VP Engineering + GCP Support if:
- Complete cluster failure (all nodes NotReady)
- GKE control plane unreachable
- Suspected GCP infrastructure issue
3. HighErrorRate
Alert Definition:
alert: HighErrorRate
expr: sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (service) (rate(http_requests_total[5m])) > 0.1
for: 5m
severity: critical
Impact: Users experiencing errors (500, 502, 503, 504). Service availability degraded.
Investigation Steps
Step 1: Identify affected service
# Check error rate in Grafana
# Dashboard: GKE Service Health
# Panel: "Error Rate (5xx) by Service"
# Identify which service has >10% error rate
Step 2: Check recent deployments
# List recent rollouts
kubectl rollout history deployment/<deployment-name> -n <namespace>
# Check when error rate started
# Compare with deployment timestamp in Grafana
Step 3: Analyze error patterns
# Query Loki for error logs
# LogQL: {namespace="<namespace>", service="<service>", level="error"} |= "5xx" | json
# Look for patterns:
# - Specific endpoints failing
# - Common error messages
# - Correlation with other services
Step 4: Check dependencies
# Check if errors are due to downstream dependencies
# Use Jaeger to trace requests
# Navigate to https://jaeger.octollm.dev
# Search for service: <service-name>
# Filter by error status: error=true
# Common dependency issues:
# - Database connection pool exhausted
# - Redis timeout
# - External API rate limiting
# - Inter-service timeout
Step 5: Check resource utilization
# Check if service is resource-constrained
kubectl top pods -n <namespace> -l app=<service>
# Query CPU/memory in Grafana:
# CPU: rate(container_cpu_usage_seconds_total{pod=~"<service>.*"}[5m])
# Memory: container_memory_usage_bytes{pod=~"<service>.*"}
Remediation Actions
If: Error rate increased after recent deployment
# 1. Verify deployment timing matches error spike
kubectl rollout history deployment/<deployment-name> -n <namespace>
# 2. Check logs from new pods
kubectl logs -n <namespace> -l app=<service> --tail=100 | grep -i error
# 3. Rollback to previous version
kubectl rollout undo deployment/<deployment-name> -n <namespace>
# 4. Monitor error rate after rollback
# Should decrease within 2-5 minutes
# 5. Verify rollback success
kubectl rollout status deployment/<deployment-name> -n <namespace>
# 6. Create incident ticket with error logs
# Block new deployment until issue is resolved
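To confirm the rollback from the terminal instead of Grafana, the same 5xx ratio the alert uses can be queried straight from the Prometheus HTTP API. This is a sketch that assumes the metric and label names from the alert definition above; substitute the real service label value.
PROM=https://prometheus.octollm.dev
QUERY='sum(rate(http_requests_total{service="<service>",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="<service>"}[5m]))'
# A value near 0 (or "no data" when there are no 5xx series) indicates recovery
curl -sG "$PROM/api/v1/query" --data-urlencode "query=$QUERY" \
  | jq -r '.data.result[0].value[1] // "no data"'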
If: Database connection pool exhausted
# 1. Verify in Grafana
# Query: db_pool_active_connections{service="<service>"} / db_pool_max_connections{service="<service>"}
# 2. Check for connection leaks
# Look for long-running queries in database
# PostgreSQL: SELECT * FROM pg_stat_activity WHERE state = 'active' AND query_start < NOW() - INTERVAL '5 minutes';
# 3. Restart service to clear connections
kubectl rollout restart deployment/<deployment-name> -n <namespace>
# 4. If issue persists, increase connection pool size
kubectl edit configmap -n <namespace> <service>-config
# Increase DB_POOL_SIZE (e.g., from 20 to 40)
# 5. Restart to apply new config
kubectl rollout restart deployment/<deployment-name> -n <namespace>
# 6. Monitor connection pool usage
# Should stay below 80% of max
If: Downstream service timeout
# 1. Identify failing dependency from Jaeger traces
# Look for spans with error=true and long duration
# 2. Check health of downstream service
kubectl get pods -n <namespace> -l app=<downstream-service>
# 3. Check latency of downstream service
# Grafana query: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="<downstream-service>"}[5m]))
# 4. If downstream is slow, scale it up
kubectl scale deployment/<downstream-service> -n <namespace> --replicas=<new-count>
# 5. Increase timeout in calling service (if downstream is legitimately slow)
kubectl edit configmap -n <namespace> <service>-config
# Increase timeout (e.g., from 5s to 10s)
# 6. Restart calling service
kubectl rollout restart deployment/<deployment-name> -n <namespace>
If: External API rate limiting
# 1. Verify in logs
kubectl logs -n <namespace> -l app=<service> | grep -i "rate limit\|429\|too many requests"
# 2. Check rate limit configuration
kubectl get configmap -n <namespace> <service>-config -o yaml | grep -i rate
# 3. Reduce request rate (add caching, implement backoff)
# Short-term: Reduce replica count to lower total requests
kubectl scale deployment/<deployment-name> -n <namespace> --replicas=<reduced-count>
# 4. Implement circuit breaker (code change required)
# Long-term fix: Add circuit breaker to prevent cascading failures
# 5. Contact external API provider for rate limit increase
# Document current usage and justification for higher limits
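For a quick manual check of whether the provider is still returning 429s, and as a sketch of the exponential-backoff behaviour the service itself should implement, a simple curl loop works. The URL below is a placeholder, not a real OctoLLM dependency.
URL="https://api.example.com/v1/endpoint"   # placeholder external API
status=0
for attempt in 1 2 3 4 5; do
  status=$(curl -s -o /dev/null -w '%{http_code}' "$URL")
  [ "$status" != "429" ] && break            # stop as soon as we are not rate limited
  sleep $((2 ** attempt))                     # back off 2s, 4s, 8s, 16s, 32s
done
echo "final status: $status after $attempt attempt(s)"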
If: Memory leak causing OOM errors
# 1. Identify memory trend in Grafana
# Query: container_memory_usage_bytes{pod=~"<service>.*"}
# Look for steady increase over time
# 2. Restart pods to free memory (temporary fix)
kubectl rollout restart deployment/<deployment-name> -n <namespace>
# 3. Increase memory limits (short-term mitigation)
kubectl edit deployment -n <namespace> <deployment-name>
# Increase resources.limits.memory
# 4. Enable heap profiling (if supported)
# Add profiling endpoint to service
# Analyze heap dumps to identify leak
# 5. Create high-priority bug ticket
# Attach memory graphs and profiling data
# Assign to owning team
Escalation Criteria
Escalate to Senior Engineer if:
- Error rate >20% for >10 minutes
- Rollback does not resolve issue
- Root cause unclear after 15 minutes of investigation
Escalate to Engineering Lead if:
- Error rate >50% (severe outage)
- Multiple services affected
- Estimated resolution time >1 hour
Escalate to VP Engineering if:
- Complete service outage (100% error rate)
- Customer-reported errors trending on social media
- Revenue-impacting outage
4. DatabaseConnectionPoolExhausted
Alert Definition:
alert: DatabaseConnectionPoolExhausted
expr: db_pool_active_connections / db_pool_max_connections > 0.95
for: 5m
severity: critical
Impact: Services unable to query database. Users experience errors or timeouts.
Investigation Steps
Step 1: Verify pool exhaustion
# Check current pool usage in Grafana
# Query: db_pool_active_connections{service="<service>"} / db_pool_max_connections{service="<service>"}
# Check which service is affected
# Multiple services may share the same database
Step 2: Check for long-running queries
# Connect to database
kubectl exec -it -n <namespace> <postgres-pod> -- psql -U octollm
# List active connections by service
SELECT application_name, COUNT(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY application_name;
# List long-running queries (>5 minutes)
SELECT pid, application_name, query_start, state, query
FROM pg_stat_activity
WHERE state = 'active'
AND query_start < NOW() - INTERVAL '5 minutes'
ORDER BY query_start;
Step 3: Check for connection leaks
# List idle connections
SELECT application_name, COUNT(*)
FROM pg_stat_activity
WHERE state = 'idle'
GROUP BY application_name;
# If idle count is very high for a service, there's likely a connection leak
# (Idle connections should be returned to pool)
Step 4: Check application logs for connection errors
# Query Loki
# LogQL: {namespace="<namespace>", service="<service>"} |= "connection" |~ "error|timeout|exhausted"
# Common error messages:
# - "unable to acquire connection from pool"
# - "connection pool timeout"
# - "too many clients already"
Step 5: Check database resource usage
# Check database CPU/memory
kubectl top pod -n <namespace> <postgres-pod>
# Check database metrics in Grafana
# CPU: rate(container_cpu_usage_seconds_total{pod="<postgres-pod>"}[5m])
# Memory: container_memory_usage_bytes{pod="<postgres-pod>"}
# Disk I/O: rate(container_fs_reads_bytes_total{pod="<postgres-pod>"}[5m])
Remediation Actions
If: Long-running queries blocking connections
# 1. Identify problematic queries
SELECT pid, application_name, query_start, query
FROM pg_stat_activity
WHERE state = 'active'
AND query_start < NOW() - INTERVAL '5 minutes';
# 2. Terminate long-running queries (careful!)
# Only terminate if you're sure it's safe
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE pid = <pid>;
# 3. Monitor connection pool recovery
# Check Grafana: pool usage should drop below 95%
# 4. Investigate why queries are slow
# Use EXPLAIN ANALYZE to check query plans
# Look for missing indexes or inefficient joins
# 5. Optimize slow queries (code change)
# Create ticket with slow query details
# Add indexes if needed
If: Connection leak in application
# 1. Identify service with high idle connection count
SELECT application_name, COUNT(*)
FROM pg_stat_activity
WHERE state = 'idle'
GROUP BY application_name;
# 2. Restart affected service to release connections
kubectl rollout restart deployment/<deployment-name> -n <namespace>
# 3. Monitor connection pool after restart
# Usage should drop significantly
# 4. Check application code for connection handling
# Ensure connections are properly closed in finally blocks
# Example (Python):
# conn = pool.get_connection()
# try:
#     ...  # use the connection
# finally:
#     conn.close()  # always return the connection to the pool
# 5. Implement connection timeout in pool config
# Add to service ConfigMap:
# DB_POOL_TIMEOUT: 30s
# DB_CONN_MAX_LIFETIME: 1h # Force connection recycling
If: Pool size too small for load
# 1. Check current pool configuration
kubectl get configmap -n <namespace> <service>-config -o yaml | grep DB_POOL
# 2. Calculate required pool size
# Formula: (avg concurrent requests) * (avg query time in seconds) * 1.5
# Example: 100 req/s * 0.1s * 1.5 = 15 connections
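# Worked example of the formula above (the request rate and query time are
# illustrative, not measured values):
awk 'BEGIN { req_per_s = 100; avg_query_s = 0.1;
             printf "suggested DB_POOL_SIZE: %d\n", int(req_per_s * avg_query_s * 1.5 + 0.999) }'
# Prints: suggested DB_POOL_SIZE: 15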
# 3. Increase pool size
kubectl edit configmap -n <namespace> <service>-config
# Update DB_POOL_SIZE (e.g., from 20 to 40)
# 4. Verify database can handle more connections
# PostgreSQL max_connections setting (typically 100-200)
kubectl exec -it -n <namespace> <postgres-pod> -- psql -U octollm -c "SHOW max_connections;"
# 5. If database max_connections is too low, increase it
# Edit PostgreSQL ConfigMap or StatefulSet
# Requires database restart
# 6. Restart service to use new pool size
kubectl rollout restart deployment/<deployment-name> -n <namespace>
# 7. Monitor pool usage
# Target: <80% utilization under normal load
If: Database is resource-constrained
# 1. Check database CPU/memory
kubectl top pod -n <namespace> <postgres-pod>
# 2. If database CPU >80%, check for expensive queries
# Connect to database
kubectl exec -it -n <namespace> <postgres-pod> -- psql -U octollm
# Find most expensive queries
SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
-- On PostgreSQL 13+, these columns are named total_exec_time and mean_exec_time
# 3. If database memory >90%, increase memory limits
kubectl edit statefulset -n <namespace> postgres
# Increase resources.limits.memory
# 4. If database disk I/O high, consider:
# - Adding indexes to reduce table scans
# - Increasing disk IOPS (resize persistent disk)
# - Enabling query result caching
# 5. Scale database vertically (larger instance)
# For managed databases (Cloud SQL), increase machine type
# For self-hosted, increase resource limits and restart
If: Too many services connecting to same database
# 1. Identify which services are using most connections
SELECT application_name, COUNT(*), MAX(query_start)
FROM pg_stat_activity
GROUP BY application_name
ORDER BY COUNT(*) DESC;
# 2. Implement connection pooling at database level
# Deploy PgBouncer between services and database
# PgBouncer multiplexes connections, reducing load on database
# 3. Configure PgBouncer
# pool_mode: transaction (recommended for multiplexing; PgBouncer's own default is session)
# max_client_conn: 1000 (much higher than database limit)
# default_pool_size: 20 (connections to actual database per pool)
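# Hedged sketch of a matching pgbouncer.ini with the settings above (host, dbname,
# and deployment method are illustrative; adapt to how PgBouncer runs in this cluster):
cat > pgbouncer.ini <<'EOF'
[databases]
octollm = host=postgres port=5432 dbname=octollm

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
EOF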
# 4. Update service connection strings to point to PgBouncer
kubectl edit configmap -n <namespace> <service>-config
# Change DB_HOST from postgres:5432 to pgbouncer:6432
# 5. Restart services
kubectl rollout restart deployment/<deployment-name> -n <namespace>
# 6. Monitor PgBouncer metrics
# Check connection multiplexing ratio
Escalation Criteria
Escalate to Senior Engineer if:
- Pool exhaustion persists after restarting services
- Cannot identify source of connection leak
- Database max_connections needs to be increased significantly
Escalate to Database Admin if:
- Database CPU/memory consistently >90%
- Slow queries cannot be optimized with indexes
- Need to implement replication or sharding
Escalate to Engineering Lead if:
- Database outage suspected
- Need to migrate to larger database instance
- Estimated resolution time >1 hour
5. HighLatency
Alert Definition:
alert: HighLatency
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1.0
for: 10m
severity: critical
Impact: Slow response times for users. Degraded user experience. Possible timeout errors.
Investigation Steps
Step 1: Identify affected service and endpoints
# Check latency by service in Grafana
# Dashboard: GKE Service Health
# Panel: "Request Latency (P50/P95/P99)"
# Identify which service has P95 >1s
# Check latency by endpoint
# Query: histogram_quantile(0.95, sum by (le, handler) (rate(http_request_duration_seconds_bucket{service="<service>"}[5m])))
Step 2: Check for recent changes
# List recent deployments
kubectl rollout history deployment/<deployment-name> -n <namespace>
# Check when latency increased
# Compare with deployment timestamp in Grafana
Step 3: Analyze slow requests with Jaeger
# Navigate to https://jaeger.octollm.dev
# 1. Search for service: <service-name>
# 2. Filter by min duration: >1s
# 3. Sort by longest duration
# 4. Click on slowest trace to see span breakdown
# Look for:
# - Which span is slowest (database query, external API call, internal processing)
# - Spans with errors
# - Multiple spans to same service (N+1 query problem)
Step 4: Check resource utilization
# Check if service is CPU-constrained
kubectl top pods -n <namespace> -l app=<service>
# Query CPU in Grafana:
# rate(container_cpu_usage_seconds_total{pod=~"<service>.*"}[5m])
# If CPU near limit, service may be throttled
Step 5: Check dependencies
# Check if downstream services are slow
# Use Jaeger to identify which dependency is slow
# Check database query performance
# Connect to database and check slow query log
# Check cache hit rate (Redis)
# Grafana query: redis_keyspace_hits_total / (redis_keyspace_hits_total + redis_keyspace_misses_total)
Remediation Actions
If: Slow database queries
# 1. Identify slow queries from Jaeger traces
# Look for database spans with duration >500ms
# 2. Connect to database and analyze query
kubectl exec -it -n <namespace> <postgres-pod> -- psql -U octollm
# 3. Use EXPLAIN ANALYZE to check query plan
EXPLAIN ANALYZE <slow-query>;
# 4. Look for sequential scans (bad - should use index)
# Look for "Seq Scan on <table>" in output
# 5. Create missing indexes
CREATE INDEX CONCURRENTLY idx_<table>_<column> ON <table>(<column>);
# CONCURRENTLY allows index creation without locking table
# 6. Monitor query performance after index creation
# Should see immediate improvement in latency
# 7. Update query to use index (if optimizer doesn't automatically)
# Sometimes need to rewrite query to use indexed columns
If: Low cache hit rate
# 1. Check cache hit rate in Grafana
# Query: redis_keyspace_hits_total / (redis_keyspace_hits_total + redis_keyspace_misses_total)
# Target: >80% hit rate
# 2. Check cache size
kubectl exec -it -n <namespace> <redis-pod> -- redis-cli INFO memory
# 3. If cache is too small, increase memory
kubectl edit statefulset -n <namespace> redis
# Increase resources.limits.memory
# 4. Check cache TTL settings
# If TTL too short, increase it
kubectl get configmap -n <namespace> <service>-config -o yaml | grep CACHE_TTL
# 5. Increase cache TTL
kubectl edit configmap -n <namespace> <service>-config
# CACHE_TTL: 600s → 1800s (10m → 30m)
# 6. Restart service to use new TTL
kubectl rollout restart deployment/<deployment-name> -n <namespace>
# 7. Consider implementing cache warming
# Pre-populate cache with frequently accessed data
If: CPU-constrained (throttled)
# 1. Check CPU usage in Grafana
# Query: rate(container_cpu_usage_seconds_total{pod=~"<service>.*"}[5m])
# Compare with CPU limit
# 2. If usage near limit, increase CPU allocation
kubectl edit deployment -n <namespace> <deployment-name>
# Increase resources.limits.cpu (e.g., from 500m to 1000m)
# 3. Monitor latency after change
# Should improve within 2-5 minutes
# 4. If latency persists, consider horizontal scaling
kubectl scale deployment/<deployment-name> -n <namespace> --replicas=<new-count>
# 5. Enable HPA for automatic scaling
kubectl autoscale deployment/<deployment-name> -n <namespace> \
--cpu-percent=70 \
--min=2 \
--max=10
If: External API slow
# 1. Identify slow external API from Jaeger
# Look for HTTP client spans with long duration
# 2. Check if external API has status page
# Navigate to status page (e.g., status.openai.com)
# 3. Implement timeout and circuit breaker
# Prevent one slow API from blocking all requests
# Example circuit breaker config:
# - Failure threshold: 50%
# - Timeout: 5s
# - Cool-down period: 30s
# 4. Add caching for external API responses
# Cache responses for 5-15 minutes if data doesn't change frequently
# 5. Implement fallback mechanism
# Return cached/default data if external API is slow
# Example: Use stale cache data if API timeout
# 6. Contact external API provider
# Request status update or escalation
If: N+1 query problem
# 1. Identify N+1 pattern in Jaeger
# Multiple sequential database queries in a loop
# Example: 1 query to get list + N queries to get details
# 2. Check application code
# Look for loops that execute queries
# Example (bad):
# users = fetch_users()
# for user in users:
# user.posts = fetch_posts(user.id) # N queries!
# 3. Implement eager loading / batch fetching
# Fetch all related data in one query
# Example (good):
# users = fetch_users_with_posts() # Single join query
# 4. Deploy fix and verify
# Check Jaeger - should see single query instead of N+1
# 5. Monitor latency improvement
# Should see significant reduction in P95/P99 latency
If: Latency increased after deployment
# 1. Verify timing correlation
kubectl rollout history deployment/<deployment-name> -n <namespace>
# 2. Check recent code changes
git log --oneline --since="2 hours ago"
# 3. Rollback deployment
kubectl rollout undo deployment/<deployment-name> -n <namespace>
# 4. Verify latency returns to normal
# Check Grafana - should improve within 5 minutes
# 5. Create incident ticket with details
# - Deployment that caused regression
# - Latency metrics before/after
# - Affected endpoints
# 6. Block deployment until fix is available
# Review code changes to identify performance regression
Escalation Criteria
Escalate to Senior Engineer if:
- Latency >2s (P95) for >15 minutes
- Root cause not identified within 20 minutes
- Rollback does not resolve issue
Escalate to Database Admin if:
- Database queries slow despite proper indexes
- Need to optimize database configuration
- Considering read replicas or sharding
Escalate to Engineering Lead if:
- Latency affecting multiple services
- Need architectural changes (caching layer, async processing)
- Customer complaints or revenue impact
6. CertificateExpiringInSevenDays
Alert Definition:
alert: CertificateExpiringInSevenDays
expr: (certmanager_certificate_expiration_timestamp_seconds - time()) < 604800
for: 1h
severity: critical
Impact: If certificate expires, users will see TLS errors and cannot access services via HTTPS.
Investigation Steps
Step 1: Identify expiring certificate
# List all certificates
kubectl get certificate --all-namespaces
# Check expiring certificates
kubectl get certificate --all-namespaces -o json | \
jq -r '.items[] | select(.status.notAfter != null) |
[.metadata.namespace, .metadata.name, .status.notAfter] | @tsv'
# Example output:
# octollm-monitoring grafana-tls-cert 2025-12-05T10:30:00Z
# octollm-prod api-tls-cert 2025-12-12T14:20:00Z
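Independently of cert-manager's view, you can check the certificate actually being served on the wire with openssl; the domain below is an example endpoint.
DOMAIN=grafana.octollm.dev   # example endpoint
# Show the notBefore/notAfter dates and subject of the live certificate
echo | openssl s_client -connect "$DOMAIN:443" -servername "$DOMAIN" 2>/dev/null \
  | openssl x509 -noout -dates -subject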
Step 2: Check certificate status
kubectl describe certificate -n <namespace> <cert-name>
# Look for:
# Status: Ready
# Renewal Time: (should be set)
# Events: Check for renewal attempts
Step 3: Check cert-manager logs
# Get cert-manager controller pod
kubectl get pods -n cert-manager
# Check logs for renewal attempts
kubectl logs -n cert-manager <cert-manager-pod> | grep <cert-name>
# Look for errors:
# - "rate limit exceeded" (Let's Encrypt)
# - "challenge failed" (DNS/HTTP validation failed)
# - "unable to connect to ACME server"
Step 4: Check ClusterIssuer status
# List ClusterIssuers
kubectl get clusterissuer
# Check issuer details
kubectl describe clusterissuer letsencrypt-prod
# Look for:
# Status: Ready
# ACME account registered: True
Step 5: Check DNS/Ingress for challenge
# For DNS-01 challenge (wildcard certs)
# Verify DNS provider credentials are valid
kubectl get secret -n cert-manager <dns-provider-secret>
# For HTTP-01 challenge
# Verify ingress is accessible
curl -I http://<domain>/.well-known/acme-challenge/test
Remediation Actions
If: Certificate not auto-renewing (cert-manager issue)
# 1. Check cert-manager is running
kubectl get pods -n cert-manager
# 2. If pods are not running, check for issues
kubectl describe pods -n cert-manager <cert-manager-pod>
# 3. Restart cert-manager if needed
kubectl rollout restart deployment -n cert-manager cert-manager
kubectl rollout restart deployment -n cert-manager cert-manager-webhook
kubectl rollout restart deployment -n cert-manager cert-manager-cainjector
# 4. Wait for cert-manager to be ready
kubectl wait --for=condition=ready pod -n cert-manager -l app=cert-manager --timeout=2m
# 5. Trigger manual renewal
kubectl delete certificaterequest -n <namespace> $(kubectl get certificaterequest -n <namespace> -o name)
# 6. Check renewal progress
kubectl describe certificate -n <namespace> <cert-name>
# 7. Monitor events for successful renewal
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | grep -i certificate
If: Let's Encrypt rate limit exceeded
# 1. Check error message in cert-manager logs
kubectl logs -n cert-manager <cert-manager-pod> | grep "rate limit"
# Error example: "too many certificates already issued for: octollm.dev"
# 2. Let's Encrypt limits:
# - 50 certificates per registered domain per week
# - 5 duplicate certificates per week
# 3. Wait for rate limit to reset (1 week)
# No immediate fix - must wait
# 4. Temporary workaround: Use staging issuer
kubectl edit certificate -n <namespace> <cert-name>
# Change issuerRef.name: letsencrypt-prod → letsencrypt-staging
# 5. Staging cert will be issued (browsers will show warning)
# Acceptable for dev/staging, not for prod
# 6. For prod: Request rate limit increase from Let's Encrypt
# Email: limit-increases@letsencrypt.org
# Provide: domain, business justification, expected cert volume
# 7. Long-term: Reduce cert renewals
# Use wildcard certificates to cover multiple subdomains
# Increase cert lifetime (Let's Encrypt is 90 days, cannot change)
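If consolidating onto a wildcard certificate, the cert-manager resource looks roughly like the sketch below. Names and namespace are illustrative, and wildcards require a DNS-01-capable issuer.
kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: octollm-wildcard        # illustrative name
  namespace: octollm-prod
spec:
  secretName: octollm-wildcard-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - "octollm.dev"
    - "*.octollm.dev"
EOF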
If: DNS challenge failing (DNS-01)
# 1. Check DNS provider credentials
kubectl get secret -n cert-manager <dns-provider-secret> -o yaml
# 2. Verify secret has correct keys
# For Google Cloud DNS:
# - key.json (service account key)
# For Cloudflare:
# - api-token
# 3. Test DNS provider access manually
# For Google Cloud DNS:
gcloud dns record-sets list --zone=<zone-name>
# For Cloudflare:
curl -X GET "https://api.cloudflare.com/client/v4/zones" \
-H "Authorization: Bearer <token>"
# 4. If credentials are invalid, update secret
kubectl delete secret -n cert-manager <dns-provider-secret>
kubectl create secret generic -n cert-manager <dns-provider-secret> \
--from-file=key.json=<path-to-new-key>
# 5. Restart cert-manager to pick up new credentials
kubectl rollout restart deployment -n cert-manager cert-manager
# 6. Trigger certificate renewal
kubectl delete certificaterequest -n <namespace> $(kubectl get certificaterequest -n <namespace> -o name)
# 7. Check certificate status
kubectl describe certificate -n <namespace> <cert-name>
If: HTTP challenge failing (HTTP-01)
# 1. Check if ingress is accessible
curl -I http://<domain>/.well-known/acme-challenge/test
# 2. Verify ingress controller is running
kubectl get pods -n ingress-nginx # or kube-system for GKE
# 3. Check if challenge path is reachable
kubectl get ingress -n <namespace>
# 4. Check ingress events
kubectl describe ingress -n <namespace> <ingress-name>
# 5. Verify DNS points to correct load balancer
nslookup <domain>
# Should resolve to ingress load balancer IP
# 6. Check firewall rules allow HTTP (port 80)
# Let's Encrypt requires HTTP for challenge, even for HTTPS certs
gcloud compute firewall-rules list --filter="name~'.*allow-http.*'"
# 7. If firewall blocks HTTP, create allow rule
gcloud compute firewall-rules create allow-http \
--allow tcp:80 \
--source-ranges 0.0.0.0/0
# 8. Retry certificate issuance
kubectl delete certificaterequest -n <namespace> $(kubectl get certificaterequest -n <namespace> -o name)
If: Manual certificate renewal needed (last resort)
# 1. Generate new certificate manually with certbot
certbot certonly --manual --preferred-challenges dns \
-d <domain> -d *.<domain>
# 2. Update DNS TXT record as instructed by certbot
# Wait for DNS propagation (1-5 minutes)
# 3. Complete certbot challenge
# Certbot will save certificate to /etc/letsencrypt/live/<domain>/
# 4. Create Kubernetes secret with new certificate
kubectl create secret tls <cert-name> -n <namespace> \
--cert=/etc/letsencrypt/live/<domain>/fullchain.pem \
--key=/etc/letsencrypt/live/<domain>/privkey.pem
# 5. Update ingress to use new secret
kubectl edit ingress -n <namespace> <ingress-name>
# Verify spec.tls[].secretName matches new secret name
# 6. Verify HTTPS is working
curl -I https://<domain>
# 7. Fix cert-manager issue to prevent manual renewals in future
# This is a temporary workaround only!
Escalation Criteria
Escalate to Senior Engineer if:
- Certificate expires in <3 days and not renewing
- cert-manager issues persist after restart
- DNS provider integration broken
Escalate to Engineering Lead if:
- Certificate expires in <24 hours
- Multiple certificates failing to renew
- Need to switch certificate provider
Escalate to VP Engineering + Legal if:
- Production certificate expired (causing outage)
- Customer data exposure risk due to TLS issues
- Need to purchase commercial certificates (e.g., DigiCert)
Warning Alert Procedures
7. HighNodeCPUUsage
Alert Definition:
alert: HighNodeCPUUsage
expr: (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)) > 0.80
for: 10m
severity: warning
Impact: Node under high load. May affect performance. Pods may be throttled.
Investigation Steps
- Identify affected node
kubectl top nodes
- Check pod CPU usage on the node
kubectl top pods --all-namespaces --field-selector spec.nodeName=<node-name> --sort-by=cpu
- Check for CPU-intensive processes
# Use metrics in Grafana
# Query: topk(10, rate(container_cpu_usage_seconds_total{node="<node-name>"}[5m]))
Remediation Actions
Option 1: Scale application horizontally
# Add more replicas to distribute load
kubectl scale deployment/<deployment-name> -n <namespace> --replicas=<new-count>
# Or enable HPA
kubectl autoscale deployment/<deployment-name> -n <namespace> \
--cpu-percent=70 --min=2 --max=10
Option 2: Increase node CPU limits
# Edit deployment to increase CPU limits
kubectl edit deployment -n <namespace> <deployment-name>
# Increase resources.limits.cpu
Option 3: Add more nodes to cluster
# For GKE, resize node pool
gcloud container clusters resize <cluster-name> \
--node-pool=<pool-name> \
--num-nodes=<new-count> \
--zone=<zone>
Escalation Criteria
- Escalate if CPU >90% for >30 minutes
- Escalate if performance degradation reported by users
8. HighNodeMemoryUsage
Alert Definition:
alert: HighNodeMemoryUsage
expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.85
for: 10m
severity: warning
Impact: Node running out of memory. May trigger OOM kills.
Investigation Steps
- Identify affected node
kubectl top nodes
- Check pod memory usage on the node
kubectl top pods --all-namespaces --field-selector spec.nodeName=<node-name> --sort-by=memory
- Check for memory leaks
# Use Grafana to view memory trends
# Query: container_memory_usage_bytes{node="<node-name>"}
# Look for steadily increasing memory over time
Remediation Actions
Option 1: Restart memory-leaking pods
kubectl delete pod -n <namespace> <pod-name>
# Or rollout restart
kubectl rollout restart deployment/<deployment-name> -n <namespace>
Option 2: Increase memory limits
kubectl edit deployment -n <namespace> <deployment-name>
# Increase resources.limits.memory
Option 3: Scale horizontally
kubectl scale deployment/<deployment-name> -n <namespace> --replicas=<new-count>
Escalation Criteria
- Escalate if memory >95% for >15 minutes
- Escalate if OOMKilled events detected
9. HighRequestLatency
Alert Definition:
alert: HighRequestLatency
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1.0
for: 10m
severity: warning
Impact: Slow responses. Users experiencing delays.
See detailed procedure in Critical Alert #5 (HighLatency) - same investigation and remediation steps apply.
10. PodOOMKilled
Alert Definition:
alert: PodOOMKilled
expr: kube_pod_container_status_terminated_reason{reason="OOMKilled"} > 0
for: 1m
severity: warning
Impact: Container killed due to out-of-memory. Service may be unavailable briefly.
Investigation Steps
- Identify OOMKilled pod
kubectl get pods --all-namespaces -o json | \
jq -r '.items[] | select(.status.containerStatuses[]?.lastState.terminated.reason == "OOMKilled") |
[.metadata.namespace, .metadata.name] | @tsv'
- Check memory limits
kubectl get pod -n <namespace> <pod-name> -o jsonpath='{.spec.containers[0].resources}'
- Check memory usage before OOM
# Query in Grafana:
# container_memory_usage_bytes{pod="<pod-name>"}
Remediation Actions
Increase memory limits
kubectl edit deployment -n <namespace> <deployment-name>
# Increase resources.limits.memory (e.g., 512Mi → 1Gi)
Check for memory leaks
# If memory increases steadily over time, likely a leak
# Enable heap profiling and investigate
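If the suspect service is a Go binary with net/http/pprof enabled (an assumption; adapt to whatever profiler the service actually exposes), a heap snapshot can be pulled for offline analysis:
# Forward the pprof port (6060 is the conventional pprof port; adjust if different)
kubectl port-forward -n <namespace> <pod-name> 6060:6060 &
PF_PID=$!
sleep 2
# Grab a heap snapshot; inspect later with: go tool pprof heap.pprof
curl -s "http://127.0.0.1:6060/debug/pprof/heap" -o heap.pprof
kill "$PF_PID"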
Escalation Criteria
- Escalate if OOMKilled repeatedly (>3 times in 1 hour)
- Escalate if memory leak suspected
11. PersistentVolumeClaimPending
Alert Definition:
alert: PersistentVolumeClaimPending
expr: kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1
for: 5m
severity: warning
Impact: Pod cannot start due to unbound PVC. Service may be unavailable.
Investigation Steps
- Identify pending PVC
kubectl get pvc --all-namespaces | grep Pending
- Check PVC details
kubectl describe pvc -n <namespace> <pvc-name>
- Check storage class
kubectl get storageclass
kubectl describe storageclass <storage-class-name>
Remediation Actions
If: No storage class exists
# Create storage class (example for GKE)
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
EOF
# Update PVC to use storage class
kubectl edit pvc -n <namespace> <pvc-name>
# Set storageClassName: fast-ssd
If: Storage quota exceeded
# Check quota
kubectl get resourcequota -n <namespace>
# Increase quota if needed
kubectl edit resourcequota -n <namespace> <quota-name>
If: Node affinity preventing binding
# Check if PV has node affinity that doesn't match any node
kubectl get pv | grep Available
kubectl describe pv <pv-name>
# May need to delete PV and recreate without affinity
Escalation Criteria
- Escalate if PVC pending for >15 minutes
- Escalate if quota increase needed
12. DeploymentReplicasMismatch
Alert Definition:
alert: DeploymentReplicasMismatch
expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
for: 15m
severity: warning
Impact: Deployment not at desired replica count. May affect availability or capacity.
Investigation Steps
- Identify affected deployment
kubectl get deployments --all-namespaces
# Look for deployments where READY != DESIRED
- Check pod status
kubectl get pods -n <namespace> -l app=<deployment-name>
- Check for pod errors
kubectl describe pod -n <namespace> <pod-name>
Remediation Actions
If: Pods pending due to resources
# Check pending reason
kubectl describe pod -n <namespace> <pod-name> | grep -A 5 Events
# If "Insufficient cpu" or "Insufficient memory":
# - Add more nodes, or
# - Reduce resource requests
If: Image pull error
# Fix image name or credentials
kubectl set image deployment/<deployment-name> <container>=<correct-image> -n <namespace>
If: Pods crashing
# See PodCrashLoopBackOff procedure (Critical Alert #1)
Escalation Criteria
- Escalate if mismatch persists for >30 minutes
- Escalate if related to resource capacity issues
13. LowCacheHitRate
Alert Definition:
alert: LowCacheHitRate
expr: redis_keyspace_hits_total / (redis_keyspace_hits_total + redis_keyspace_misses_total) < 0.50
for: 15m
severity: warning
Impact: Increased latency and load on database due to cache misses.
Investigation Steps
- Check cache hit rate in Grafana
# Query: redis_keyspace_hits_total / (redis_keyspace_hits_total + redis_keyspace_misses_total)
- Check cache size and memory
kubectl exec -it -n <namespace> <redis-pod> -- redis-cli INFO memory
- Check cache eviction rate
kubectl exec -it -n <namespace> <redis-pod> -- redis-cli INFO stats | grep evicted_keys
Remediation Actions
If: Cache too small (frequent evictions)
# Increase Redis memory
kubectl edit statefulset -n <namespace> redis
# Increase resources.limits.memory
# Restart Redis
kubectl delete pod -n <namespace> <redis-pod>
If: Cache TTL too short
# Increase TTL in application config
kubectl edit configmap -n <namespace> <service>-config
# Increase CACHE_TTL value
# Restart service
kubectl rollout restart deployment/<deployment-name> -n <namespace>
If: Data access patterns changed
# Implement cache warming
# Pre-populate cache with frequently accessed data
# Adjust cache strategy (e.g., cache-aside vs. write-through)
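A minimal cache-warming sketch: replay the hottest read requests against the service so it repopulates the cache itself. The endpoint paths below are placeholders; substitute the endpoints that dominate your miss traffic.
SERVICE_URL="http://<service>.<namespace>.svc.cluster.local"
for path in /api/v1/models /api/v1/config /api/v1/popular-items; do   # placeholder hot paths
  # Print the status code and path so failed warm-up requests stand out
  printf '%s %s\n' "$(curl -s -o /dev/null -w '%{http_code}' "${SERVICE_URL}${path}")" "$path"
done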
Escalation Criteria
- Escalate if hit rate <30% for >1 hour
- Escalate if causing user-facing latency issues
Informational Alert Procedures
14. NewDeploymentDetected
Alert Definition:
alert: NewDeploymentDetected
expr: changes(kube_deployment_status_observed_generation[5m]) > 0
severity: info
Impact: Informational. No immediate action required.
Actions
- Verify deployment in kubectl
kubectl rollout status deployment/<deployment-name> -n <namespace>
- Monitor for related alerts (errors, crashes, latency)
# Check Alertmanager for any new critical/warning alerts
- Document in change log if significant deployment
15. HPAScaledUp / HPAScaledDown
Alert Definition:
alert: HPAScaledUp
expr: changes(kube_horizontalpodautoscaler_status_current_replicas[5m]) > 0
severity: info
Impact: Informational. HPA adjusted replica count based on load.
Actions
- Verify scaling event in Grafana
# Query: kube_horizontalpodautoscaler_status_current_replicas{hpa="<hpa-name>"}
- Check if scaling is expected (e.g., during peak hours)
- If scaling too frequent, adjust HPA thresholds:
kubectl edit hpa -n <namespace> <hpa-name>
# Adjust targetCPUUtilizationPercentage
16. ConfigMapChanged
Alert Definition:
alert: ConfigMapChanged
expr: changes(kube_configmap_info[5m]) > 0
severity: info
Impact: Informational. ConfigMap updated.
Actions
- Identify changed ConfigMap
kubectl get configmap --all-namespaces --sort-by=.metadata.creationTimestamp
- Verify change was intentional
- Restart pods if needed to pick up new config:
kubectl rollout restart deployment/<deployment-name> -n <namespace>
Multi-Alert Scenarios
Scenario 1: Multiple Pods Crashing + Node NotReady
Symptoms:
- Alert: PodCrashLoopBackOff (multiple pods)
- Alert: NodeNotReady (1 node)
Root Cause: Node failure causing all pods on that node to crash.
Investigation:
- Identify which pods are on the failing node
- Check node status (see NodeNotReady procedure)
Remediation:
- Cordon and drain the failing node
- Pods will be rescheduled to healthy nodes
- Replace the failed node
Scenario 2: High Error Rate + Database Connection Pool Exhausted
Symptoms:
- Alert: HighErrorRate (>10% 5xx errors)
- Alert: DatabaseConnectionPoolExhausted (>95% pool usage)
Root Cause: Connection pool exhaustion causing service errors.
Investigation:
- Check if error rate corresponds to pool exhaustion timing
- Check for long-running database queries
Remediation:
- Restart service to release connections
- Increase connection pool size
- Optimize slow queries
Scenario 3: High Latency + Low Cache Hit Rate + High Database Load
Symptoms:
- Alert: HighLatency (P95 >1s)
- Alert: LowCacheHitRate (<50%)
- Observation: High database CPU
Root Cause: Cache ineffectiveness causing excessive database load and slow queries.
Investigation:
- Check cache hit rate timeline
- Check database query volume
- Identify cache misses by key pattern
Remediation:
- Increase cache size
- Increase cache TTL
- Implement cache warming for common queries
- Add database indexes for frequent queries
Escalation Decision Trees
Decision Tree 1: Service Outage
Service completely unavailable (100% error rate)?
├─ YES → CRITICAL - Page on-call engineer
│ ├─ Multiple services down?
│ │ ├─ YES → Page Engineering Lead + VP Eng
│ │ └─ NO → Continue troubleshooting
│ └─ Customer-reported on social media?
│ ├─ YES → Notify VP Eng + Customer Success
│ └─ NO → Continue troubleshooting
└─ NO → Check error rate
├─ >50% error rate?
│ ├─ YES → Page on-call engineer
│ └─ NO → Assign to on-call engineer (Slack)
└─ <10% error rate?
└─ YES → Create ticket, no immediate page
Decision Tree 2: Performance Degradation
Users reporting slow performance?
├─ YES → Check latency metrics
│ ├─ P95 >2s?
│ │ ├─ YES → CRITICAL - Page on-call engineer
│ │ └─ NO → Assign to on-call engineer
│ └─ P95 >1s but <2s?
│ ├─ YES → WARNING - Notify on-call engineer (Slack)
│ └─ NO → Create ticket for investigation
└─ NO → Proactive monitoring
└─ P95 >1s for >15m?
├─ YES → Investigate proactively
└─ NO → Continue monitoring
Decision Tree 3: Infrastructure Issue
Node or infrastructure alert?
├─ NodeNotReady?
│ ├─ Single node?
│ │ ├─ YES → Cordon, drain, replace
│ │ └─ NO → Multiple nodes - Page Engineering Lead
│ └─ >30% of nodes affected?
│ └─ YES → CRITICAL - Page VP Eng + GCP Support
└─ Disk/Memory pressure?
├─ Can be resolved with cleanup?
│ ├─ YES → Clean up and monitor
│ └─ NO → Page on-call engineer for node replacement
Post-Incident Actions
After Resolving Critical Alerts
- Document resolution in incident tracker
- Root cause
- Actions taken
- Time to resolution
- Services affected
- Create post-incident review (PIR) for critical incidents
- Timeline of events
- Impact assessment
- Contributing factors
- Action items to prevent recurrence
- Update runbooks if new issue discovered
- Add new troubleshooting steps
- Update remediation procedures
- Document lessons learned
- Implement preventive measures
- Add monitoring for early detection
- Improve alerting thresholds
- Automate remediation where possible
- Communicate to stakeholders
- Internal: Engineering team, leadership
- External: Customers (if user-impacting)
- Status page update
Post-Incident Review Template
# Post-Incident Review: <Incident Title>
**Date**: YYYY-MM-DD
**Severity**: Critical / Warning
**Duration**: X hours Y minutes
**Services Affected**: <list>
## Summary
<1-2 sentence summary of incident>
## Timeline
| Time (UTC) | Event |
|------------|-------|
| 14:00 | Alert triggered: HighErrorRate |
| 14:05 | On-call engineer acknowledged |
| 14:10 | Root cause identified: database connection pool exhausted |
| 14:15 | Mitigation applied: restarted service |
| 14:20 | Incident resolved: error rate returned to normal |
## Root Cause
<Detailed explanation of what caused the incident>
## Impact
- **User Impact**: X% of requests resulted in errors
- **Revenue Impact**: $Y estimated lost revenue
- **Duration**: X hours Y minutes
## Resolution
<What was done to resolve the incident>
## Contributing Factors
1. Factor 1
2. Factor 2
## Action Items
1. [ ] Increase connection pool size (Owner: @engineer, Due: YYYY-MM-DD)
2. [ ] Add alert for connection pool usage (Owner: @engineer, Due: YYYY-MM-DD)
3. [ ] Update runbook with new procedure (Owner: @engineer, Due: YYYY-MM-DD)
## Lessons Learned
- What went well
- What could be improved
- What we learned
Summary
This alert response procedures document provides detailed, step-by-step guidance for responding to all alerts in the OctoLLM monitoring system. Key points:
- Critical alerts require immediate action (acknowledge within 5 minutes, resolve within 1 hour)
- Warning alerts require timely action (acknowledge within 30 minutes, resolve within 4 hours)
- Info alerts are informational and require no immediate action
Each procedure includes:
- Alert definition and impact
- Investigation steps with commands
- Remediation actions with code examples
- Escalation criteria
For all incidents:
- Follow the general response workflow (acknowledge → assess → investigate → remediate → document → close)
- Use the escalation decision trees to determine when to involve senior engineers or leadership
- Complete post-incident reviews for critical incidents
- Update runbooks with lessons learned
Related Documents:
- Monitoring Runbook: /home/parobek/Code/OctoLLM/docs/operations/monitoring-runbook.md
- Deployment Guide: /home/parobek/Code/OctoLLM/docs/deployment-guide.md
- Backup and Restore: /home/parobek/Code/OctoLLM/docs/operations/backup-restore.md