Alert Response Procedures
Document Version: 1.0.0 | Last Updated: 2025-11-12 | Owner: OctoLLM Operations Team | Status: Production
Table of Contents
- Overview
- Response Workflow
- Critical Alert Procedures
- Warning Alert Procedures
- Informational Alert Procedures
- Multi-Alert Scenarios
- Escalation Decision Trees
- Post-Incident Actions
Overview
This document provides step-by-step procedures for responding to alerts from the OctoLLM monitoring system. Each procedure includes:
- Detection: How the alert is triggered
- Impact: What this means for users and the system
- Investigation Steps: How to diagnose the issue
- Remediation Actions: How to fix the problem
- Escalation Criteria: When to involve senior engineers or management
Alert Severity Levels:
- Critical: Immediate action required, user-impacting, PagerDuty notification
- Warning: Action required within 1 hour, potential user impact, Slack notification
- Info: No immediate action required, informational only, logged to Slack
Response Time SLAs:
- Critical: Acknowledge within 5 minutes, resolve within 1 hour
- Warning: Acknowledge within 30 minutes, resolve within 4 hours
- Info: Review within 24 hours
Response Workflow
General Alert Response Process
1. ACKNOWLEDGE
└─> Acknowledge alert in PagerDuty/Slack
└─> Note start time in incident tracker
2. ASSESS
└─> Check alert details (service, namespace, severity)
└─> Review recent deployments or changes
└─> Check for related alerts
3. INVESTIGATE
└─> Follow specific alert procedure (see sections below)
└─> Gather logs, metrics, traces
└─> Identify root cause
4. REMEDIATE
└─> Apply fix (restart, scale, rollback, etc.)
└─> Verify fix with metrics/logs
└─> Monitor for 10-15 minutes
5. DOCUMENT
└─> Update incident tracker with resolution
└─> Create post-incident review if critical
└─> Update runbooks if new issue discovered
6. CLOSE
└─> Resolve alert in PagerDuty/Slack
└─> Confirm no related alerts remain
Tools Quick Reference
- Grafana: https://grafana.octollm.dev
- Prometheus: https://prometheus.octollm.dev
- Jaeger: https://jaeger.octollm.dev
- Alertmanager: https://alertmanager.octollm.dev
- kubectl: CLI access to Kubernetes cluster
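If an alert looks suspicious, it can help to confirm the monitoring stack itself is reachable before digging in. This is a minimal sketch assuming the default health paths for Grafana (/api/health), Prometheus (/-/healthy), and Alertmanager (/-/healthy) are exposed through the ingress; adjust if yours differ.
for url in \
  https://grafana.octollm.dev/api/health \
  https://prometheus.octollm.dev/-/healthy \
  https://alertmanager.octollm.dev/-/healthy; do
  # Print the HTTP status code for each endpoint (200 = healthy)
  printf '%s -> %s\n' "$url" "$(curl -s -o /dev/null -w '%{http_code}' "$url")"
done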
Critical Alert Procedures
1. PodCrashLoopBackOff
Alert Definition:
alert: PodCrashLoopBackOff
expr: rate(kube_pod_container_status_restarts_total{namespace=~"octollm.*"}[10m]) > 0.3
for: 5m
severity: critical
Impact: Service degradation or complete outage. Users may experience errors or timeouts.
Investigation Steps
Step 1: Identify the crashing pod
# List pods with high restart counts
kubectl get pods -n <namespace> --sort-by=.status.containerStatuses[0].restartCount
# Example output:
# NAME READY STATUS RESTARTS AGE
# orchestrator-7d9f8c-xk2p9 0/1 CrashLoopBackOff 12 30m
Step 2: Check pod logs
# Get recent logs from crashing container
kubectl logs -n <namespace> <pod-name> --tail=100
# Get logs from previous container instance
kubectl logs -n <namespace> <pod-name> --previous
# Common error patterns:
# - "Connection refused" → Dependency unavailable
# - "Out of memory" → Resource limits too low
# - "Panic: runtime error" → Code bug
# - "Permission denied" → RBAC or volume mount issue
Step 3: Check pod events
kubectl describe pod -n <namespace> <pod-name>
# Look for events like:
# - "Back-off restarting failed container"
# - "Error: ErrImagePull"
# - "FailedMount"
# - "OOMKilled"
Step 4: Check resource usage
# Check if pod is OOMKilled
kubectl get pod -n <namespace> <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Check resource requests/limits
kubectl get pod -n <namespace> <pod-name> -o jsonpath='{.spec.containers[0].resources}'
Step 5: Check configuration
# Verify environment variables
kubectl get pod -n <namespace> <pod-name> -o jsonpath='{.spec.containers[0].env}'
# Check ConfigMap/Secret mounts
kubectl describe configmap -n <namespace> <configmap-name>
kubectl describe secret -n <namespace> <secret-name>
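As an optional shortcut for Steps 1-4, a single jq pass over the pod list summarizes restart counts and last termination reasons for the whole namespace. The namespace value below is an example; this is a convenience sketch, not part of the standard procedure.
NAMESPACE=octollm-prod   # example namespace
kubectl get pods -n "$NAMESPACE" -o json | jq -r '
  .items[] |
  [ .metadata.name,
    (.status.containerStatuses[0].restartCount // 0),
    (.status.containerStatuses[0].lastState.terminated.reason // "-") ] | @tsv'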
Remediation Actions
If: Connection refused to dependency (DB, Redis, etc.)
# 1. Check if dependency service is healthy
kubectl get pods -n <namespace> -l app=<dependency>
# 2. Test connectivity from within cluster
kubectl run -it --rm debug --image=busybox --restart=Never -- sh
# Inside pod: nc -zv <service-name> <port>
# 3. Check service endpoints
kubectl get endpoints -n <namespace> <service-name>
# 4. If dependency is down, restart it first
kubectl rollout restart deployment/<dependency-name> -n <namespace>
# 5. Wait for dependency to be ready, then restart affected pod
kubectl delete pod -n <namespace> <pod-name>
If: Out of memory (OOMKilled)
# 1. Check current memory usage in Grafana
# Query: container_memory_usage_bytes{pod="<pod-name>"}
# 2. Increase memory limits
kubectl edit deployment -n <namespace> <deployment-name>
# Increase resources.limits.memory (e.g., from 512Mi to 1Gi)
# 3. Monitor memory usage after restart
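As an alternative to kubectl edit, the memory limit can be raised non-interactively with a JSON patch. This sketch assumes the container to change is the first one in the pod spec and that 1Gi is the desired new limit; adjust the index and value as needed.
kubectl patch deployment <deployment-name> -n <namespace> --type='json' -p='[
  {"op": "replace",
   "path": "/spec/template/spec/containers/0/resources/limits/memory",
   "value": "1Gi"}
]'
# (Use "op": "add" instead of "replace" if no memory limit is currently set)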
If: Image pull error
# 1. Check image name and tag
kubectl get pod -n <namespace> <pod-name> -o jsonpath='{.spec.containers[0].image}'
# 2. Verify image exists in registry
gcloud container images list --repository=gcr.io/<project-id>
# 3. Check image pull secrets
kubectl get secrets -n <namespace> | grep gcr
# 4. If image is wrong, update deployment
kubectl set image deployment/<deployment-name> <container-name>=<correct-image> -n <namespace>
If: Configuration error
# 1. Validate ConfigMap/Secret exists and has correct data
kubectl get configmap -n <namespace> <configmap-name> -o yaml
# 2. If config is wrong, update it
kubectl edit configmap -n <namespace> <configmap-name>
# 3. Restart pods to pick up new config
kubectl rollout restart deployment/<deployment-name> -n <namespace>
If: Code bug (panic, runtime error)
# 1. Check Jaeger for traces showing error
# Navigate to https://jaeger.octollm.dev
# Search for service: <service-name>, operation: <failing-operation>
# 2. Identify commit that introduced bug
kubectl get deployment -n <namespace> <deployment-name> -o jsonpath='{.spec.template.spec.containers[0].image}'
# 3. Rollback to previous version
kubectl rollout undo deployment/<deployment-name> -n <namespace>
# 4. Verify rollback
kubectl rollout status deployment/<deployment-name> -n <namespace>
# 5. Create incident ticket with logs/traces
# Subject: "CrashLoopBackOff in <service> due to <error>"
# Include: logs, traces, reproduction steps
If: Persistent volume mount failure
# 1. Check PVC status
kubectl get pvc -n <namespace>
# 2. Check PVC events
kubectl describe pvc -n <namespace> <pvc-name>
# 3. If PVC is pending, check storage class
kubectl get storageclass
# 4. If PVC is lost, restore from backup (see backup-restore.md)
Escalation Criteria
Escalate to Senior Engineer if:
- Root cause not identified within 15 minutes
- Multiple pods crashing across different services
- Rollback does not resolve the issue
- Data loss suspected
Escalate to Engineering Lead if:
- Critical service (orchestrator, reflex-layer) down for >30 minutes
- Root cause requires code fix (cannot be resolved via config/restart)
Escalate to VP Engineering if:
- Complete outage (all services down)
- Data corruption suspected
- Estimated resolution time >2 hours
2. NodeNotReady
Alert Definition:
alert: NodeNotReady
expr: kube_node_status_condition{condition="Ready",status="false"} == 1
for: 5m
severity: critical
Impact: Reduced cluster capacity. Pods on the node are evicted and rescheduled. Possible service degradation.
Investigation Steps
Step 1: Identify unhealthy node
# List all nodes with status
kubectl get nodes -o wide
# Example output:
# NAME STATUS ROLES AGE VERSION
# gke-cluster-pool-1-abc Ready <none> 10d v1.28.3
# gke-cluster-pool-1-def NotReady <none> 10d v1.28.3 ← Problem node
Step 2: Check node conditions
kubectl describe node <node-name>
# Look for conditions:
# - Ready: False
# - MemoryPressure: True
# - DiskPressure: True
# - PIDPressure: True
# - NetworkUnavailable: True
Step 3: Check node resource usage
# Check node metrics
kubectl top node <node-name>
# Query in Grafana:
# CPU: node_cpu_seconds_total{instance="<node-name>"}
# Memory: node_memory_MemAvailable_bytes{instance="<node-name>"}
# Disk: node_filesystem_avail_bytes{instance="<node-name>"}
Step 4: Check kubelet logs (if SSH access available)
# SSH to node (GKE nodes)
gcloud compute ssh <node-name> --zone=<zone>
# Check kubelet status
sudo systemctl status kubelet
# Check kubelet logs
sudo journalctl -u kubelet --since "30 minutes ago"
Step 5: Check pods on the node
# List pods running on the node
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>
# Check if critical pods are affected
kubectl get pods -n octollm-prod --field-selector spec.nodeName=<node-name>
Remediation Actions
If: Disk pressure (disk full)
# 1. Check disk usage on node
gcloud compute ssh <node-name> --zone=<zone> --command "df -h"
# 2. Identify large files/directories
gcloud compute ssh <node-name> --zone=<zone> --command "du -sh /var/lib/docker/containers/* | sort -rh | head -20"
# 3. Clean up old container logs
gcloud compute ssh <node-name> --zone=<zone> --command "sudo find /var/lib/docker/containers -name '*-json.log' -type f -mtime +7 -delete"
# 4. Clean up unused Docker images
gcloud compute ssh <node-name> --zone=<zone> --command "sudo docker system prune -a -f"
# 5. If still full, cordon and drain the node
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# 6. Delete and recreate node (GKE auto-repairs)
# Node will be automatically replaced by GKE
If: Memory pressure
# 1. Check memory usage
kubectl top node <node-name>
# 2. Identify memory-hungry pods
kubectl top pods --all-namespaces --field-selector spec.nodeName=<node-name> --sort-by=memory
# 3. Check if any pods have memory leaks
# Use Grafana to view memory trends over time
# Query: container_memory_usage_bytes{node="<node-name>"}
# 4. Evict non-critical pods to free memory
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force
# 5. Wait for pods to be rescheduled
kubectl get pods --all-namespaces -o wide | grep <node-name>
# 6. Uncordon node if memory stabilizes
kubectl uncordon <node-name>
# 7. If memory pressure persists, replace node
# Delete node and let GKE auto-repair create new one
If: Network unavailable
# 1. Check network connectivity from node
gcloud compute ssh <node-name> --zone=<zone> --command "ping -c 5 8.8.8.8"
# 2. Check CNI plugin status (GKE uses kubenet or Calico)
gcloud compute ssh <node-name> --zone=<zone> --command "sudo systemctl status kubenet"
# 3. Check for network plugin errors
gcloud compute ssh <node-name> --zone=<zone> --command "sudo journalctl -u kubenet --since '30 minutes ago'"
# 4. Restart network services (risky - only if node is already unusable)
gcloud compute ssh <node-name> --zone=<zone> --command "sudo systemctl restart kubenet"
# 5. If network issue persists, cordon and drain
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force
# 6. Delete node and let GKE replace it
gcloud compute instances delete <node-name> --zone=<zone>
If: Kubelet not responding
# 1. Check kubelet process
gcloud compute ssh <node-name> --zone=<zone> --command "sudo systemctl status kubelet"
# 2. Restart kubelet
gcloud compute ssh <node-name> --zone=<zone> --command "sudo systemctl restart kubelet"
# 3. Wait 2 minutes and check node status
kubectl get node <node-name>
# 4. If node returns to Ready, uncordon
kubectl uncordon <node-name>
# 5. If kubelet fails to start, check logs
gcloud compute ssh <node-name> --zone=<zone> --command "sudo journalctl -u kubelet -n 100"
# 6. If cannot resolve, cordon, drain, and delete node
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force
gcloud compute instances delete <node-name> --zone=<zone>
If: Hardware failure (rare in GKE)
# 1. Check for hardware errors in system logs
gcloud compute ssh <node-name> --zone=<zone> --command "dmesg | grep -i error"
# 2. Check for I/O errors
gcloud compute ssh <node-name> --zone=<zone> --command "dmesg | grep -i 'i/o error'"
# 3. Cordon and drain immediately
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force
# 4. Delete node - GKE will create replacement
gcloud compute instances delete <node-name> --zone=<zone>
# 5. Monitor new node creation
kubectl get nodes -w
Escalation Criteria
Escalate to Senior Engineer if:
- Multiple nodes NotReady simultaneously
- Node cannot be drained (pods stuck in terminating state)
- Network issues affecting entire node pool
Escalate to Engineering Lead if:
- >30% of nodes NotReady
- Node failure pattern suggests cluster-wide issue
- Auto-repair not creating replacement nodes
Escalate to VP Engineering + GCP Support if:
- Complete cluster failure (all nodes NotReady)
- GKE control plane unreachable
- Suspected GCP infrastructure issue
3. HighErrorRate
Alert Definition:
alert: HighErrorRate
expr: sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (service) (rate(http_requests_total[5m])) > 0.1
for: 5m
severity: critical
Impact: Users experiencing errors (500, 502, 503, 504). Service availability degraded.
Investigation Steps
Step 1: Identify affected service
# Check error rate in Grafana
# Dashboard: GKE Service Health
# Panel: "Error Rate (5xx) by Service"
# Identify which service has >10% error rate
Step 2: Check recent deployments
# List recent rollouts
kubectl rollout history deployment/<deployment-name> -n <namespace>
# Check when error rate started
# Compare with deployment timestamp in Grafana
Step 3: Analyze error patterns
# Query Loki for error logs
# LogQL: {namespace="<namespace>", service="<service>", level="error"} |= "5xx" | json
# Look for patterns:
# - Specific endpoints failing
# - Common error messages
# - Correlation with other services
Step 4: Check dependencies
# Check if errors are due to downstream dependencies
# Use Jaeger to trace requests
# Navigate to https://jaeger.octollm.dev
# Search for service: <service-name>
# Filter by error status: error=true
# Common dependency issues:
# - Database connection pool exhausted
# - Redis timeout
# - External API rate limiting
# - Inter-service timeout
Step 5: Check resource utilization
# Check if service is resource-constrained
kubectl top pods -n <namespace> -l app=<service>
# Query CPU/memory in Grafana:
# CPU: rate(container_cpu_usage_seconds_total{pod=~"<service>.*"}[5m])
# Memory: container_memory_usage_bytes{pod=~"<service>.*"}
Remediation Actions
If: Error rate increased after recent deployment
# 1. Verify deployment timing matches error spike
kubectl rollout history deployment/<deployment-name> -n <namespace>
# 2. Check logs from new pods
kubectl logs -n <namespace> -l app=<service> --tail=100 | grep -i error
# 3. Rollback to previous version
kubectl rollout undo deployment/<deployment-name> -n <namespace>
# 4. Monitor error rate after rollback
# Should decrease within 2-5 minutes
# 5. Verify rollback success
kubectl rollout status deployment/<deployment-name> -n <namespace>
# 6. Create incident ticket with error logs
# Block new deployment until issue is resolved
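To confirm the rollback from the terminal instead of Grafana, the same 5xx ratio the alert uses can be queried straight from the Prometheus HTTP API. This is a sketch that assumes the metric and label names from the alert definition above; substitute the real service label value.
PROM=https://prometheus.octollm.dev
QUERY='sum(rate(http_requests_total{service="<service>",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="<service>"}[5m]))'
# A value near 0 (or "no data" when there are no 5xx series) indicates recovery
curl -sG "$PROM/api/v1/query" --data-urlencode "query=$QUERY" \
  | jq -r '.data.result[0].value[1] // "no data"'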
If: Database connection pool exhausted
# 1. Verify in Grafana
# Query: db_pool_active_connections{service="<service>"} / db_pool_max_connections{service="<service>"}
# 2. Check for connection leaks
# Look for long-running queries in database
# PostgreSQL: SELECT * FROM pg_stat_activity WHERE state = 'active' AND query_start < NOW() - INTERVAL '5 minutes';
# 3. Restart service to clear connections
kubectl rollout restart deployment/<deployment-name> -n <namespace>
# 4. If issue persists, increase connection pool size
kubectl edit configmap -n <namespace> <service>-config
# Increase DB_POOL_SIZE (e.g., from 20 to 40)
# 5. Restart to apply new config
kubectl rollout restart deployment/<deployment-name> -n <namespace>
# 6. Monitor connection pool usage
# Should stay below 80% of max
If: Downstream service timeout
# 1. Identify failing dependency from Jaeger traces
# Look for spans with error=true and long duration
# 2. Check health of downstream service
kubectl get pods -n <namespace> -l app=<downstream-service>
# 3. Check latency of downstream service
# Grafana query: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="<downstream-service>"}[5m]))
# 4. If downstream is slow, scale it up
kubectl scale deployment/<downstream-service> -n <namespace> --replicas=<new-count>
# 5. Increase timeout in calling service (if downstream is legitimately slow)
kubectl edit configmap -n <namespace> <service>-config
# Increase timeout (e.g., from 5s to 10s)
# 6. Restart calling service
kubectl rollout restart deployment/<deployment-name> -n <namespace>
If: External API rate limiting
# 1. Verify in logs
kubectl logs -n <namespace> -l app=<service> | grep -i "rate limit\|429\|too many requests"
# 2. Check rate limit configuration
kubectl get configmap -n <namespace> <service>-config -o yaml | grep -i rate
# 3. Reduce request rate (add caching, implement backoff)
# Short-term: Reduce replica count to lower total requests
kubectl scale deployment/<deployment-name> -n <namespace> --replicas=<reduced-count>
# 4. Implement circuit breaker (code change required)
# Long-term fix: Add circuit breaker to prevent cascading failures
# 5. Contact external API provider for rate limit increase
# Document current usage and justification for higher limits
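For a quick manual check of whether the provider is still returning 429s, and as a sketch of the exponential-backoff behaviour the service itself should implement, a simple curl loop works. The URL below is a placeholder, not a real OctoLLM dependency.
URL="https://api.example.com/v1/endpoint"   # placeholder external API
status=0
for attempt in 1 2 3 4 5; do
  status=$(curl -s -o /dev/null -w '%{http_code}' "$URL")
  [ "$status" != "429" ] && break            # stop as soon as we are not rate limited
  sleep $((2 ** attempt))                     # back off 2s, 4s, 8s, 16s, 32s
done
echo "final status: $status after $attempt attempt(s)"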
If: Memory leak causing OOM errors
# 1. Identify memory trend in Grafana
# Query: container_memory_usage_bytes{pod=~"<service>.*"}
# Look for steady increase over time
# 2. Restart pods to free memory (temporary fix)
kubectl rollout restart deployment/<deployment-name> -n <namespace>
# 3. Increase memory limits (short-term mitigation)
kubectl edit deployment -n <namespace> <deployment-name>
# Increase resources.limits.memory
# 4. Enable heap profiling (if supported)
# Add profiling endpoint to service
# Analyze heap dumps to identify leak
# 5. Create high-priority bug ticket
# Attach memory graphs and profiling data
# Assign to owning team
Escalation Criteria
Escalate to Senior Engineer if:
- Error rate >20% for >10 minutes
- Rollback does not resolve issue
- Root cause unclear after 15 minutes of investigation
Escalate to Engineering Lead if:
- Error rate >50% (severe outage)
- Multiple services affected
- Estimated resolution time >1 hour
Escalate to VP Engineering if:
- Complete service outage (100% error rate)
- Customer-reported errors trending on social media
- Revenue-impacting outage
4. DatabaseConnectionPoolExhausted
Alert Definition:
alert: DatabaseConnectionPoolExhausted
expr: db_pool_active_connections / db_pool_max_connections > 0.95
for: 5m
severity: critical
Impact: Services unable to query database. Users experience errors or timeouts.
Investigation Steps
Step 1: Verify pool exhaustion
# Check current pool usage in Grafana
# Query: db_pool_active_connections{service="<service>"} / db_pool_max_connections{service="<service>"}
# Check which service is affected
# Multiple services may share the same database
Step 2: Check for long-running queries
# Connect to database
kubectl exec -it -n <namespace> <postgres-pod> -- psql -U octollm
# List active connections by service
SELECT application_name, COUNT(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY application_name;
# List long-running queries (>5 minutes)
SELECT pid, application_name, query_start, state, query
FROM pg_stat_activity
WHERE state = 'active'
AND query_start < NOW() - INTERVAL '5 minutes'
ORDER BY query_start;
Step 3: Check for connection leaks
# List idle connections
SELECT application_name, COUNT(*)
FROM pg_stat_activity
WHERE state = 'idle'
GROUP BY application_name;
# If idle count is very high for a service, there's likely a connection leak
# (Idle connections should be returned to pool)
Step 4: Check application logs for connection errors
# Query Loki
# LogQL: {namespace="<namespace>", service="<service>"} |= "connection" |~ "error|timeout|exhausted"
# Common error messages:
# - "unable to acquire connection from pool"
# - "connection pool timeout"
# - "too many clients already"
Step 5: Check database resource usage
# Check database CPU/memory
kubectl top pod -n <namespace> <postgres-pod>
# Check database metrics in Grafana
# CPU: rate(container_cpu_usage_seconds_total{pod="<postgres-pod>"}[5m])
# Memory: container_memory_usage_bytes{pod="<postgres-pod>"}
# Disk I/O: rate(container_fs_reads_bytes_total{pod="<postgres-pod>"}[5m])
Remediation Actions
If: Long-running queries blocking connections
# 1. Identify problematic queries
SELECT pid, application_name, query_start, query
FROM pg_stat_activity
WHERE state = 'active'
AND query_start < NOW() - INTERVAL '5 minutes';
# 2. Terminate long-running queries (careful!)
# Only terminate if you're sure it's safe
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE pid = <pid>;
# 3. Monitor connection pool recovery
# Check Grafana: pool usage should drop below 95%
# 4. Investigate why queries are slow
# Use EXPLAIN ANALYZE to check query plans
# Look for missing indexes or inefficient joins
# 5. Optimize slow queries (code change)
# Create ticket with slow query details
# Add indexes if needed
If: Connection leak in application
# 1. Identify service with high idle connection count
SELECT application_name, COUNT(*)
FROM pg_stat_activity
WHERE state = 'idle'
GROUP BY application_name;
# 2. Restart affected service to release connections
kubectl rollout restart deployment/<deployment-name> -n <namespace>
# 3. Monitor connection pool after restart
# Usage should drop significantly
# 4. Check application code for connection handling
# Ensure connections are properly closed in finally blocks
# Example (Python):
# conn = pool.get_connection()
# try:
#     ...  # use the connection
# finally:
#     conn.close()  # always return the connection to the pool
# 5. Implement connection timeout in pool config
# Add to service ConfigMap:
# DB_POOL_TIMEOUT: 30s
# DB_CONN_MAX_LIFETIME: 1h # Force connection recycling
If: Pool size too small for load
# 1. Check current pool configuration
kubectl get configmap -n <namespace> <service>-config -o yaml | grep DB_POOL
# 2. Calculate required pool size
# Formula: (avg concurrent requests) * (avg query time in seconds) * 1.5
# Example: 100 req/s * 0.1s * 1.5 = 15 connections
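# Worked example of the formula above (the request rate and query time are
# illustrative, not measured values):
awk 'BEGIN { req_per_s = 100; avg_query_s = 0.1;
             printf "suggested DB_POOL_SIZE: %d\n", int(req_per_s * avg_query_s * 1.5 + 0.999) }'
# Prints: suggested DB_POOL_SIZE: 15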
# 3. Increase pool size
kubectl edit configmap -n <namespace> <service>-config
# Update DB_POOL_SIZE (e.g., from 20 to 40)
# 4. Verify database can handle more connections
# PostgreSQL max_connections setting (typically 100-200)
kubectl exec -it -n <namespace> <postgres-pod> -- psql -U octollm -c "SHOW max_connections;"
# 5. If database max_connections is too low, increase it
# Edit PostgreSQL ConfigMap or StatefulSet
# Requires database restart
# 6. Restart service to use new pool size
kubectl rollout restart deployment/<deployment-name> -n <namespace>
# 7. Monitor pool usage
# Target: <80% utilization under normal load
If: Database is resource-constrained
# 1. Check database CPU/memory
kubectl top pod -n <namespace> <postgres-pod>
# 2. If database CPU >80%, check for expensive queries
# Connect to database
kubectl exec -it -n <namespace> <postgres-pod> -- psql -U octollm
# Find most expensive queries
SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
-- On PostgreSQL 13+, these columns are named total_exec_time and mean_exec_time
# 3. If database memory >90%, increase memory limits
kubectl edit statefulset -n <namespace> postgres
# Increase resources.limits.memory
# 4. If database disk I/O high, consider:
# - Adding indexes to reduce table scans
# - Increasing disk IOPS (resize persistent disk)
# - Enabling query result caching
# 5. Scale database vertically (larger instance)
# For managed databases (Cloud SQL), increase machine type
# For self-hosted, increase resource limits and restart
If: Too many services connecting to same database
# 1. Identify which services are using most connections
SELECT application_name, COUNT(*), MAX(query_start)
FROM pg_stat_activity
GROUP BY application_name
ORDER BY COUNT(*) DESC;
# 2. Implement connection pooling at database level
# Deploy PgBouncer between services and database
# PgBouncer multiplexes connections, reducing load on database
# 3. Configure PgBouncer
# pool_mode: transaction (recommended for multiplexing; PgBouncer's own default is session)
# max_client_conn: 1000 (much higher than database limit)
# default_pool_size: 20 (connections to actual database per pool)
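# Hedged sketch of a matching pgbouncer.ini with the settings above (host, dbname,
# and deployment method are illustrative; adapt to how PgBouncer runs in this cluster):
cat > pgbouncer.ini <<'EOF'
[databases]
octollm = host=postgres port=5432 dbname=octollm

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
EOF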
# 4. Update service connection strings to point to PgBouncer
kubectl edit configmap -n <namespace> <service>-config
# Change DB_HOST from postgres:5432 to pgbouncer:6432
# 5. Restart services
kubectl rollout restart deployment/<deployment-name> -n <namespace>
# 6. Monitor PgBouncer metrics
# Check connection multiplexing ratio
Escalation Criteria
Escalate to Senior Engineer if:
- Pool exhaustion persists after restarting services
- Cannot identify source of connection leak
- Database max_connections needs to be increased significantly
Escalate to Database Admin if:
- Database CPU/memory consistently >90%
- Slow queries cannot be optimized with indexes
- Need to implement replication or sharding
Escalate to Engineering Lead if:
- Database outage suspected
- Need to migrate to larger database instance
- Estimated resolution time >1 hour
5. HighLatency
Alert Definition:
alert: HighLatency
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1.0
for: 10m
severity: critical
Impact: Slow response times for users. Degraded user experience. Possible timeout errors.
Investigation Steps
Step 1: Identify affected service and endpoints
# Check latency by service in Grafana
# Dashboard: GKE Service Health
# Panel: "Request Latency (P50/P95/P99)"
# Identify which service has P95 >1s
# Check latency by endpoint
# Query: histogram_quantile(0.95, sum by (le, handler) (rate(http_request_duration_seconds_bucket{service="<service>"}[5m])))
Step 2: Check for recent changes
# List recent deployments
kubectl rollout history deployment/<deployment-name> -n <namespace>
# Check when latency increased
# Compare with deployment timestamp in Grafana
Step 3: Analyze slow requests with Jaeger
# Navigate to https://jaeger.octollm.dev
# 1. Search for service: <service-name>
# 2. Filter by min duration: >1s
# 3. Sort by longest duration
# 4. Click on slowest trace to see span breakdown
# Look for:
# - Which span is slowest (database query, external API call, internal processing)
# - Spans with errors
# - Multiple spans to same service (N+1 query problem)
Step 4: Check resource utilization
# Check if service is CPU-constrained
kubectl top pods -n <namespace> -l app=<service>
# Query CPU in Grafana:
# rate(container_cpu_usage_seconds_total{pod=~"<service>.*"}[5m])
# If CPU near limit, service may be throttled
Step 5: Check dependencies
# Check if downstream services are slow
# Use Jaeger to identify which dependency is slow
# Check database query performance
# Connect to database and check slow query log
# Check cache hit rate (Redis)
# Grafana query: redis_keyspace_hits_total / (redis_keyspace_hits_total + redis_keyspace_misses_total)
Remediation Actions
If: Slow database queries
# 1. Identify slow queries from Jaeger traces
# Look for database spans with duration >500ms
# 2. Connect to database and analyze query
kubectl exec -it -n <namespace> <postgres-pod> -- psql -U octollm
# 3. Use EXPLAIN ANALYZE to check query plan
EXPLAIN ANALYZE <slow-query>;
# 4. Look for sequential scans (bad - should use index)
# Look for "Seq Scan on <table>" in output
# 5. Create missing indexes
CREATE INDEX CONCURRENTLY idx_<table>_<column> ON <table>(<column>);
# CONCURRENTLY allows index creation without locking table
# 6. Monitor query performance after index creation
# Should see immediate improvement in latency
# 7. Update query to use index (if optimizer doesn't automatically)
# Sometimes need to rewrite query to use indexed columns
If: Low cache hit rate
# 1. Check cache hit rate in Grafana
# Query: redis_keyspace_hits_total / (redis_keyspace_hits_total + redis_keyspace_misses_total)
# Target: >80% hit rate
# 2. Check cache size
kubectl exec -it -n <namespace> <redis-pod> -- redis-cli INFO memory
# 3. If cache is too small, increase memory
kubectl edit statefulset -n <namespace> redis
# Increase resources.limits.memory
# 4. Check cache TTL settings
# If TTL too short, increase it
kubectl get configmap -n <namespace> <service>-config -o yaml | grep CACHE_TTL
# 5. Increase cache TTL
kubectl edit configmap -n <namespace> <service>-config
# CACHE_TTL: 600s → 1800s (10m → 30m)
# 6. Restart service to use new TTL
kubectl rollout restart deployment/<deployment-name> -n <namespace>
# 7. Consider implementing cache warming
# Pre-populate cache with frequently accessed data
If: CPU-constrained (throttled)
# 1. Check CPU usage in Grafana
# Query: rate(container_cpu_usage_seconds_total{pod=~"<service>.*"}[5m])
# Compare with CPU limit
# 2. If usage near limit, increase CPU allocation
kubectl edit deployment -n <namespace> <deployment-name>
# Increase resources.limits.cpu (e.g., from 500m to 1000m)
# 3. Monitor latency after change
# Should improve within 2-5 minutes
# 4. If latency persists, consider horizontal scaling
kubectl scale deployment/<deployment-name> -n <namespace> --replicas=<new-count>
# 5. Enable HPA for automatic scaling
kubectl autoscale deployment/<deployment-name> -n <namespace> \
--cpu-percent=70 \
--min=2 \
--max=10
If: External API slow
# 1. Identify slow external API from Jaeger
# Look for HTTP client spans with long duration
# 2. Check if external API has status page
# Navigate to status page (e.g., status.openai.com)
# 3. Implement timeout and circuit breaker
# Prevent one slow API from blocking all requests
# Example circuit breaker config:
# - Failure threshold: 50%
# - Timeout: 5s
# - Cool-down period: 30s
# 4. Add caching for external API responses
# Cache responses for 5-15 minutes if data doesn't change frequently
# 5. Implement fallback mechanism
# Return cached/default data if external API is slow
# Example: Use stale cache data if API timeout
# 6. Contact external API provider
# Request status update or escalation
If: N+1 query problem
# 1. Identify N+1 pattern in Jaeger
# Multiple sequential database queries in a loop
# Example: 1 query to get list + N queries to get details
# 2. Check application code
# Look for loops that execute queries
# Example (bad):
# users = fetch_users()
# for user in users:
# user.posts = fetch_posts(user.id) # N queries!
# 3. Implement eager loading / batch fetching
# Fetch all related data in one query
# Example (good):
# users = fetch_users_with_posts() # Single join query
# 4. Deploy fix and verify
# Check Jaeger - should see single query instead of N+1
# 5. Monitor latency improvement
# Should see significant reduction in P95/P99 latency
If: Latency increased after deployment
# 1. Verify timing correlation
kubectl rollout history deployment/<deployment-name> -n <namespace>
# 2. Check recent code changes
git log --oneline --since="2 hours ago"
# 3. Rollback deployment
kubectl rollout undo deployment/<deployment-name> -n <namespace>
# 4. Verify latency returns to normal
# Check Grafana - should improve within 5 minutes
# 5. Create incident ticket with details
# - Deployment that caused regression
# - Latency metrics before/after
# - Affected endpoints
# 6. Block deployment until fix is available
# Review code changes to identify performance regression
Escalation Criteria
Escalate to Senior Engineer if:
- Latency >2s (P95) for >15 minutes
- Root cause not identified within 20 minutes
- Rollback does not resolve issue
Escalate to Database Admin if:
- Database queries slow despite proper indexes
- Need to optimize database configuration
- Considering read replicas or sharding
Escalate to Engineering Lead if:
- Latency affecting multiple services
- Need architectural changes (caching layer, async processing)
- Customer complaints or revenue impact
6. CertificateExpiringInSevenDays
Alert Definition:
alert: CertificateExpiringInSevenDays
expr: (certmanager_certificate_expiration_timestamp_seconds - time()) < 604800
for: 1h
severity: critical
Impact: If certificate expires, users will see TLS errors and cannot access services via HTTPS.
Investigation Steps
Step 1: Identify expiring certificate
# List all certificates
kubectl get certificate --all-namespaces
# Check expiring certificates
kubectl get certificate --all-namespaces -o json | \
jq -r '.items[] | select(.status.notAfter != null) |
[.metadata.namespace, .metadata.name, .status.notAfter] | @tsv'
# Example output:
# octollm-monitoring grafana-tls-cert 2025-12-05T10:30:00Z
# octollm-prod api-tls-cert 2025-12-12T14:20:00Z
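Independently of cert-manager's view, you can check the certificate actually being served on the wire with openssl; the domain below is an example endpoint.
DOMAIN=grafana.octollm.dev   # example endpoint
# Show the notBefore/notAfter dates and subject of the live certificate
echo | openssl s_client -connect "$DOMAIN:443" -servername "$DOMAIN" 2>/dev/null \
  | openssl x509 -noout -dates -subject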
Step 2: Check certificate status
kubectl describe certificate -n <namespace> <cert-name>
# Look for:
# Status: Ready
# Renewal Time: (should be set)
# Events: Check for renewal attempts
Step 3: Check cert-manager logs
# Get cert-manager controller pod
kubectl get pods -n cert-manager
# Check logs for renewal attempts
kubectl logs -n cert-manager <cert-manager-pod> | grep <cert-name>
# Look for errors:
# - "rate limit exceeded" (Let's Encrypt)
# - "challenge failed" (DNS/HTTP validation failed)
# - "unable to connect to ACME server"
Step 4: Check ClusterIssuer status
# List ClusterIssuers
kubectl get clusterissuer
# Check issuer details
kubectl describe clusterissuer letsencrypt-prod
# Look for:
# Status: Ready
# ACME account registered: True
Step 5: Check DNS/Ingress for challenge
# For DNS-01 challenge (wildcard certs)
# Verify DNS provider credentials are valid
kubectl get secret -n cert-manager <dns-provider-secret>
# For HTTP-01 challenge
# Verify ingress is accessible
curl -I http://<domain>/.well-known/acme-challenge/test
Remediation Actions
If: Certificate not auto-renewing (cert-manager issue)
# 1. Check cert-manager is running
kubectl get pods -n cert-manager
# 2. If pods are not running, check for issues
kubectl describe pods -n cert-manager <cert-manager-pod>
# 3. Restart cert-manager if needed
kubectl rollout restart deployment -n cert-manager cert-manager
kubectl rollout restart deployment -n cert-manager cert-manager-webhook
kubectl rollout restart deployment -n cert-manager cert-manager-cainjector
# 4. Wait for cert-manager to be ready
kubectl wait --for=condition=ready pod -n cert-manager -l app=cert-manager --timeout=2m
# 5. Trigger manual renewal
kubectl delete certificaterequest -n <namespace> $(kubectl get certificaterequest -n <namespace> -o name)
# 6. Check renewal progress
kubectl describe certificate -n <namespace> <cert-name>
# 7. Monitor events for successful renewal
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | grep -i certificate
If: Let's Encrypt rate limit exceeded
# 1. Check error message in cert-manager logs
kubectl logs -n cert-manager <cert-manager-pod> | grep "rate limit"
# Error example: "too many certificates already issued for: octollm.dev"
# 2. Let's Encrypt limits:
# - 50 certificates per registered domain per week
# - 5 duplicate certificates per week
# 3. Wait for rate limit to reset (1 week)
# No immediate fix - must wait
# 4. Temporary workaround: Use staging issuer
kubectl edit certificate -n <namespace> <cert-name>
# Change issuerRef.name: letsencrypt-prod → letsencrypt-staging
# 5. Staging cert will be issued (browsers will show warning)
# Acceptable for dev/staging, not for prod
# 6. For prod: Request rate limit increase from Let's Encrypt
# Email: limit-increases@letsencrypt.org
# Provide: domain, business justification, expected cert volume
# 7. Long-term: Reduce cert renewals
# Use wildcard certificates to cover multiple subdomains
# Increase cert lifetime (Let's Encrypt is 90 days, cannot change)
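If consolidating onto a wildcard certificate, the cert-manager resource looks roughly like the sketch below. Names and namespace are illustrative, and wildcards require a DNS-01-capable issuer.
kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: octollm-wildcard        # illustrative name
  namespace: octollm-prod
spec:
  secretName: octollm-wildcard-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - "octollm.dev"
    - "*.octollm.dev"
EOF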
If: DNS challenge failing (DNS-01)
# 1. Check DNS provider credentials
kubectl get secret -n cert-manager <dns-provider-secret> -o yaml
# 2. Verify secret has correct keys
# For Google Cloud DNS:
# - key.json (service account key)
# For Cloudflare:
# - api-token
# 3. Test DNS provider access manually
# For Google Cloud DNS:
gcloud dns record-sets list --zone=<zone-name>
# For Cloudflare:
curl -X GET "https://api.cloudflare.com/client/v4/zones" \
-H "Authorization: Bearer <token>"
# 4. If credentials are invalid, update secret
kubectl delete secret -n cert-manager <dns-provider-secret>
kubectl create secret generic -n cert-manager <dns-provider-secret> \
--from-file=key.json=<path-to-new-key>
# 5. Restart cert-manager to pick up new credentials
kubectl rollout restart deployment -n cert-manager cert-manager
# 6. Trigger certificate renewal
kubectl delete certificaterequest -n <namespace> $(kubectl get certificaterequest -n <namespace> -o name)
# 7. Check certificate status
kubectl describe certificate -n <namespace> <cert-name>
If: HTTP challenge failing (HTTP-01)
# 1. Check if ingress is accessible
curl -I http://<domain>/.well-known/acme-challenge/test
# 2. Verify ingress controller is running
kubectl get pods -n ingress-nginx # or kube-system for GKE
# 3. Check if challenge path is reachable
kubectl get ingress -n <namespace>
# 4. Check ingress events
kubectl describe ingress -n <namespace> <ingress-name>
# 5. Verify DNS points to correct load balancer
nslookup <domain>
# Should resolve to ingress load balancer IP
# 6. Check firewall rules allow HTTP (port 80)
# Let's Encrypt requires HTTP for challenge, even for HTTPS certs
gcloud compute firewall-rules list --filter="name~'.*allow-http.*'"
# 7. If firewall blocks HTTP, create allow rule
gcloud compute firewall-rules create allow-http \
--allow tcp:80 \
--source-ranges 0.0.0.0/0
# 8. Retry certificate issuance
kubectl delete certificaterequest -n <namespace> $(kubectl get certificaterequest -n <namespace> -o name)
If: Manual certificate renewal needed (last resort)
# 1. Generate new certificate manually with certbot
certbot certonly --manual --preferred-challenges dns \
-d <domain> -d *.<domain>
# 2. Update DNS TXT record as instructed by certbot
# Wait for DNS propagation (1-5 minutes)
# 3. Complete certbot challenge
# Certbot will save certificate to /etc/letsencrypt/live/<domain>/
# 4. Create Kubernetes secret with new certificate
kubectl create secret tls <cert-name> -n <namespace> \
--cert=/etc/letsencrypt/live/<domain>/fullchain.pem \
--key=/etc/letsencrypt/live/<domain>/privkey.pem
# 5. Update ingress to use new secret
kubectl edit ingress -n <namespace> <ingress-name>
# Verify spec.tls[].secretName matches new secret name
# 6. Verify HTTPS is working
curl -I https://<domain>
# 7. Fix cert-manager issue to prevent manual renewals in future
# This is a temporary workaround only!
Escalation Criteria
Escalate to Senior Engineer if:
- Certificate expires in <3 days and not renewing
- cert-manager issues persist after restart
- DNS provider integration broken
Escalate to Engineering Lead if:
- Certificate expires in <24 hours
- Multiple certificates failing to renew
- Need to switch certificate provider
Escalate to VP Engineering + Legal if:
- Production certificate expired (causing outage)
- Customer data exposure risk due to TLS issues
- Need to purchase commercial certificates (e.g., DigiCert)
Warning Alert Procedures
7. HighNodeCPUUsage
Alert Definition:
alert: HighNodeCPUUsage
expr: (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)) > 0.80
for: 10m
severity: warning
Impact: Node under high load. May affect performance. Pods may be throttled.
Investigation Steps
- Identify affected node
kubectl top nodes
- Check pod CPU usage on the node
kubectl top pods --all-namespaces --field-selector spec.nodeName=<node-name> --sort-by=cpu
- Check for CPU-intensive processes
# Use metrics in Grafana
# Query: topk(10, rate(container_cpu_usage_seconds_total{node="<node-name>"}[5m]))
Remediation Actions
Option 1: Scale application horizontally
# Add more replicas to distribute load
kubectl scale deployment/<deployment-name> -n <namespace> --replicas=<new-count>
# Or enable HPA
kubectl autoscale deployment/<deployment-name> -n <namespace> \
--cpu-percent=70 --min=2 --max=10
Option 2: Increase node CPU limits
# Edit deployment to increase CPU limits
kubectl edit deployment -n <namespace> <deployment-name>
# Increase resources.limits.cpu
Option 3: Add more nodes to cluster
# For GKE, resize node pool
gcloud container clusters resize <cluster-name> \
--node-pool=<pool-name> \
--num-nodes=<new-count> \
--zone=<zone>
Escalation Criteria
- Escalate if CPU >90% for >30 minutes
- Escalate if performance degradation reported by users
8. HighNodeMemoryUsage
Alert Definition:
alert: HighNodeMemoryUsage
expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.85
for: 10m
severity: warning
Impact: Node running out of memory. May trigger OOM kills.
Investigation Steps
- Identify affected node
kubectl top nodes
- Check pod memory usage on the node
kubectl top pods --all-namespaces --field-selector spec.nodeName=<node-name> --sort-by=memory
- Check for memory leaks
# Use Grafana to view memory trends
# Query: container_memory_usage_bytes{node="<node-name>"}
# Look for steadily increasing memory over time
Remediation Actions
Option 1: Restart memory-leaking pods
kubectl delete pod -n <namespace> <pod-name>
# Or rollout restart
kubectl rollout restart deployment/<deployment-name> -n <namespace>
Option 2: Increase memory limits
kubectl edit deployment -n <namespace> <deployment-name>
# Increase resources.limits.memory
Option 3: Scale horizontally
kubectl scale deployment/<deployment-name> -n <namespace> --replicas=<new-count>
Escalation Criteria
- Escalate if memory >95% for >15 minutes
- Escalate if OOMKilled events detected
9. HighRequestLatency
Alert Definition:
alert: HighRequestLatency
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1.0
for: 10m
severity: warning
Impact: Slow responses. Users experiencing delays.
See detailed procedure in Critical Alert #5 (HighLatency) - same investigation and remediation steps apply.
10. PodOOMKilled
Alert Definition:
alert: PodOOMKilled
expr: kube_pod_container_status_terminated_reason{reason="OOMKilled"} > 0
for: 1m
severity: warning
Impact: Container killed due to out-of-memory. Service may be unavailable briefly.
Investigation Steps
- Identify OOMKilled pod
kubectl get pods --all-namespaces -o json | \
jq -r '.items[] | select(.status.containerStatuses[]?.lastState.terminated.reason == "OOMKilled") |
[.metadata.namespace, .metadata.name] | @tsv'
- Check memory limits
kubectl get pod -n <namespace> <pod-name> -o jsonpath='{.spec.containers[0].resources}'
- Check memory usage before OOM
# Query in Grafana:
# container_memory_usage_bytes{pod="<pod-name>"}
Remediation Actions
Increase memory limits
kubectl edit deployment -n <namespace> <deployment-name>
# Increase resources.limits.memory (e.g., 512Mi → 1Gi)
Check for memory leaks
# If memory increases steadily over time, likely a leak
# Enable heap profiling and investigate
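If the suspect service is a Go binary with net/http/pprof enabled (an assumption; adapt to whatever profiler the service actually exposes), a heap snapshot can be pulled for offline analysis:
# Forward the pprof port (6060 is the conventional pprof port; adjust if different)
kubectl port-forward -n <namespace> <pod-name> 6060:6060 &
PF_PID=$!
sleep 2
# Grab a heap snapshot; inspect later with: go tool pprof heap.pprof
curl -s "http://127.0.0.1:6060/debug/pprof/heap" -o heap.pprof
kill "$PF_PID"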
Escalation Criteria
- Escalate if OOMKilled repeatedly (>3 times in 1 hour)
- Escalate if memory leak suspected
11. PersistentVolumeClaimPending
Alert Definition:
alert: PersistentVolumeClaimPending
expr: kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1
for: 5m
severity: warning
Impact: Pod cannot start due to unbound PVC. Service may be unavailable.
Investigation Steps
- Identify pending PVC
kubectl get pvc --all-namespaces | grep Pending
- Check PVC details
kubectl describe pvc -n <namespace> <pvc-name>
- Check storage class
kubectl get storageclass
kubectl describe storageclass <storage-class-name>
Remediation Actions
If: No storage class exists
# Create storage class (example for GKE)
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
EOF
# Update PVC to use storage class
kubectl edit pvc -n <namespace> <pvc-name>
# Set storageClassName: fast-ssd
If: Storage quota exceeded
# Check quota
kubectl get resourcequota -n <namespace>
# Increase quota if needed
kubectl edit resourcequota -n <namespace> <quota-name>
If: Node affinity preventing binding
# Check if PV has node affinity that doesn't match any node
kubectl get pv | grep Available
kubectl describe pv <pv-name>
# May need to delete PV and recreate without affinity
Escalation Criteria
- Escalate if PVC pending for >15 minutes
- Escalate if quota increase needed
12. DeploymentReplicasMismatch
Alert Definition:
alert: DeploymentReplicasMismatch
expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
for: 15m
severity: warning
Impact: Deployment not at desired replica count. May affect availability or capacity.
Investigation Steps
- Identify affected deployment
kubectl get deployments --all-namespaces
# Look for deployments where READY != DESIRED
- Check pod status
kubectl get pods -n <namespace> -l app=<deployment-name>
- Check for pod errors
kubectl describe pod -n <namespace> <pod-name>
Remediation Actions
If: Pods pending due to resources
# Check pending reason
kubectl describe pod -n <namespace> <pod-name> | grep -A 5 Events
# If "Insufficient cpu" or "Insufficient memory":
# - Add more nodes, or
# - Reduce resource requests
If: Image pull error
# Fix image name or credentials
kubectl set image deployment/<deployment-name> <container>=<correct-image> -n <namespace>
If: Pods crashing
# See PodCrashLoopBackOff procedure (Critical Alert #1)
Escalation Criteria
- Escalate if mismatch persists for >30 minutes
- Escalate if related to resource capacity issues
13. LowCacheHitRate
Alert Definition:
alert: LowCacheHitRate
expr: redis_keyspace_hits_total / (redis_keyspace_hits_total + redis_keyspace_misses_total) < 0.50
for: 15m
severity: warning
Impact: Increased latency and load on database due to cache misses.
Investigation Steps
- Check cache hit rate in Grafana
# Query: redis_keyspace_hits_total / (redis_keyspace_hits_total + redis_keyspace_misses_total)
- Check cache size and memory
kubectl exec -it -n <namespace> <redis-pod> -- redis-cli INFO memory
- Check cache eviction rate
kubectl exec -it -n <namespace> <redis-pod> -- redis-cli INFO stats | grep evicted_keys
Remediation Actions
If: Cache too small (frequent evictions)
# Increase Redis memory
kubectl edit statefulset -n <namespace> redis
# Increase resources.limits.memory
# Restart Redis
kubectl delete pod -n <namespace> <redis-pod>
If: Cache TTL too short
# Increase TTL in application config
kubectl edit configmap -n <namespace> <service>-config
# Increase CACHE_TTL value
# Restart service
kubectl rollout restart deployment/<deployment-name> -n <namespace>
If: Data access patterns changed
# Implement cache warming
# Pre-populate cache with frequently accessed data
# Adjust cache strategy (e.g., cache-aside vs. write-through)
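A minimal cache-warming sketch: replay the hottest read requests against the service so it repopulates the cache itself. The endpoint paths below are placeholders; substitute the endpoints that dominate your miss traffic.
SERVICE_URL="http://<service>.<namespace>.svc.cluster.local"
for path in /api/v1/models /api/v1/config /api/v1/popular-items; do   # placeholder hot paths
  # Print the status code and path so failed warm-up requests stand out
  printf '%s %s\n' "$(curl -s -o /dev/null -w '%{http_code}' "${SERVICE_URL}${path}")" "$path"
done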
Escalation Criteria
- Escalate if hit rate <30% for >1 hour
- Escalate if causing user-facing latency issues
Informational Alert Procedures
14. NewDeploymentDetected
Alert Definition:
alert: NewDeploymentDetected
expr: changes(kube_deployment_status_observed_generation[5m]) > 0
severity: info
Impact: Informational. No immediate action required.
Actions
- Verify deployment in kubectl
kubectl rollout status deployment/<deployment-name> -n <namespace>
- Monitor for related alerts (errors, crashes, latency)
# Check Alertmanager for any new critical/warning alerts
- Document in change log if significant deployment
15. HPAScaledUp / HPAScaledDown
Alert Definition:
alert: HPAScaledUp
expr: changes(kube_horizontalpodautoscaler_status_current_replicas[5m]) > 0
severity: info
Impact: Informational. HPA adjusted replica count based on load.
Actions
- Verify scaling event in Grafana
# Query: kube_horizontalpodautoscaler_status_current_replicas{hpa="<hpa-name>"}
- Check if scaling is expected (e.g., during peak hours)
- If scaling too frequent, adjust HPA thresholds:
kubectl edit hpa -n <namespace> <hpa-name>
# Adjust targetCPUUtilizationPercentage
16. ConfigMapChanged
Alert Definition:
alert: ConfigMapChanged
expr: changes(kube_configmap_info[5m]) > 0
severity: info
Impact: Informational. ConfigMap updated.
Actions
- Identify changed ConfigMap
kubectl get configmap --all-namespaces --sort-by=.metadata.creationTimestamp
- Verify change was intentional
- Restart pods if needed to pick up new config:
kubectl rollout restart deployment/<deployment-name> -n <namespace>
Multi-Alert Scenarios
Scenario 1: Multiple Pods Crashing + Node NotReady
Symptoms:
- Alert: PodCrashLoopBackOff (multiple pods)
- Alert: NodeNotReady (1 node)
Root Cause: Node failure causing all pods on that node to crash.
Investigation:
- Identify which pods are on the failing node
- Check node status (see NodeNotReady procedure)
Remediation:
- Cordon and drain the failing node
- Pods will be rescheduled to healthy nodes
- Replace the failed node
Scenario 2: High Error Rate + Database Connection Pool Exhausted
Symptoms:
- Alert: HighErrorRate (>10% 5xx errors)
- Alert: DatabaseConnectionPoolExhausted (>95% pool usage)
Root Cause: Connection pool exhaustion causing service errors.
Investigation:
- Check if error rate corresponds to pool exhaustion timing
- Check for long-running database queries
Remediation:
- Restart service to release connections
- Increase connection pool size
- Optimize slow queries
Scenario 3: High Latency + Low Cache Hit Rate + High Database Load
Symptoms:
- Alert: HighLatency (P95 >1s)
- Alert: LowCacheHitRate (<50%)
- Observation: High database CPU
Root Cause: Cache ineffectiveness causing excessive database load and slow queries.
Investigation:
- Check cache hit rate timeline
- Check database query volume
- Identify cache misses by key pattern
Remediation:
- Increase cache size
- Increase cache TTL
- Implement cache warming for common queries
- Add database indexes for frequent queries
Escalation Decision Trees
Decision Tree 1: Service Outage
Service completely unavailable (100% error rate)?
├─ YES → CRITICAL - Page on-call engineer
│ ├─ Multiple services down?
│ │ ├─ YES → Page Engineering Lead + VP Eng
│ │ └─ NO → Continue troubleshooting
│ └─ Customer-reported on social media?
│ ├─ YES → Notify VP Eng + Customer Success
│ └─ NO → Continue troubleshooting
└─ NO → Check error rate
├─ >50% error rate?
│ ├─ YES → Page on-call engineer
│ └─ NO → Assign to on-call engineer (Slack)
└─ <10% error rate?
└─ YES → Create ticket, no immediate page
Decision Tree 2: Performance Degradation
Users reporting slow performance?
├─ YES → Check latency metrics
│ ├─ P95 >2s?
│ │ ├─ YES → CRITICAL - Page on-call engineer
│ │ └─ NO → Assign to on-call engineer
│ └─ P95 >1s but <2s?
│ ├─ YES → WARNING - Notify on-call engineer (Slack)
│ └─ NO → Create ticket for investigation
└─ NO → Proactive monitoring
└─ P95 >1s for >15m?
├─ YES → Investigate proactively
└─ NO → Continue monitoring
Decision Tree 3: Infrastructure Issue
Node or infrastructure alert?
├─ NodeNotReady?
│ ├─ Single node?
│ │ ├─ YES → Cordon, drain, replace
│ │ └─ NO → Multiple nodes - Page Engineering Lead
│ └─ >30% of nodes affected?
│ └─ YES → CRITICAL - Page VP Eng + GCP Support
└─ Disk/Memory pressure?
├─ Can be resolved with cleanup?
│ ├─ YES → Clean up and monitor
│ └─ NO → Page on-call engineer for node replacement
Post-Incident Actions
After Resolving Critical Alerts
- Document resolution in incident tracker
- Root cause
- Actions taken
- Time to resolution
- Services affected
- Create post-incident review (PIR) for critical incidents
- Timeline of events
- Impact assessment
- Contributing factors
- Action items to prevent recurrence
- Update runbooks if new issue discovered
- Add new troubleshooting steps
- Update remediation procedures
- Document lessons learned
- Implement preventive measures
- Add monitoring for early detection
- Improve alerting thresholds
- Automate remediation where possible
- Communicate to stakeholders
- Internal: Engineering team, leadership
- External: Customers (if user-impacting)
- Status page update
Post-Incident Review Template
# Post-Incident Review: <Incident Title>
**Date**: YYYY-MM-DD
**Severity**: Critical / Warning
**Duration**: X hours Y minutes
**Services Affected**: <list>
## Summary
<1-2 sentence summary of incident>
## Timeline
| Time (UTC) | Event |
|------------|-------|
| 14:00 | Alert triggered: HighErrorRate |
| 14:05 | On-call engineer acknowledged |
| 14:10 | Root cause identified: database connection pool exhausted |
| 14:15 | Mitigation applied: restarted service |
| 14:20 | Incident resolved: error rate returned to normal |
## Root Cause
<Detailed explanation of what caused the incident>
## Impact
- **User Impact**: X% of requests resulted in errors
- **Revenue Impact**: $Y estimated lost revenue
- **Duration**: X hours Y minutes
## Resolution
<What was done to resolve the incident>
## Contributing Factors
1. Factor 1
2. Factor 2
## Action Items
1. [ ] Increase connection pool size (Owner: @engineer, Due: YYYY-MM-DD)
2. [ ] Add alert for connection pool usage (Owner: @engineer, Due: YYYY-MM-DD)
3. [ ] Update runbook with new procedure (Owner: @engineer, Due: YYYY-MM-DD)
## Lessons Learned
- What went well
- What could be improved
- What we learned
Summary
This alert response procedures document provides detailed, step-by-step guidance for responding to all alerts in the OctoLLM monitoring system. Key points:
- Critical alerts require immediate action (acknowledge within 5 minutes, resolve within 1 hour)
- Warning alerts require timely action (acknowledge within 30 minutes, resolve within 4 hours)
- Info alerts are informational and require no immediate action
Each procedure includes:
- Alert definition and impact
- Investigation steps with commands
- Remediation actions with code examples
- Escalation criteria
For all incidents:
- Follow the general response workflow (acknowledge → assess → investigate → remediate → document → close)
- Use the escalation decision trees to determine when to involve senior engineers or leadership
- Complete post-incident reviews for critical incidents
- Update runbooks with lessons learned
Related Documents:
- Monitoring Runbook: /home/parobek/Code/OctoLLM/docs/operations/monitoring-runbook.md
- Deployment Guide: /home/parobek/Code/OctoLLM/docs/deployment-guide.md
- Backup and Restore: /home/parobek/Code/OctoLLM/docs/operations/backup-restore.md