Troubleshooting Playbooks
Purpose: Step-by-step procedures for diagnosing and resolving common OctoLLM issues
Audience: Operations engineers, SREs, on-call responders
Prerequisites: Access to logs, metrics, and deployment environment
Overview
This document provides systematic troubleshooting procedures for common OctoLLM issues. Each playbook follows a structured format:
- Symptoms - How to recognize the problem
- Diagnosis - Steps to identify root cause
- Resolution - How to fix the issue
- Prevention - How to avoid recurrence
Table of Contents
- Service Unavailable
- High Latency
- Database Connection Issues
- Memory Leaks
- Task Routing Failures
- LLM API Failures
- Cache Performance Issues
- Resource Exhaustion
- Security Violations
- Data Corruption
Service Unavailable
Symptoms
- HTTP 503 responses from API
- Health check failures
- No response from service endpoints
- Alert: `ServiceDown` or `ArmDown`
Diagnosis
Step 1: Check service status
# Docker Compose
docker compose ps
# Kubernetes
kubectl get pods -n octollm
kubectl describe pod <pod-name> -n octollm
Step 2: Check container logs
# Docker Compose
docker compose logs --tail=100 orchestrator
# Kubernetes
kubectl logs <pod-name> -n octollm --tail=100
Step 3: Check resource usage
# Docker
docker stats
# Kubernetes
kubectl top pods -n octollm
kubectl describe node <node-name>
Step 4: Check dependencies
# Verify database connections
docker compose exec orchestrator nc -zv postgres 5432
docker compose exec orchestrator nc -zv redis 6379
docker compose exec orchestrator nc -zv qdrant 6333
# Check database health
docker compose exec postgres pg_isready -U octollm
docker compose exec redis redis-cli ping
Resolution
Scenario A: Container crashed
# Check exit code and restart
docker compose ps
docker compose logs <service>
docker compose restart <service>
# Kubernetes
kubectl get pods -n octollm
kubectl logs <pod-name> -n octollm --previous
kubectl delete pod <pod-name> -n octollm # Force restart
Scenario B: Out of memory
# Increase memory limits
# In .env for Docker Compose:
ORCHESTRATOR_MEMORY_LIMIT=8g
# In Kubernetes:
kubectl edit deployment orchestrator -n octollm
# Update resources.limits.memory to higher value
# Restart service
docker compose up -d orchestrator
# or
kubectl rollout restart deployment orchestrator -n octollm
Scenario C: Database connection failure
# Restart database
docker compose restart postgres
# Verify connectivity
docker compose exec orchestrator ping postgres
# Check network
docker network inspect octollm_octollm-network
# Kubernetes: Check network policies
kubectl get networkpolicies -n octollm
Scenario D: Configuration error
# Validate environment variables
docker compose config
# Check configuration in running container
docker compose exec orchestrator env | grep POSTGRES
# Fix configuration in .env and restart
docker compose up -d orchestrator
Prevention
- Set up health checks: Ensure all services have proper liveness/readiness probes
- Resource reservations: Set CPU/memory requests and limits
- Monitoring: Alert on service availability (ServiceDown alert)
- Auto-restart: Use `restart: unless-stopped` in Docker Compose (see the sketch below)
- Pod Disruption Budgets: Ensure minimum replicas in Kubernetes
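A minimal Docker Compose sketch combining the restart policy and a healthcheck, assuming the orchestrator serves /health on port 8000 as elsewhere in this guide (and that curl is available in the image):
services:
  orchestrator:
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-sf", "http://localhost:8000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 15s  # grace period before failures count against the container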
High Latency
Symptoms
- Slow API responses (>5 seconds)
- Task processing delays
- Timeouts from clients
- Alert: `HighRequestLatency`
Diagnosis
Step 1: Identify slow endpoints
# Query Prometheus for P95 latency by endpoint
curl -G 'http://localhost:9090/api/v1/query' \
--data-urlencode 'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))'
# Check Grafana dashboard for latency breakdown
Step 2: Check resource utilization
# CPU usage
docker stats
# or
kubectl top pods -n octollm
# Memory pressure
free -h
# or
kubectl describe node <node-name>
Step 3: Identify bottlenecks
# Check database query performance
docker compose exec postgres psql -U octollm -c "
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;"
# Check Redis performance
docker compose exec redis redis-cli --latency
# Check LLM API latency
# Review metrics: llm_api_duration_seconds
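If the pg_stat_statements query above errors, the extension is likely not enabled: the module must be listed in shared_preload_libraries (a restart-only setting) and the extension created in the database:
-- Load the module at server start (requires a PostgreSQL restart)
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';
-- After the restart, create the extension in the target database
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;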
Step 4: Profile application
# Python profiling (add to orchestrator temporarily)
python -m cProfile -o profile.stats app/main.py
# View profile
python -m pstats profile.stats
> sort cumtime
> stats 20
Resolution
Scenario A: Database slow queries
-- Add missing indexes
CREATE INDEX CONCURRENTLY idx_tasks_created_at ON tasks(created_at);
CREATE INDEX CONCURRENTLY idx_entities_type ON entities(entity_type);
-- Optimize frequently accessed queries
EXPLAIN ANALYZE SELECT * FROM tasks WHERE status = 'pending';
-- Update statistics
ANALYZE tasks;
VACUUM ANALYZE;
Scenario B: LLM API latency
# Implement request batching
# In orchestrator/app/services/llm_client.py
async def batch_requests(self, requests: List[Request]) -> List[Response]:
    """Batch multiple LLM requests into a single API call (a method on the LLM client)"""
combined_prompt = "\n---\n".join([r.prompt for r in requests])
response = await self.client.chat.completions.create(
model=self.model,
messages=[{"role": "user", "content": combined_prompt}]
)
# Split and return individual responses
return parse_batch_response(response)
# Implement caching for repeated queries
import hashlib
async def get_llm_response(prompt: str) -> str:
# Check Redis cache first
cache_key = f"llm:{hashlib.md5(prompt.encode()).hexdigest()}"
cached = await redis_client.get(cache_key)
if cached:
cache_hits_total.labels(cache_type="llm").inc()
return cached
# Make API call
response = await llm_client.generate(prompt)
# Cache for 1 hour
await redis_client.setex(cache_key, 3600, response)
return response
Scenario C: Resource contention
# Scale horizontally (Kubernetes)
kubectl scale deployment orchestrator --replicas=4 -n octollm
# Docker Compose: Update docker-compose.yml
services:
orchestrator:
deploy:
replicas: 3
# Scale vertically: Increase CPU/memory
kubectl edit deployment orchestrator -n octollm
# Update resources.limits
Scenario D: Network latency
# Check network latency between services
docker compose exec orchestrator time curl -s http://planner-arm:8100/health
# Optimize service communication
# Use connection pooling
# Implement circuit breakers
# Add retry logic with exponential backoff
Prevention
- Connection pooling: Configure database connection pools
- Caching strategy: Cache frequently accessed data
- Query optimization: Add indexes, optimize N+1 queries
- Request batching: Batch LLM API requests
- Rate limiting: Prevent resource exhaustion
- Horizontal scaling: Use auto-scaling based on metrics
Database Connection Issues
Symptoms
- Connection refused errors
- Connection timeout
- `psycopg2.OperationalError` or `ConnectionError` exceptions
- Alert: `PostgreSQLDown` or `HighDatabaseConnections`
Diagnosis
Step 1: Verify database is running
# Check database status
docker compose ps postgres
docker compose exec postgres pg_isready -U octollm
# Kubernetes
kubectl get pods -l app=postgres -n octollm
kubectl logs -l app=postgres -n octollm
Step 2: Check connection limits
# Check current connections
docker compose exec postgres psql -U octollm -c "
SELECT count(*) as current_connections,
(SELECT setting::int FROM pg_settings WHERE name='max_connections') as max_connections
FROM pg_stat_activity;"
# View active connections
docker compose exec postgres psql -U octollm -c "
SELECT pid, usename, application_name, client_addr, state, query
FROM pg_stat_activity
WHERE state != 'idle';"
Step 3: Test connectivity
# From orchestrator container
docker compose exec orchestrator nc -zv postgres 5432
# Manual connection test
docker compose exec orchestrator psql -h postgres -U octollm -d octollm -c "SELECT 1;"
Step 4: Check network configuration
# Docker network
docker network inspect octollm_octollm-network
# Kubernetes network policy
kubectl describe networkpolicy -n octollm
Resolution
Scenario A: Connection pool exhausted
# Increase pool size in orchestrator/app/database/connection.py
from sqlalchemy.ext.asyncio import create_async_engine
engine = create_async_engine(
DATABASE_URL,
pool_size=20, # Increased from 5
max_overflow=40, # Increased from 10
pool_timeout=30,
pool_recycle=3600,
pool_pre_ping=True, # Verify connections before use
)
Scenario B: Too many open connections
# Increase max_connections in PostgreSQL
# (max_connections is a restart-only setting; pg_reload_conf() will not apply it)
docker compose exec postgres psql -U octollm -c "ALTER SYSTEM SET max_connections = 200;"
docker compose restart postgres
# Or update postgresql.conf directly, then restart
echo "max_connections = 200" >> data/postgres/postgresql.conf
docker compose restart postgres
Scenario C: Connection leak
# Fix connection leaks - always use context managers
# Bad (connection leak):
conn = await pool.acquire()
result = await conn.fetch("SELECT * FROM tasks")
# conn never released!
# Good (automatic cleanup):
async with pool.acquire() as conn:
result = await conn.fetch("SELECT * FROM tasks")
# conn automatically released
Scenario D: Network partition
# Docker: Recreate network
docker compose down
docker network prune
docker compose up -d
# Kubernetes: Check DNS resolution
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup postgres.octollm.svc.cluster.local
# Verify network policies allow traffic
kubectl get networkpolicies -n octollm
Prevention
- Connection pooling: Always use connection pools
- Context managers: Use `async with` for automatic cleanup
- Health checks: Monitor database connection count
- Graceful shutdown: Close connections on service shutdown (see the sketch below)
- Connection timeout: Set reasonable timeout values
- Monitoring: Alert on high connection count
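For the graceful-shutdown item, a minimal sketch in this guide's FastAPI event style; engine is the async engine from Scenario A, and redis_client is an assumed redis.asyncio client:
from fastapi import FastAPI

app = FastAPI()

@app.on_event("shutdown")
async def close_connections():
    # Dispose the SQLAlchemy pool so checked-out connections are closed
    await engine.dispose()
    # Close the (assumed) redis.asyncio client as well
    await redis_client.aclose()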
Memory Leaks
Symptoms
- Gradual memory increase over time
- OOMKilled pod restarts (Kubernetes)
- Swap usage increasing
- Alert: `HighMemoryUsage`
Diagnosis
Step 1: Identify leaking service
# Monitor memory over time
docker stats
# Kubernetes (kubectl top has no --watch flag, so poll with watch(1))
watch kubectl top pods -n octollm
# Check for OOMKilled containers
kubectl get pods -n octollm -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
Step 2: Profile memory usage
# Add memory profiling to the orchestrator
# Install: pip install memory-profiler
# (@profile wraps sync functions; for async paths prefer the tracemalloc sketch below)
from memory_profiler import profile
@profile
async def process_task(task_id: str):
# Function code
pass
# Run with:
# python -m memory_profiler app/main.py
# Track object counts
import gc
def get_memory_usage():
"""Get current memory usage details"""
gc.collect()
object_counts = {}
for obj in gc.get_objects():
obj_type = type(obj).__name__
object_counts[obj_type] = object_counts.get(obj_type, 0) + 1
# Sort by count
sorted_counts = sorted(object_counts.items(), key=lambda x: x[1], reverse=True)
return sorted_counts[:20] # Top 20 object types
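As a dependency-free alternative, the stdlib tracemalloc module can diff heap snapshots to show which call sites are growing:
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# ... exercise the suspected leaky code path here ...

after = tracemalloc.take_snapshot()
# Top 10 call sites by allocation growth between the snapshots
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)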
Step 3: Check for common leak patterns
# 1. Unclosed connections
# BAD:
client = httpx.AsyncClient()
await client.get("http://example.com")
# client never closed!
# GOOD:
async with httpx.AsyncClient() as client:
await client.get("http://example.com")
# 2. Growing caches
# BAD:
cache = {} # Unbounded cache
cache[key] = value # Grows forever
# GOOD:
from cachetools import TTLCache
cache = TTLCache(maxsize=1000, ttl=3600)
# 3. Event listener leaks
# BAD:
emitter.on("event", handler) # Handler never removed
# GOOD:
emitter.on("event", handler)
# ... later:
emitter.off("event", handler)
Resolution
Scenario A: Unbounded cache
# Replace unbounded cache with TTL cache
# Before:
result_cache = {} # Grows indefinitely
# After:
from cachetools import TTLCache
result_cache = TTLCache(
maxsize=10000, # Max 10k items
ttl=3600 # 1 hour TTL
)
# Or use Redis with expiration
await redis_client.setex(key, 3600, value)
Scenario B: Connection leaks
# Audit all HTTP clients and database connections
# Create reusable client
from fastapi import FastAPI
import httpx
app = FastAPI()
@app.on_event("startup")
async def startup():
app.state.http_client = httpx.AsyncClient(
timeout=10.0,
limits=httpx.Limits(max_keepalive_connections=20)
)
@app.on_event("shutdown")
async def shutdown():
await app.state.http_client.aclose()
# Use shared client
async def call_arm(request):
client = app.state.http_client
response = await client.post("http://arm/execute", json=request)
return response
Scenario C: Large object retention
# Clear large objects after use
import gc

async def process_large_dataset(data):
# Process data
result = expensive_operation(data)
# Explicitly clear references
del data
gc.collect()
return result
# Use generators for large sequences
# BAD: load every task into memory at once
def iterate_all_tasks():
    tasks = Task.query.all()  # Could be millions of rows
    for task in tasks:
        yield process(task)

# GOOD: paginate so only one page is in memory at a time
def iterate_tasks():
    page = 0
    while True:
        tasks = Task.query.limit(100).offset(page * 100).all()
        if not tasks:
            break
        for task in tasks:
            yield process(task)
        page += 1
Scenario D: Circular references
# Break circular references
# Problematic:
class Task:
def __init__(self):
self.subtasks = []
class SubTask:
def __init__(self, parent):
self.parent = parent # Circular reference
parent.subtasks.append(self)
# Fix with weak references:
import weakref
class SubTask:
def __init__(self, parent):
self.parent = weakref.ref(parent) # Weak reference
parent.subtasks.append(self)
def get_parent(self):
return self.parent() # De-reference
Prevention
- Use context managers: For all resources (files, connections, clients)
- Bounded caches: Use TTLCache or LRU with size limits
- Weak references: For parent-child relationships
- Regular profiling: Run memory profiler in staging
- Resource limits: Set memory limits to catch leaks early
- Monitoring: Track memory usage over time
Task Routing Failures
Symptoms
- Tasks stuck in "pending" state
- No appropriate arm found for task
- Routing scores all zero
- Tasks timing out
Diagnosis
Step 1: Check task details
# View task in database
docker compose exec postgres psql -U octollm -c "
SELECT task_id, goal, status, created_at, updated_at
FROM tasks
WHERE task_id = 'task-123';"
# Check task routing history
docker compose exec postgres psql -U octollm -c "
SELECT * FROM action_log
WHERE task_id = 'task-123'
ORDER BY timestamp DESC;"
Step 2: Verify arm availability
# Check arm health
for port in 8100 8101 8102 8103 8104 8105; do
echo -n "Port $port: "
curl -sf http://localhost:$port/health && echo "✓" || echo "✗"
done
# Check arm capabilities
curl http://localhost:8100/capabilities | jq
Step 3: Check orchestrator routing logic
# Enable debug logging
# In .env:
LOG_LEVEL=debug
docker compose restart orchestrator
# View routing decisions
docker compose logs -f orchestrator | grep -i "routing"
Step 4: Test routing manually
# In orchestrator container (python -m asyncio starts a REPL that supports top-level await)
docker compose exec orchestrator python -m asyncio
from app.services.router import ArmRouter
from app.models.task import TaskContract
router = ArmRouter()
task = TaskContract(
goal="Write a Python function",
constraints=[],
priority="medium"
)
scores = await router.score_arms(task)
print(scores)
Resolution
Scenario A: All arms down
# Restart arms
docker compose restart planner-arm executor-arm coder-arm judge-arm guardian-arm retriever-arm
# Kubernetes
kubectl rollout restart deployment -l app-type=arm -n octollm
Scenario B: Routing scoring issues
# Fix routing algorithm in orchestrator/app/services/router.py
async def score_arms(self, task: TaskContract) -> Dict[str, float]:
"""Score arms based on task requirements"""
scores = {}
for arm_name, arm_capability in self.registered_arms.items():
score = 0.0
# Check keyword matching
task_keywords = extract_keywords(task.goal.lower())
arm_keywords = arm_capability.keywords
keyword_matches = len(set(task_keywords) & set(arm_keywords))
score += keyword_matches * 10
# Check domain match
if arm_capability.domain in task.goal.lower():
score += 50
# Penalize if arm is unhealthy
if not await self.is_arm_healthy(arm_name):
score = 0
scores[arm_name] = score
# If no scores, default to planner
if all(s == 0 for s in scores.values()):
scores["planner"] = 100
return scores
Scenario C: Capabilities not registered
# Ensure arms register capabilities on startup
# In each arm's app/main.py
@app.on_event("startup")
async def register_with_orchestrator():
"""Register arm capabilities with orchestrator"""
capability = ArmCapability(
name="planner-arm",
domain="planning",
keywords=["plan", "decompose", "break down", "steps"],
url=f"http://{os.getenv('HOSTNAME')}:8100"
)
async with httpx.AsyncClient() as client:
response = await client.post(
"http://orchestrator:8000/api/v1/arms/register",
json=capability.dict()
)
if response.status_code != 200:
logger.error("Failed to register with orchestrator", error=response.text)
else:
logger.info("Successfully registered with orchestrator")
Scenario D: Task constraints too strict
# Relax constraints if no match found
async def route_task(self, task: TaskContract) -> str:
"""Route task to best arm"""
scores = await self.score_arms(task)
max_score_arm = max(scores, key=scores.get)
max_score = scores[max_score_arm]
# If no good match, try relaxing constraints
if max_score < 10:
logger.warning(
"No good arm match, relaxing constraints",
task_id=task.task_id,
original_goal=task.goal
)
# Remove optional constraints
task.constraints = [c for c in task.constraints if "must" in c.lower()]
# Re-score
scores = await self.score_arms(task)
max_score_arm = max(scores, key=scores.get)
return max_score_arm
Prevention
- Health checks: Ensure all arms have health endpoints
- Registration: Auto-register arms on startup
- Fallback routing: Always have a default arm (planner)
- Monitoring: Track routing failures
- Testing: Test routing logic with various task types (see the sketch below)
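A sketch of such a routing test, assuming pytest-asyncio and the ArmRouter/TaskContract classes used in Step 4 above; the expected fallback score matches the score_arms example:
import pytest
from app.services.router import ArmRouter
from app.models.task import TaskContract

@pytest.mark.asyncio
async def test_routing_falls_back_to_planner():
    router = ArmRouter()
    # A goal matching no arm keywords should trigger the planner fallback
    task = TaskContract(goal="zzz unmatched goal", constraints=[], priority="low")
    scores = await router.score_arms(task)
    assert scores["planner"] == 100  # fallback score from score_arms above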
LLM API Failures
Symptoms
- 429 Too Many Requests errors
- 503 Service Unavailable from LLM provider
- Authentication errors
- Timeout errors
- Alert: `HighLLMAPIErrorRate`
Diagnosis
Step 1: Check LLM API metrics
# Query Prometheus
curl -G 'http://localhost:9090/api/v1/query' \
--data-urlencode 'query=rate(llm_api_calls_total{status="error"}[5m])'
# Check error logs
docker compose logs orchestrator | grep -i "llm.*error"
Step 2: Verify API key
# Test API key manually
curl https://api.openai.com/v1/models \
-H "Authorization: Bearer $OPENAI_API_KEY"
# Check key in environment
docker compose exec orchestrator env | grep OPENAI_API_KEY
Step 3: Check rate limiting
# View rate limit headers from last request
docker compose logs orchestrator | grep -i "rate.*limit"
# Check current request rate
curl -G 'http://localhost:9090/api/v1/query' \
--data-urlencode 'query=rate(llm_api_calls_total[1m]) * 60'
Resolution
Scenario A: Rate limiting (429 errors)
# Implement exponential backoff with jitter
import asyncio
import random
import httpx
from tenacity import (
retry,
stop_after_attempt,
wait_exponential,
retry_if_exception_type
)
@retry(
retry=retry_if_exception_type(httpx.HTTPStatusError),
wait=wait_exponential(multiplier=1, min=4, max=60),
stop=stop_after_attempt(5)
)
async def call_llm_api(prompt: str) -> str:
"""Call LLM API with exponential backoff"""
async with httpx.AsyncClient() as client:
response = await client.post(
"https://api.openai.com/v1/chat/completions",
headers={"Authorization": f"Bearer {OPENAI_API_KEY}"},
json={
"model": "gpt-4",
"messages": [{"role": "user", "content": prompt}]
},
timeout=60.0
)
if response.status_code == 429:
# Add jitter to prevent thundering herd
await asyncio.sleep(random.uniform(0, 2))
response.raise_for_status()
return response.json()
# Limit concurrency and request rate client-side
from asyncio import Semaphore
class LLMClient:
def __init__(self, max_concurrent=5, max_per_minute=50):
self.semaphore = Semaphore(max_concurrent)
self.rate_limiter = TokenBucket(max_per_minute, 60)
async def generate(self, prompt: str) -> str:
async with self.semaphore: # Limit concurrent requests
await self.rate_limiter.acquire() # Rate limit
return await self._call_api(prompt)
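TokenBucket above is not a standard-library class; a minimal asyncio implementation consistent with that usage (a capacity of tokens refilled over a period in seconds) might look like:
import asyncio
import time

class TokenBucket:
    """Allow at most `capacity` acquisitions per `period` seconds."""

    def __init__(self, capacity: int, period: float):
        self.capacity = capacity
        self.period = period
        self.tokens = float(capacity)
        self.updated = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        async with self._lock:
            while True:
                now = time.monotonic()
                # Refill proportionally to elapsed time, capped at capacity
                self.tokens = min(
                    self.capacity,
                    self.tokens + (now - self.updated) * self.capacity / self.period,
                )
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                # Sleep roughly until the next token becomes available
                await asyncio.sleep(self.period / self.capacity)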
Scenario B: Service unavailable (503 errors)
# Implement circuit breaker pattern
from circuitbreaker import circuit
@circuit(failure_threshold=5, recovery_timeout=60)
async def call_llm_with_circuit_breaker(prompt: str) -> str:
"""Call LLM API with circuit breaker"""
try:
return await call_llm_api(prompt)
except Exception as e:
logger.error("LLM API call failed", error=str(e))
raise
# Circuit opens after 5 failures, waits 60s before retry
# Implement fallback to alternative provider
async def generate_with_fallback(prompt: str) -> str:
"""Try primary provider, fallback to secondary"""
try:
return await openai_client.generate(prompt)
except Exception as e:
logger.warning(
"OpenAI failed, falling back to Anthropic",
error=str(e)
)
return await anthropic_client.generate(prompt)
Scenario C: Timeout errors
# Increase timeout for long-running requests
client = httpx.AsyncClient(
timeout=httpx.Timeout(
connect=5.0,
read=120.0, # 2 minutes for completion
write=5.0,
pool=5.0
)
)
# Stream responses for long generations
async def stream_llm_response(prompt: str):
    """Stream LLM response chunks"""
    async with client.stream(
        "POST",
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {OPENAI_API_KEY}"},
        json={
            "model": "gpt-4",
            "messages": [{"role": "user", "content": prompt}],
            "stream": True
        }
    ) as response:
        async for chunk in response.aiter_bytes():
            yield chunk
Scenario D: Authentication errors
# Rotate API key
# Update .env file
OPENAI_API_KEY=sk-new-key-here
# Reload configuration
docker compose up -d orchestrator
# Kubernetes: Update secret
kubectl create secret generic octollm-secrets \
--from-literal=OPENAI_API_KEY=sk-new-key \
--dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment orchestrator -n octollm
Prevention
- Rate limiting: Implement token bucket or leaky bucket
- Circuit breakers: Prevent cascading failures
- Retries: Use exponential backoff with jitter
- Fallback providers: Have secondary LLM provider
- Caching: Cache LLM responses when possible
- Monitoring: Track API error rates and costs
Cache Performance Issues
Symptoms
- Low cache hit rate (<50%)
- Redis memory full
- Slow cache lookups
- Alert: `CacheMissRate`
Diagnosis
Step 1: Check cache hit rate
# Query Prometheus
curl -G 'http://localhost:9090/api/v1/query' \
--data-urlencode 'query=rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m]))'
Step 2: Check Redis stats
# Redis info
docker compose exec redis redis-cli INFO stats
# Check memory usage
docker compose exec redis redis-cli INFO memory
# Check key count
docker compose exec redis redis-cli DBSIZE
# Sample keys
docker compose exec redis redis-cli --scan --pattern "*" | head -20
Step 3: Analyze cache usage patterns
# Monitor cache commands in real time (MONITOR is high-overhead; use briefly)
docker compose exec redis redis-cli MONITOR
# Check slow queries
docker compose exec redis redis-cli SLOWLOG GET 10
Resolution
Scenario A: Cache eviction policy issues
# Check current policy
docker compose exec redis redis-cli CONFIG GET maxmemory-policy
# Set appropriate policy for use case
docker compose exec redis redis-cli CONFIG SET maxmemory-policy allkeys-lru
# Options:
# - allkeys-lru: Evict any key, LRU
# - volatile-lru: Evict keys with TTL, LRU
# - allkeys-lfu: Evict any key, LFU (least frequently used)
# - volatile-ttl: Evict keys with shortest TTL
Scenario B: Inefficient cache keys
# Bad: Too specific keys (low hit rate)
cache_key = f"task:{task_id}:{user_id}:{timestamp}"
# Good: Normalized keys
cache_key = f"task:{task_id}"
# Bad: Large values cached
await redis.set("large_dataset", json.dumps(huge_object)) # MB of data
# Good: Cache references or summaries
await redis.set(f"dataset:{id}:summary", summary) # Small summary
# Store full data in database
Scenario C: Missing cache warming
# Implement cache warming on startup
@app.on_event("startup")
async def warm_cache():
"""Pre-populate cache with frequently accessed data"""
logger.info("Warming cache...")
# Load arm capabilities
arms = await db.query("SELECT * FROM arms WHERE enabled = true")
for arm in arms:
await redis.setex(
f"arm:capability:{arm.name}",
3600,
json.dumps(arm.capabilities)
)
# Load common entity relationships
entities = await db.query(
"SELECT * FROM entities WHERE access_count > 100"
)
for entity in entities:
await redis.setex(
f"entity:{entity.id}",
3600,
json.dumps(entity.dict())
)
logger.info(f"Cache warmed with {len(arms) + len(entities)} entries")
Scenario D: Cache stampede
# Prevent cache stampede with locking
import asyncio
import json
import uuid
from contextlib import asynccontextmanager

class CacheWithLock:
    def __init__(self, redis_client):
        # Assumes a redis.asyncio client with decode_responses=True,
        # so GET returns str and the lock-owner check below compares equal
        self.redis = redis_client
@asynccontextmanager
async def lock(self, key: str):
"""Acquire lock for cache key"""
lock_key = f"lock:{key}"
lock_id = str(uuid.uuid4())
# Try to acquire lock
while not await self.redis.set(lock_key, lock_id, nx=True, ex=10):
await asyncio.sleep(0.1) # Wait for lock
try:
yield
finally:
# Release lock
if await self.redis.get(lock_key) == lock_id:
await self.redis.delete(lock_key)
async def get_or_compute(self, key: str, compute_fn):
"""Get from cache or compute with lock"""
# Try cache first
cached = await self.redis.get(key)
if cached:
return json.loads(cached)
# Cache miss - acquire lock to prevent stampede
async with self.lock(key):
# Double-check cache (another thread may have computed)
cached = await self.redis.get(key)
if cached:
return json.loads(cached)
# Compute value
value = await compute_fn()
# Cache result
await self.redis.setex(key, 3600, json.dumps(value))
return value
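Usage sketch, assuming a redis.asyncio client; the second cache read inside the lock is what stops every waiting request from recomputing the same value:
cache = CacheWithLock(redis_client)

async def load_entity():
    # Stand-in for an expensive DB or LLM call
    return {"id": 42, "name": "example"}

entity = await cache.get_or_compute("entity:42", load_entity)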
Prevention
- Appropriate TTLs: Set expiration based on data change frequency
- Cache warming: Pre-populate cache on startup
- Consistent keys: Use normalized cache keys
- Monitoring: Track hit rate and memory usage
- Eviction policy: Choose policy matching access patterns
Resource Exhaustion
Symptoms
- CPU at 100%
- Memory at limit
- Disk space full
- Alert: `HighCPUUsage`, `HighMemoryUsage`, or `DiskSpaceLow`
Diagnosis
# Check resource usage
docker stats
# Kubernetes
kubectl top pods -n octollm
kubectl top nodes
# Check disk usage
df -h
docker system df
# Identify resource-heavy processes
docker compose exec orchestrator top
Resolution
CPU exhaustion:
# Identify CPU-heavy services
docker stats --no-stream | sort -k3 -hr
# Scale horizontally
kubectl scale deployment orchestrator --replicas=3 -n octollm
# Optimize code (add CPU profiling)
python -m cProfile app/main.py
Memory exhaustion:
# Clear Redis cache (destructive: removes every key in the current DB)
docker compose exec redis redis-cli FLUSHDB
# Restart services
docker compose restart
# Increase limits
kubectl edit deployment orchestrator -n octollm
Disk exhaustion:
# Clean up Docker
docker system prune -a --volumes
# Rotate logs: cap log size via the json-file driver in docker-compose.yml
# (logging: options: max-size: "10m", max-file: "3")
# Or truncate an oversized container log on the Docker host:
sudo truncate -s 0 $(docker inspect --format='{{.LogPath}}' <container>)
# Clean old backups
find /backups -mtime +30 -delete
Prevention
- Resource limits: Set CPU/memory limits
- Auto-scaling: Configure HPA in Kubernetes (see the sketch below)
- Monitoring: Alert on resource usage
- Log rotation: Limit log file sizes
- Regular cleanup: Schedule cleanup jobs
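For the auto-scaling item, a minimal HorizontalPodAutoscaler sketch (assumes metrics-server is installed and the orchestrator Deployment sets CPU requests):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orchestrator
  namespace: octollm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orchestrator
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70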
Security Violations
Symptoms
- Alert: `SecurityViolationDetected`
- PII detected in logs
- Suspicious commands blocked
- Unauthorized access attempts
Diagnosis
# Check security logs
docker compose logs guardian-arm | grep -i "violation"
# Query security metrics
curl -G 'http://localhost:9090/api/v1/query' \
--data-urlencode 'query=security_violations_total'
Resolution
# Review and update security rules
# In guardian-arm configuration
# Review the blocked-command list
docker compose exec guardian-arm cat /app/config/blocked_commands.txt
# Review PII detection patterns
docker compose logs guardian-arm | grep "PII detected"
# Update firewall rules if needed
Prevention
- Input validation: Validate all user inputs
- PII detection: Scan all inputs/outputs (illustrative sketch below)
- Audit logging: Log all security events
- Regular audits: Review security logs
- Security training: Educate team on security
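For the PII-detection item, a minimal regex-based scanner sketch; the real guardian-arm rules are configuration-driven, and these two patterns are illustrative only:
import re

# Illustrative patterns only; production rules live in guardian-arm config
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_pii(text: str) -> list[str]:
    """Return the names of PII categories detected in text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]

assert find_pii("contact: alice@example.com") == ["email"]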
Data Corruption
Symptoms
- Invalid data in database
- Foreign key violations
- Inconsistent entity relationships
- Application errors due to malformed data
Diagnosis
-- Check for orphaned records
SELECT * FROM relationships r
LEFT JOIN entities e1 ON r.from_entity_id = e1.entity_id
WHERE e1.entity_id IS NULL;
-- Check for invalid JSON
SELECT * FROM entities
WHERE jsonb_typeof(properties) != 'object';
-- Check constraints
SELECT conname, pg_get_constraintdef(oid)
FROM pg_constraint
WHERE conrelid = 'tasks'::regclass;
Resolution
-- Fix orphaned relationships
DELETE FROM relationships
WHERE from_entity_id NOT IN (SELECT entity_id FROM entities)
OR to_entity_id NOT IN (SELECT entity_id FROM entities);
-- Fix invalid JSON
UPDATE entities
SET properties = '{}'::jsonb
WHERE jsonb_typeof(properties) != 'object';
# Restore from backup if needed
docker compose exec -T postgres psql -U octollm octollm < backup.sql
Prevention
- Foreign keys: Use database constraints
- Validation: Validate data before insert
- Transactions: Use atomic operations (see the sketch below)
- Backups: Regular automated backups
- Testing: Test data integrity
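For the transactions item, a minimal asyncpg sketch matching the pool style shown earlier: both inserts commit or roll back together, so a failure cannot leave an orphaned relationship (column names follow the queries in this playbook):
import json

async def add_entity_with_relationship(pool, entity, rel):
    async with pool.acquire() as conn:
        async with conn.transaction():  # both inserts commit or roll back together
            await conn.execute(
                "INSERT INTO entities (entity_id, properties) VALUES ($1, $2)",
                entity["entity_id"],
                json.dumps(entity.get("properties", {})),  # jsonb column expects a JSON string here
            )
            await conn.execute(
                "INSERT INTO relationships (from_entity_id, to_entity_id) VALUES ($1, $2)",
                rel["from_entity_id"],
                rel["to_entity_id"],
            )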
Quick Reference
Common Commands
# Check service health
curl http://localhost:8000/health
# View logs
docker compose logs -f [service]
# Restart service
docker compose restart [service]
# Check resource usage
docker stats
# Access database
docker compose exec postgres psql -U octollm
# Access Redis
docker compose exec redis redis-cli
# Check metrics
curl http://localhost:9090/metrics
Emergency Procedures
Complete system restart:
# Stop all services
docker compose down
# Full reset (caution: -v also deletes named volumes, including database data)
docker compose down -v
# Start services
docker compose up -d
# Verify health
./scripts/healthcheck.sh
Rollback deployment (Kubernetes):
# View rollout history
kubectl rollout history deployment orchestrator -n octollm
# Rollback to previous version
kubectl rollout undo deployment orchestrator -n octollm
# Rollback to specific revision
kubectl rollout undo deployment orchestrator --to-revision=3 -n octollm
Escalation Procedures
Level 1: On-call Engineer
- Service unavailable
- High latency
- Database connection issues
Actions:
- Follow relevant playbook
- Restart affected services
- Escalate if unresolved in 15 minutes
Level 2: Senior Engineer
- Memory leaks
- Resource exhaustion
- Data corruption
Actions:
- Deep diagnosis with profiling
- Code fixes if needed
- Escalate to engineering lead if architectural issue
Level 3: Engineering Lead
- Security violations
- Architectural issues
- Multi-service failures
Actions:
- Coordinate team response
- Make architectural decisions
- Communicate with stakeholders
See Also
- Monitoring and Alerting - Set up observability
- Performance Tuning - Optimize performance
- Kubernetes Deployment - Production deployment
- Docker Compose Setup - Local setup