OctoLLM Monitoring Runbook
Last Updated: 2025-11-12
Version: 1.0.0
Status: Active
Audience: Site Reliability Engineers, DevOps, On-Call Engineers
Table of Contents
- Overview
- Quick Access
- Grafana Usage
- Prometheus Usage
- Loki Log Queries
- Jaeger Trace Analysis
- Alert Investigation
- Common Troubleshooting Scenarios
- Escalation Procedures
- Appendix
Overview
This runbook provides step-by-step procedures for using the OctoLLM monitoring stack to investigate issues, analyze performance, and respond to alerts.
Monitoring Stack Components
| Component | Purpose | Access URL | Port |
|---|---|---|---|
| Grafana | Visualization and dashboards | https://grafana.octollm.dev | 3000 |
| Prometheus | Metrics collection and alerts | Port-forward only (prod) | 9090 |
| Loki | Log aggregation | Via Grafana datasource | 3100 |
| Jaeger | Distributed tracing | https://jaeger.octollm.dev | 16686 |
| Alertmanager | Alert routing | Port-forward only | 9093 |
Key Metrics
| Metric | Target | Critical Threshold |
|---|---|---|
| P99 Latency | < 30s | > 30s |
| Error Rate | < 1% | > 10% |
| CPU Usage | < 60% | > 80% |
| Memory Usage | < 70% | > 85% |
| Cache Hit Rate | > 60% | < 40% |
Quick Access
Access Grafana (Production)
# Via browser (recommended)
open https://grafana.octollm.dev
# Default credentials (change immediately!)
Username: admin
Password: (stored in Kubernetes secret)
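If you need the actual admin password, it can usually be read straight from that Kubernetes secret. A minimal sketch, assuming the secret is named grafana, lives in octollm-monitoring, and stores the password under an admin-password key (all of these names are assumptions; verify with kubectl get secrets -n octollm-monitoring):
# Decode the Grafana admin password (secret and key names are assumptions)
kubectl get secret grafana -n octollm-monitoring \
  -o jsonpath='{.data.admin-password}' | base64 -d; echo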
Access Prometheus (Port-Forward)
# Production environment
kubectl port-forward -n octollm-monitoring svc/prometheus 9090:9090
# Access at http://localhost:9090
Access Jaeger UI
# Via browser
open https://jaeger.octollm.dev
Access Alertmanager (Port-Forward)
kubectl port-forward -n octollm-monitoring svc/alertmanager 9093:9093
# Access at http://localhost:9093
Grafana Usage
Available Dashboards
OctoLLM provides 6 comprehensive dashboards:
1. GKE Cluster Overview (octollm-gke-cluster)
   - Cluster-level CPU and memory usage
   - Node count and pod status
   - Resource utilization by namespace
2. Development Namespace (octollm-namespace-dev)
   - Per-pod CPU and memory usage
   - Container restart counts
   - Request/limit utilization
3. Staging Namespace (octollm-namespace-staging)
   - Similar to dev, focused on the staging environment
4. Production Namespace (octollm-namespace-prod)
   - Similar to dev, focused on the production environment
5. Service Health (octollm-service-health)
   - Request rates by service
   - Error rates (5xx responses)
   - P50/P95/P99 latency
   - Database and Redis connections
6. Logs Overview (octollm-logs)
   - Log volume by service
   - Error rate visualization
   - Top 10 error messages
   - Live log stream
How to Navigate Dashboards
- Open Grafana: https://grafana.octollm.dev
- Navigate to Dashboards: Click the "Dashboards" icon (four squares) in the left sidebar
- Select OctoLLM Folder: All OctoLLM dashboards are in the "OctoLLM" folder
- Time Range: Use the time picker (top-right) to adjust the time range
- Default: Last 1 hour
- Recommended for troubleshooting: Last 6 hours or Last 24 hours
- Refresh Rate: Set auto-refresh (top-right dropdown)
- Recommended: 30s for live monitoring
Common Dashboard Tasks
Check Overall System Health
- Open GKE Cluster Overview dashboard
- Check the gauge panels:
- CPU Usage < 80%? ✅ Healthy
- Memory Usage < 85%? ✅ Healthy
- All pods Running? ✅ Healthy
- Scroll to "Resource Utilization" row
- Check time series graphs for trends (spikes, sustained high usage)
Investigate High Error Rate
- Open Service Health dashboard
- Locate "Error Rate by Service (5xx)" panel
- Identify which service has elevated errors
- Note the timestamp when errors started
- Jump to Logs Overview dashboard
- Filter logs by service and error level
- Review "Top 10 Error Messages" for patterns
Analyze Service Latency
- Open Service Health dashboard
- Scroll to "Latency Metrics" row
- Compare P50, P95, and P99 latency panels
- Identify services exceeding thresholds:
- P95 > 2s → Warning
- P99 > 10s → Warning
- P99 > 30s → Critical
- If latency is high, jump to Jaeger for trace analysis
Monitor Database Connections
- Open Service Health dashboard
- Scroll to "Database Connections" row
- Check PostgreSQL connection pool usage:
- Active connections < 10 (max 15) → Healthy
- If active ≥ 10 → Investigate slow queries
- Check Redis connection pool:
- Active + Idle < 20 → Healthy
View Namespace-Specific Metrics
- Open the appropriate namespace dashboard:
  - octollm-dev for development
  - octollm-staging for staging
  - octollm-prod for production
- Review "Pod Status" panel:
- All Running? ✅
- Any Failed or Pending? Investigate
- Check "CPU Usage by Pod" and "Memory Usage by Pod"
- Identify resource-hungry pods
- Review "Container Restarts" panel:
- 0 restarts → Healthy
- 1-2 restarts → Monitor
- 3+ restarts → Investigate (likely CrashLoopBackOff)
Creating Custom Dashboards
If you need to create a custom dashboard:
- Click "+" in the left sidebar
- Select "Dashboard"
- Click "Add new panel"
- Select datasource: Prometheus, Loki, or Jaeger
- Write PromQL, LogQL, or trace query
- Configure visualization (time series, gauge, table, etc.)
- Save dashboard with descriptive name and tags
Prometheus Usage
Accessing Prometheus UI
Prometheus is not exposed publicly for security reasons. Use port-forwarding:
# Forward Prometheus port
kubectl port-forward -n octollm-monitoring svc/prometheus 9090:9090
# Access at http://localhost:9090
Writing PromQL Queries
CPU Usage Query
# Average CPU usage across all nodes
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# CPU usage by specific service
sum(rate(container_cpu_usage_seconds_total{namespace="octollm-prod",pod=~"orchestrator.*"}[5m]))
Memory Usage Query
# Memory usage percentage
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
# Memory usage by pod
sum(container_memory_working_set_bytes{namespace="octollm-prod",pod=~"orchestrator.*"})
Request Rate Query
# Total request rate across all services
sum(rate(http_requests_total{namespace=~"octollm.*"}[5m]))
# Request rate by service
sum(rate(http_requests_total{namespace=~"octollm.*"}[5m])) by (job)
Error Rate Query
# Error rate (5xx responses) as percentage
(
sum(rate(http_requests_total{status=~"5..",namespace=~"octollm.*"}[5m]))
/
sum(rate(http_requests_total{namespace=~"octollm.*"}[5m]))
) * 100
Latency Query (P95, P99)
# P95 latency by service
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{namespace=~"octollm.*"}[5m])) by (job, le))
# P99 latency by service
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{namespace=~"octollm.*"}[5m])) by (job, le))
Database Connection Pool Query
# Active database connections
sum(db_connections_active) by (job)
# Connection pool usage percentage
(db_connections_active / (db_connections_active + db_connections_idle)) * 100
Checking Alert Rules
- In Prometheus UI, click "Alerts" in the top menu
- View all configured alert rules
- Check status:
- Inactive (green) → Rule condition not met, no alert
- Pending (yellow) → Rule condition met, waiting for the rule's for: duration to elapse
- Click on an alert name to see:
- Full alert query
- Current value
- Labels and annotations
- Active alerts (if firing)
Checking Alertmanager Status
Port-forward Alertmanager:
kubectl port-forward -n octollm-monitoring svc/alertmanager 9093:9093
Access http://localhost:9093:
- Alerts Tab: View all active alerts
- Silences Tab: View and create alert silences
- Status Tab: View Alertmanager configuration
Creating Alert Silences
If you need to temporarily suppress alerts (e.g., during maintenance):
- Access Alertmanager UI (port-forward)
- Click "Silences" tab
- Click "New Silence"
- Fill in:
  - Matchers: alertname="HighCPUUsage" or namespace="octollm-prod"
  - Start: Now
  - Duration: 1h, 4h, 24h, etc.
  - Creator: Your name/email
  - Comment: Reason for silence (e.g., "Planned maintenance")
- Click "Create"
Loki Log Queries
Accessing Loki via Grafana
- Open Grafana: https://grafana.octollm.dev
- Click "Explore" (compass icon) in left sidebar
- Select "Loki" datasource from dropdown (top-left)
- Write LogQL queries
LogQL Syntax Basics
# Basic log stream selector
{namespace="octollm-prod"}
# Filter by pod
{namespace="octollm-prod", pod=~"orchestrator.*"}
# Filter by log level
{namespace="octollm-prod", level="error"}
# Filter by service label
{service="orchestrator", level="error"}
# Combine multiple filters
{namespace="octollm-prod", service="orchestrator", level=~"error|warn"}
Common Log Queries
View All Logs from a Service
{namespace="octollm-prod", service="orchestrator"}
View Error Logs Only
{namespace="octollm-prod", level="error"}
Search for Specific Text in Logs
{namespace="octollm-prod"} |= "database connection failed"
Filter Out Specific Text
{namespace="octollm-prod"} != "health check"
Parse JSON Logs and Filter by Field
{namespace="octollm-prod"} | json | status_code >= 500
Count Error Rate Over Time
sum(rate({namespace="octollm-prod", level="error"}[1m])) by (service)
Top 10 Error Messages
topk(10, sum(count_over_time({namespace="octollm-prod", level="error"}[1h])) by (message))
Find Slow Requests (>1s)
{namespace="octollm-prod"} | json | duration > 1.0
Investigating Errors with Logs
Scenario: You receive an alert for high error rate in the orchestrator service.
- Open Grafana Explore
- Select Loki datasource
- Query error logs: {namespace="octollm-prod", service="orchestrator", level="error"}
- Adjust the time range to when the alert started (e.g., the last 1 hour)
- Review log messages for patterns:
- Database connection errors?
- LLM API errors (rate limiting, timeouts)?
- Internal exceptions?
- Identify the error message that appears most frequently
- Click on a log line to expand full details:
- Trace ID (if available) → Jump to Jaeger
- Request ID → Correlate with other logs
- Stack trace → Identify code location
- Check surrounding logs (context) by clicking "Show Context"
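If you prefer the terminal, the same LogQL query can be run with Grafana's logcli, assuming it is installed and Loki is reachable; the Loki service name below is an assumption, verify it with kubectl get svc -n octollm-monitoring:
# In one terminal: port-forward Loki (service name is an assumption)
kubectl port-forward -n octollm-monitoring svc/loki 3100:3100
# In another terminal: run the same error-log query for the last hour
logcli query --addr=http://localhost:3100 --since=1h --limit=200 \
  '{namespace="octollm-prod", service="orchestrator", level="error"}'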
Jaeger Trace Analysis
Accessing Jaeger UI
# Via browser
open https://jaeger.octollm.dev
Searching for Traces
- Service Dropdown: Select service (e.g., orchestrator)
- Operation Dropdown: Select operation (e.g., /api/v1/tasks)
- Tags: Add filters (e.g., http.status_code=500)
- Lookback: Select time range (e.g., last 1 hour)
- Click "Find Traces"
Understanding Trace Visualizations
Trace Timeline View
- Horizontal bars: Each bar is a span (operation)
- Bar length: Duration of operation
- Vertical position: Parent-child relationships (nested = child span)
- Color: Service name (different services have different colors)
Trace Details
Click on a trace to view details:
- Trace Summary (top):
  - Total duration
  - Number of spans
  - Service count
  - Errors (if any)
- Span List (left):
  - Hierarchical view of all spans
  - Duration and start time for each span
- Span Details (right, when clicked):
  - Operation name
  - Tags (metadata): http.method, http.url, http.status_code, etc.
  - Logs (events within the span)
  - Process info: service name, instance ID
Common Trace Analysis Scenarios
Investigate High Latency
Scenario: P99 latency for /api/v1/tasks exceeds 10 seconds.
- Open Jaeger UI
- Select service: orchestrator
- Select operation: /api/v1/tasks (or POST /api/v1/tasks)
- Set lookback: Last 1 hour
- Sort by: Duration (descending)
- Click on the slowest trace
- Analyze the trace:
  - Which span took the longest?
  - Database query? (look for spans with db.* tags)
  - LLM API call? (look for spans with llm.* tags)
  - Network call? (look for spans with http.client.* tags)
- Drill down into the slow span:
- Check tags for query parameters, request size, etc.
- Check logs for error messages or warnings
- Compare with fast traces:
- Find a trace with normal latency
- Compare span durations to identify the bottleneck
Find Errors in Traces
- Open Jaeger UI
- Select service
- Add tag filter: error=true
- Click "Find Traces"
- Click on a trace with errors (marked with red icon)
- Identify error span:
- Look for red bar in timeline
- Check span tags for error.message or exception.type
- Check span logs for a stack trace
- Understand error context:
- What was the request?
- Which service/operation failed?
- Was it a client error (4xx) or server error (5xx)?
Trace End-to-End Request Flow
Scenario: Understand the complete flow of a request through all services.
- Open Jaeger UI
- Select service: orchestrator
- Find a recent successful trace
- Click on the trace
- Analyze the flow:
- Orchestrator receives request
- Reflex Layer preprocesses (fast, <10ms)
- Planner Arm decomposes task
- Executor Arm performs actions
- Judge Arm validates output
- Orchestrator returns response
- Check each span:
- Duration (is it reasonable?)
- Tags (what data was passed?)
- Logs (were there any warnings?)
Correlating Traces with Logs
If a trace has a trace_id, you can find related logs:
- Copy the trace_id from the Jaeger span
- Open Grafana Explore with the Loki datasource
- Query: {namespace="octollm-prod"} | json | trace_id="<PASTE_TRACE_ID>"
- View all logs related to that trace
Alert Investigation
Alert Severity Levels
| Severity | Response Time | Notification | Escalation |
|---|---|---|---|
| Critical | < 15 minutes | PagerDuty + Slack | Immediate |
| Warning | < 1 hour | Slack | After 4 hours |
| Info | Best effort | Slack (optional) | None |
Critical Alerts
PodCrashLoopBackOff
Alert: Pod <namespace>/<pod> is crash looping (>3 restarts in 10 minutes).
Investigation Steps:
- Check pod status:
  kubectl get pods -n <namespace>
  kubectl describe pod <pod-name> -n <namespace>
- View pod logs:
  kubectl logs <pod-name> -n <namespace> --previous
- Common causes:
  - Application startup failure (missing env vars, config errors)
  - OOMKilled (check kubectl describe pod for Reason: OOMKilled)
  - Liveness probe failure (misconfigured health check)
- Resolution:
  - If OOMKilled: Increase the memory limit (see the sketch below)
  - If config error: Fix the ConfigMap/Secret and restart
  - If code bug: Roll back the deployment
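For the OOMKilled case, one way to raise the memory limit without editing manifests by hand is kubectl set resources. A sketch with placeholder values; prefer updating the source manifests or Helm values afterwards so the change is not lost on the next deploy:
# Raise memory request/limit on the deployment (placeholder values; add -c <container> if the pod has multiple containers)
kubectl set resources deployment/<service> -n <namespace> \
  --requests=memory=512Mi --limits=memory=1Gi
# Watch the rollout
kubectl rollout status deployment/<service> -n <namespace>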
NodeNotReady
Alert: Kubernetes node <node> is not ready for >5 minutes.
Investigation Steps:
- Check node status:
  kubectl get nodes
  kubectl describe node <node-name>
- Check node conditions:
  - Ready=False → Node is down
  - MemoryPressure=True → Node is out of memory
  - DiskPressure=True → Node is out of disk space
- Check node logs (requires SSH access):
  gcloud compute ssh <node-name>
  journalctl -u kubelet -n 100
- Resolution:
  - If MemoryPressure: Drain the node, evict pods, add more nodes (see the sketch below)
  - If DiskPressure: Clear disk space, expand the volume
  - If the node is unresponsive: Replace the node
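For the MemoryPressure case, a typical drain sequence looks like the following sketch; confirm the remaining cluster capacity before evicting pods:
# Stop new pods from scheduling onto the node
kubectl cordon <node-name>
# Evict existing pods (DaemonSets stay; emptyDir data is deleted)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Once the node is healthy again (or replaced), allow scheduling
kubectl uncordon <node-name>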
HighErrorRate
Alert: Service <service> has error rate >10% for 5 minutes.
Investigation Steps:
- Open the Grafana Service Health dashboard
- Identify the service with high errors
- Check recent deployments:
  kubectl rollout history deployment/<service> -n <namespace>
- View error logs:
  {namespace="<namespace>", service="<service>", level="error"}
- Common causes:
  - A recent deployment introduced a bug
  - Downstream service failure (database, LLM API)
  - Configuration change
- Resolution:
  - If recent deployment: Roll back
    kubectl rollout undo deployment/<service> -n <namespace>
  - If downstream failure: Check dependent services
  - If config issue: Fix the ConfigMap/Secret
ServiceDown
Alert: Service <service> is unreachable for >2 minutes.
Investigation Steps:
- Check pod status:
  kubectl get pods -n <namespace> -l app=<service>
- Check service endpoints:
  kubectl get endpoints <service> -n <namespace>
- Check recent events:
  kubectl get events -n <namespace> --sort-by='.lastTimestamp'
- Resolution:
  - If no pods are running: Check the deployment spec and resource quotas
  - If pods are running but unhealthy: Check liveness/readiness probes
  - If the service is misconfigured: Fix the service selector (see the sketch below)
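To confirm a selector mismatch, compare the service's selector with the labels on the pods it is supposed to target. A quick sketch:
# Show the service's selector
kubectl get service <service> -n <namespace> -o jsonpath='{.spec.selector}'; echo
# Show pod labels in the namespace; the selector must match at least one Running pod
kubectl get pods -n <namespace> --show-labels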
DatabaseConnectionPoolExhausted
Alert: Database connection pool >95% utilization for 5 minutes.
Investigation Steps:
- Check active connections in Grafana
- Identify which service is using the most connections
- Check for connection leaks:
  - Are connections being properly closed?
  - Are there long-running queries?
- View slow queries (PostgreSQL):
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active'
  ORDER BY duration DESC;
- Resolution:
  - Kill slow/stuck queries (see the sketch below)
  - Increase the connection pool size (temporary)
  - Fix the connection leak in code
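To kill a slow or stuck query found above, pass its pid to pg_terminate_backend. A sketch assuming psql access via a $DATABASE_URL connection string (an assumption; use whatever connection method you normally do). Terminating a backend drops its work, so double-check the pid first:
# Terminate a specific backend (replace <pid> with the value from pg_stat_activity)
psql "$DATABASE_URL" -c "SELECT pg_terminate_backend(<pid>);"
# Or terminate everything active for longer than 5 minutes (use with care)
psql "$DATABASE_URL" -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity \
  WHERE state = 'active' AND now() - query_start > interval '5 minutes';"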
Warning Alerts
HighNodeCPUUsage
Alert: Node CPU usage >80% for 10 minutes.
Investigation Steps:
- Identify resource-hungry pods:
  kubectl top pods -n <namespace> --sort-by=cpu
- Check for CPU throttling:
  rate(container_cpu_cfs_throttled_seconds_total{namespace="<namespace>"}[5m])
- Resolution:
  - Scale down non-critical workloads
  - Increase CPU limits for pods
  - Scale out: add replicas (HorizontalPodAutoscaler, see the sketch below) or more cluster nodes
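If the load is legitimate, a HorizontalPodAutoscaler can add replicas automatically. A minimal sketch with placeholder targets; it assumes resource requests are set on the deployment and cluster metrics are available:
# Autoscale the deployment between 2 and 10 replicas, targeting 70% CPU
kubectl autoscale deployment/<service> -n <namespace> \
  --cpu-percent=70 --min=2 --max=10
# Check HPA status
kubectl get hpa -n <namespace>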
HighNodeMemoryUsage
Alert: Node memory usage >85% for 10 minutes.
Investigation Steps:
- Identify memory-hungry pods:
  kubectl top pods -n <namespace> --sort-by=memory
- Check for memory leaks:
  - Review application logs for OOM warnings
  - Check the memory usage trend (a gradual increase suggests a leak)
- Resolution:
  - Restart pods with memory leaks
  - Increase memory limits
  - Add more cluster nodes
Common Troubleshooting Scenarios
Scenario 1: Sudden Spike in Latency
Symptoms:
- P99 latency increased from 5s to 30s
- No increase in error rate
- Request rate unchanged
Investigation:
- Check Grafana Service Health dashboard
- Identify which service has high latency
- Open Jaeger, find slow traces
- Identify bottleneck span (database query, LLM call, etc.)
- Check database performance:
  rate(db_query_duration_seconds_sum[5m]) / rate(db_query_duration_seconds_count[5m])
- Check LLM API latency:
  {namespace="octollm-prod"} | json | llm_duration_seconds > 10
Resolution:
- If the database is slow: Check for missing indexes and slow queries (see the sketch below)
- If the LLM is slow: Check provider status, implement caching
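A quick way to spot the slowest queries is pg_stat_statements, assuming that extension is enabled and you have psql access via $DATABASE_URL (both assumptions; column names below are for PostgreSQL 13+):
# Top 10 statements by mean execution time (requires the pg_stat_statements extension)
psql "$DATABASE_URL" -c "
  SELECT calls, round(mean_exec_time::numeric, 1) AS mean_ms, left(query, 80) AS query
  FROM pg_stat_statements
  ORDER BY mean_exec_time DESC
  LIMIT 10;"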
Scenario 2: Service Keeps Restarting
Symptoms:
- Pod restart count increasing
- No obvious errors in logs
- Service health checks failing
Investigation:
- Check pod events:
  kubectl describe pod <pod-name> -n <namespace>
- Check for OOMKilled:
  - Look for Reason: OOMKilled in the pod status
  - Is the memory limit too low?
- Check the liveness probe:
  - Is the probe misconfigured (timeout too short)?
  - Is the health endpoint actually healthy?
- View logs from the previous container:
  kubectl logs <pod-name> -n <namespace> --previous
Resolution:
- If OOMKilled: Increase the memory limit
- If liveness probe: Adjust the probe settings (see the sketch below) or fix the health endpoint
- If application crash: Fix the code bug
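If the liveness probe is simply too strict, its timings can be relaxed with a JSON patch while the real fix lands in the manifests. A sketch only: the container index (0) and field values are assumptions, so check kubectl get deployment <service> -o yaml first:
# Give the liveness probe more time (placeholder values; container index 0 is an assumption)
kubectl patch deployment/<service> -n <namespace> --type=json -p '[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 5},
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold", "value": 6}
]'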
Scenario 3: Certificate Expiration
Symptoms:
- Alert: Certificate expiring in <7 days
- HTTPS services may be affected
Investigation:
- Check certificate expiration:
  kubectl get certificate -n <namespace>
- Check cert-manager logs:
  kubectl logs -n cert-manager deployment/cert-manager
- Check certificate renewal attempts:
  kubectl describe certificate <cert-name> -n <namespace>
Resolution:
- If cert-manager renewal failed: Check DNS and ACME challenge logs
- If manual renewal is needed:
  # cert-manager will automatically create a new certificate
  kubectl delete certificate <cert-name> -n <namespace>
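To double-check which certificate is actually being served (independent of cert-manager's view), inspect it with openssl; Grafana is used here as the example endpoint:
# Show the expiry date of the certificate currently served by Grafana
echo | openssl s_client -connect grafana.octollm.dev:443 \
  -servername grafana.octollm.dev 2>/dev/null | openssl x509 -noout -enddate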
Escalation Procedures
When to Escalate
Escalate to the next level if:
- Critical alert not resolved within 15 minutes
- Multiple critical alerts firing simultaneously
- Data loss or security incident suspected
- Root cause unclear after 30 minutes of investigation
- Infrastructure issue beyond application scope (GCP outage, network failure)
Escalation Contacts
| Level | Contact | Response Time | Scope |
|---|---|---|---|
| L1 | On-Call Engineer | < 15 min | Application-level issues |
| L2 | Senior SRE | < 30 min | Complex infrastructure issues |
| L3 | Platform Lead | < 1 hour | Critical system-wide incidents |
| L4 | CTO | < 2 hours | Business-critical outages |
Escalation Process
- Gather information:
  - Alert name and severity
  - Time the alert started
  - Services affected
  - Investigation steps taken so far
  - Current hypothesis
- Contact the next level:
  - PagerDuty (for critical alerts)
  - Slack #incidents channel
  - Phone (for P0/P1 incidents)
- Provide context:
  - Share Grafana dashboard links
  - Share relevant logs/traces
  - Describe impact (users affected, data loss risk)
- Continue investigating while waiting for a response
- Update the incident channel with progress
Appendix
Useful kubectl Commands
# Get all pods in namespace
kubectl get pods -n octollm-prod
# Describe pod (detailed info)
kubectl describe pod <pod-name> -n octollm-prod
# View pod logs
kubectl logs <pod-name> -n octollm-prod
# View logs from previous container (if restarted)
kubectl logs <pod-name> -n octollm-prod --previous
# Follow logs in real-time
kubectl logs -f <pod-name> -n octollm-prod
# Execute command in pod
kubectl exec -it <pod-name> -n octollm-prod -- /bin/bash
# Port-forward to pod
kubectl port-forward -n octollm-prod <pod-name> 8000:8000
# Get events in namespace
kubectl get events -n octollm-prod --sort-by='.lastTimestamp'
# Get top pods by CPU/memory
kubectl top pods -n octollm-prod --sort-by=cpu
kubectl top pods -n octollm-prod --sort-by=memory
# Rollback deployment
kubectl rollout undo deployment/<service> -n octollm-prod
# Scale deployment
kubectl scale deployment/<service> -n octollm-prod --replicas=5
# Delete pod (will be recreated by deployment)
kubectl delete pod <pod-name> -n octollm-prod
Useful PromQL Aggregations
# Sum
sum(metric_name) by (label)
# Average
avg(metric_name) by (label)
# Count
count(metric_name) by (label)
# Min/Max
min(metric_name) by (label)
max(metric_name) by (label)
# Top K
topk(10, metric_name)
# Bottom K
bottomk(10, metric_name)
# Rate (per-second)
rate(metric_name[5m])
# Increase (total over time)
increase(metric_name[1h])
# Histogram quantile (P95, P99)
histogram_quantile(0.95, rate(metric_bucket[5m]))
Useful LogQL Patterns
# Stream selector
{label="value"}
# Multiple labels
{label1="value1", label2="value2"}
# Regex match
{label=~"regex"}
# Negative regex
{label!~"regex"}
# Contains text
{label="value"} |= "search text"
# Doesn't contain text
{label="value"} != "exclude text"
# Regex filter
{label="value"} |~ "regex"
# JSON parsing
{label="value"} | json
# Rate (logs per second)
rate({label="value"}[1m])
# Count over time
count_over_time({label="value"}[1h])
# Aggregations
sum(count_over_time({label="value"}[1h])) by (service)
GCP Commands
# List GKE clusters
gcloud container clusters list
# Get cluster credentials
gcloud container clusters get-credentials octollm-prod --region us-central1
# List nodes
gcloud compute instances list
# SSH to node
gcloud compute ssh <node-name>
# View GCS buckets (for Loki logs)
gsutil ls gs://octollm-loki-logs
# View bucket contents
gsutil ls -r gs://octollm-loki-logs
# Check Cloud SQL instances
gcloud sql instances list
# Check Redis instances
gcloud redis instances list --region us-central1
End of Runbook
For additional assistance, contact:
- Slack: #octollm-sre
- PagerDuty: octollm-oncall
- Email: sre@octollm.dev