OctoLLM Monitoring Runbook

Last Updated: 2025-11-12
Version: 1.0.0
Status: Active
Audience: Site Reliability Engineers, DevOps, On-Call Engineers

Table of Contents

  1. Overview
  2. Quick Access
  3. Grafana Usage
  4. Prometheus Usage
  5. Loki Log Queries
  6. Jaeger Trace Analysis
  7. Alert Investigation
  8. Common Troubleshooting Scenarios
  9. Escalation Procedures
  10. Appendix

Overview

This runbook provides step-by-step procedures for using the OctoLLM monitoring stack to investigate issues, analyze performance, and respond to alerts.

Monitoring Stack Components

| Component | Purpose | Access URL | Port |
|---|---|---|---|
| Grafana | Visualization and dashboards | https://grafana.octollm.dev | 3000 |
| Prometheus | Metrics collection and alerts | Port-forward only (prod) | 9090 |
| Loki | Log aggregation | Via Grafana datasource | 3100 |
| Jaeger | Distributed tracing | https://jaeger.octollm.dev | 16686 |
| Alertmanager | Alert routing | Port-forward only | 9093 |

Key Metrics

| Metric | Target | Critical Threshold |
|---|---|---|
| P99 Latency | < 30s | > 30s |
| Error Rate | < 1% | > 10% |
| CPU Usage | < 60% | > 80% |
| Memory Usage | < 70% | > 85% |
| Cache Hit Rate | > 60% | < 40% |

Quick Access

Access Grafana (Production)

# Via browser (recommended)
open https://grafana.octollm.dev

# Default credentials (change immediately!)
Username: admin
Password: (stored in Kubernetes secret)
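
To retrieve the admin password from the cluster rather than looking it up elsewhere, a command along these lines should work; the secret and key names (grafana-admin-credentials, admin-password) are assumptions and may differ in your deployment:

# Retrieve the Grafana admin password (secret/key names are assumed; adjust to your deployment)
kubectl get secret -n octollm-monitoring grafana-admin-credentials \
  -o jsonpath='{.data.admin-password}' | base64 -d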

Access Prometheus (Port-Forward)

# Production environment
kubectl port-forward -n octollm-monitoring svc/prometheus 9090:9090

# Access at http://localhost:9090

Access Jaeger UI

# Via browser
open https://jaeger.octollm.dev

Access Alertmanager (Port-Forward)

kubectl port-forward -n octollm-monitoring svc/alertmanager 9093:9093

# Access at http://localhost:9093

Grafana Usage

Available Dashboards

OctoLLM provides 6 comprehensive dashboards:

  1. GKE Cluster Overview (octollm-gke-cluster)

    • Cluster-level CPU and memory usage
    • Node count and pod status
    • Resource utilization by namespace
  2. Development Namespace (octollm-namespace-dev)

    • Per-pod CPU and memory usage
    • Container restart counts
    • Request/limit utilization
  3. Staging Namespace (octollm-namespace-staging)

    • Similar to dev, focused on staging environment
  4. Production Namespace (octollm-namespace-prod)

    • Similar to dev, focused on production environment
  5. Service Health (octollm-service-health)

    • Request rates by service
    • Error rates (5xx responses)
    • P50/P95/P99 latency
    • Database and Redis connections
  6. Logs Overview (octollm-logs)

    • Log volume by service
    • Error rate visualization
    • Top 10 error messages
    • Live log stream

How to Navigate Dashboards

  1. Open Grafana: https://grafana.octollm.dev
  2. Navigate to Dashboards: Click the "Dashboards" icon (four squares) in the left sidebar
  3. Select OctoLLM Folder: All OctoLLM dashboards are in the "OctoLLM" folder
  4. Time Range: Use the time picker (top-right) to adjust the time range
    • Default: Last 1 hour
    • Recommended for troubleshooting: Last 6 hours or Last 24 hours
  5. Refresh Rate: Set auto-refresh (top-right dropdown)
    • Recommended: 30s for live monitoring

Common Dashboard Tasks

Check Overall System Health

  1. Open GKE Cluster Overview dashboard
  2. Check the gauge panels:
    • CPU Usage < 80%? ✅ Healthy
    • Memory Usage < 85%? ✅ Healthy
    • All pods Running? ✅ Healthy
  3. Scroll to "Resource Utilization" row
  4. Check time series graphs for trends (spikes, sustained high usage)

Investigate High Error Rate

  1. Open Service Health dashboard
  2. Locate "Error Rate by Service (5xx)" panel
  3. Identify which service has elevated errors
  4. Note the timestamp when errors started
  5. Jump to Logs Overview dashboard
  6. Filter logs by service and error level
  7. Review "Top 10 Error Messages" for patterns

Analyze Service Latency

  1. Open Service Health dashboard
  2. Scroll to "Latency Metrics" row
  3. Compare P50, P95, and P99 latency panels
  4. Identify services exceeding thresholds:
    • P95 > 2s → Warning
    • P99 > 10s → Warning
    • P99 > 30s → Critical
  5. If latency is high, jump to Jaeger for trace analysis

Monitor Database Connections

  1. Open Service Health dashboard
  2. Scroll to "Database Connections" row
  3. Check PostgreSQL connection pool usage:
    • Active connections < 10 (max 15) → Healthy
    • If active ≥ 10 → Investigate slow queries
  4. Check Redis connection pool:
    • Active + Idle < 20 → Healthy

View Namespace-Specific Metrics

  1. Open the appropriate namespace dashboard:
    • octollm-namespace-dev for development
    • octollm-namespace-staging for staging
    • octollm-namespace-prod for production
  2. Review "Pod Status" panel:
    • All Running? ✅
    • Any Failed or Pending? Investigate
  3. Check "CPU Usage by Pod" and "Memory Usage by Pod"
  4. Identify resource-hungry pods
  5. Review "Container Restarts" panel:
    • 0 restarts → Healthy
    • 1-2 restarts → Monitor
    • 3+ restarts → Investigate (likely CrashLoopBackOff)

Creating Custom Dashboards

If you need to create a custom dashboard:

  1. Click "+" in the left sidebar
  2. Select "Dashboard"
  3. Click "Add new panel"
  4. Select datasource: Prometheus, Loki, or Jaeger
  5. Write PromQL, LogQL, or trace query
  6. Configure visualization (time series, gauge, table, etc.)
  7. Save dashboard with descriptive name and tags

Prometheus Usage

Accessing Prometheus UI

Prometheus is not exposed publicly for security reasons. Use port-forwarding:

# Forward Prometheus port
kubectl port-forward -n octollm-monitoring svc/prometheus 9090:9090

# Access at http://localhost:9090
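
Once the port-forward is running, queries can also be issued without the UI through Prometheus's standard HTTP API; a minimal sketch (jq is optional, used only for readable output):

# Run an instant query against the port-forwarded Prometheus (/api/v1/query is the standard endpoint)
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{namespace=~"octollm.*"}[5m]))' | jq .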

Writing PromQL Queries

CPU Usage Query

# Average CPU usage across all nodes
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# CPU usage by specific service
sum(rate(container_cpu_usage_seconds_total{namespace="octollm-prod",pod=~"orchestrator.*"}[5m]))

Memory Usage Query

# Memory usage percentage
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))

# Memory usage by pod
sum(container_memory_working_set_bytes{namespace="octollm-prod",pod=~"orchestrator.*"})

Request Rate Query

# Total request rate across all services
sum(rate(http_requests_total{namespace=~"octollm.*"}[5m]))

# Request rate by service
sum(rate(http_requests_total{namespace=~"octollm.*"}[5m])) by (job)

Error Rate Query

# Error rate (5xx responses) as percentage
(
  sum(rate(http_requests_total{status=~"5..",namespace=~"octollm.*"}[5m]))
  /
  sum(rate(http_requests_total{namespace=~"octollm.*"}[5m]))
) * 100

Latency Query (P95, P99)

# P95 latency by service
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{namespace=~"octollm.*"}[5m])) by (job, le))

# P99 latency by service
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{namespace=~"octollm.*"}[5m])) by (job, le))

Database Connection Pool Query

# Active database connections
sum(db_connections_active) by (job)

# Connection pool usage percentage
(db_connections_active / (db_connections_active + db_connections_idle)) * 100

Checking Alert Rules

  1. In Prometheus UI, click "Alerts" in the top menu
  2. View all configured alert rules
  3. Check status:
    • Inactive (green) → Rule condition not met, no alert
    • Pending (yellow) → Rule condition met, waiting for the rule's "for" duration to elapse
    • Firing (red) → Alert is active, sent to Alertmanager
  4. Click on an alert name to see:
    • Full alert query
    • Current value
    • Labels and annotations
    • Active alerts (if firing)
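
For orientation, alert rules follow the standard Prometheus rule-file format. The sketch below shows what a HighErrorRate-style rule could look like; the group name, threshold, and annotations are illustrative, not the exact rules deployed for OctoLLM:

groups:
  - name: octollm-availability       # illustrative group name
    rules:
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5..",namespace=~"octollm.*"}[5m]))
            /
            sum(rate(http_requests_total{namespace=~"octollm.*"}[5m]))
          ) * 100 > 10
        for: 5m                      # the alert stays Pending until the condition holds this long
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 10% for 5 minutes"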

Checking Alertmanager Status

Port-forward Alertmanager:

kubectl port-forward -n octollm-monitoring svc/alertmanager 9093:9093

Access http://localhost:9093:

  1. Alerts Tab: View all active alerts
  2. Silences Tab: View and create alert silences
  3. Status Tab: View Alertmanager configuration

Creating Alert Silences

If you need to temporarily suppress alerts (e.g., during maintenance):

  1. Access Alertmanager UI (port-forward)
  2. Click "Silences" tab
  3. Click "New Silence"
  4. Fill in:
    • Matchers: e.g., alertname="HighCPUUsage" or namespace="octollm-prod" (multiple matchers are combined with AND)
    • Start: Now
    • Duration: 1h, 4h, 24h, etc.
    • Creator: Your name/email
    • Comment: Reason for silence (e.g., "Planned maintenance")
  5. Click "Create"
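
The same silence can be created from the command line with amtool, which ships with Alertmanager; a sketch, assuming the 9093 port-forward above is active:

# Create a 4-hour silence for HighCPUUsage via amtool (assumes the port-forward on 9093 is running)
amtool silence add alertname=HighCPUUsage \
  --alertmanager.url=http://localhost:9093 \
  --author="your.name@octollm.dev" \
  --comment="Planned maintenance" \
  --duration=4h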

Loki Log Queries

Accessing Loki via Grafana

  1. Open Grafana: https://grafana.octollm.dev
  2. Click "Explore" (compass icon) in left sidebar
  3. Select "Loki" datasource from dropdown (top-left)
  4. Write LogQL queries

LogQL Syntax Basics

# Basic log stream selector
{namespace="octollm-prod"}

# Filter by pod
{namespace="octollm-prod", pod=~"orchestrator.*"}

# Filter by log level
{namespace="octollm-prod", level="error"}

# Filter by service label
{service="orchestrator", level="error"}

# Combine multiple filters
{namespace="octollm-prod", service="orchestrator", level=~"error|warn"}

Common Log Queries

View All Logs from a Service

{namespace="octollm-prod", service="orchestrator"}

View Error Logs Only

{namespace="octollm-prod", level="error"}

Search for Specific Text in Logs

{namespace="octollm-prod"} |= "database connection failed"

Filter Out Specific Text

{namespace="octollm-prod"} != "health check"

Parse JSON Logs and Filter by Field

{namespace="octollm-prod"} | json | status_code >= 500

Count Error Rate Over Time

sum(rate({namespace="octollm-prod", level="error"}[1m])) by (service)

Top 10 Error Messages

topk(10, sum(count_over_time({namespace="octollm-prod", level="error"}[1h])) by (message))

Find Slow Requests (>1s)

{namespace="octollm-prod"} | json | duration > 1.0

Investigating Errors with Logs

Scenario: You receive an alert for high error rate in the orchestrator service.

  1. Open Grafana Explore
  2. Select Loki datasource
  3. Query error logs:
    {namespace="octollm-prod", service="orchestrator", level="error"}
    
  4. Adjust time range to when the alert started (e.g., last 1 hour)
  5. Review log messages for patterns:
    • Database connection errors?
    • LLM API errors (rate limiting, timeouts)?
    • Internal exceptions?
  6. Identify the error message that appears most frequently
  7. Click on a log line to expand full details:
    • Trace ID (if available) → Jump to Jaeger
    • Request ID → Correlate with other logs
    • Stack trace → Identify code location
  8. Check surrounding logs (context) by clicking "Show Context"

Jaeger Trace Analysis

Accessing Jaeger UI

# Via browser
open https://jaeger.octollm.dev

Searching for Traces

  1. Service Dropdown: Select service (e.g., orchestrator)
  2. Operation Dropdown: Select operation (e.g., /api/v1/tasks)
  3. Tags: Add filters (e.g., http.status_code=500)
  4. Lookback: Select time range (e.g., last 1 hour)
  5. Click "Find Traces"

Understanding Trace Visualizations

Trace Timeline View

  • Horizontal bars: Each bar is a span (operation)
  • Bar length: Duration of operation
  • Vertical position: Parent-child relationships (nested = child span)
  • Color: Service name (different services have different colors)

Trace Details

Click on a trace to view details:

  1. Trace Summary (top):

    • Total duration
    • Number of spans
    • Service count
    • Errors (if any)
  2. Span List (left):

    • Hierarchical view of all spans
    • Duration and start time for each span
  3. Span Details (right, when clicked):

    • Operation name
    • Tags (metadata): http.method, http.url, http.status_code, etc.
    • Logs (events within span)
    • Process info: Service name, instance ID

Common Trace Analysis Scenarios

Investigate High Latency

Scenario: P99 latency for /api/v1/tasks exceeds 10 seconds.

  1. Open Jaeger UI
  2. Select service: orchestrator
  3. Select operation: /api/v1/tasks (or POST /api/v1/tasks)
  4. Set lookback: Last 1 hour
  5. Sort by: Duration (descending)
  6. Click on the slowest trace
  7. Analyze the trace:
    • Which span took the longest?
    • Database query? (look for spans with db.* tags)
    • LLM API call? (look for spans with llm.* tags)
    • Network call? (look for spans with http.client.* tags)
  8. Drill down into the slow span:
    • Check tags for query parameters, request size, etc.
    • Check logs for error messages or warnings
  9. Compare with fast traces:
    • Find a trace with normal latency
    • Compare span durations to identify the bottleneck

Find Errors in Traces

  1. Open Jaeger UI
  2. Select service
  3. Add tag filter: error=true
  4. Click "Find Traces"
  5. Click on a trace with errors (marked with red icon)
  6. Identify error span:
    • Look for red bar in timeline
    • Check span tags for error.message or exception.type
    • Check span logs for stack trace
  7. Understand error context:
    • What was the request?
    • Which service/operation failed?
    • Was it a client error (4xx) or server error (5xx)?

Trace End-to-End Request Flow

Scenario: Understand the complete flow of a request through all services.

  1. Open Jaeger UI
  2. Select service: orchestrator
  3. Find a recent successful trace
  4. Click on the trace
  5. Analyze the flow:
    • Orchestrator receives request
    • Reflex Layer preprocesses (fast, <10ms)
    • Planner Arm decomposes task
    • Executor Arm performs actions
    • Judge Arm validates output
    • Orchestrator returns response
  6. Check each span:
    • Duration (is it reasonable?)
    • Tags (what data was passed?)
    • Logs (were there any warnings?)

Correlating Traces with Logs

If a trace has a trace_id, you can find related logs:

  1. Copy the trace_id from Jaeger span
  2. Open Grafana Explore with Loki datasource
  3. Query:
    {namespace="octollm-prod"} | json | trace_id="<PASTE_TRACE_ID>"
    
  4. View all logs related to that trace

Alert Investigation

Alert Severity Levels

| Severity | Response Time | Notification | Escalation |
|---|---|---|---|
| Critical | < 15 minutes | PagerDuty + Slack | Immediate |
| Warning | < 1 hour | Slack | After 4 hours |
| Info | Best effort | Slack (optional) | None |

Critical Alerts

PodCrashLoopBackOff

Alert: Pod <namespace>/<pod> is crash looping (>3 restarts in 10 minutes).

Investigation Steps:

  1. Check pod status:

    kubectl get pods -n <namespace>
    kubectl describe pod <pod-name> -n <namespace>
    
  2. View pod logs:

    kubectl logs <pod-name> -n <namespace> --previous
    
  3. Common causes:

    • Application startup failure (missing env vars, config errors)
    • OOMKilled (check kubectl describe pod for Reason: OOMKilled)
    • Liveness probe failure (misconfigured health check)
  4. Resolution:

    • If OOMKilled: Increase memory limit
    • If config error: Fix ConfigMap/Secret and restart
    • If code bug: Rollback deployment
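
If the container was OOMKilled, the memory limit can be raised in place, and a bad release can be rolled back; a sketch with placeholder names:

# Raise the memory limit on the affected container (deployment/container names are placeholders)
kubectl set resources deployment/<service> -n <namespace> \
  --containers=<container-name> --limits=memory=1Gi --requests=memory=512Mi

# Roll back if the crash loop started with a recent release
kubectl rollout undo deployment/<service> -n <namespace>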

NodeNotReady

Alert: Kubernetes node <node> is not ready for >5 minutes.

Investigation Steps:

  1. Check node status:

    kubectl get nodes
    kubectl describe node <node-name>
    
  2. Check node conditions:

    • Ready=False → Node is down
    • MemoryPressure=True → Node is out of memory
    • DiskPressure=True → Node is out of disk space
  3. Check node logs (requires SSH access):

    gcloud compute ssh <node-name>
    journalctl -u kubelet -n 100
    
  4. Resolution:

    • If MemoryPressure: Drain node, evict pods, add more nodes
    • If DiskPressure: Clear disk space, expand volume
    • If node unresponsive: Replace node
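
Draining a node under memory or disk pressure uses standard kubectl commands; a sketch:

# Stop new pods from scheduling onto the node, then evict existing workloads
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Once the node is healthy again (or replaced), allow scheduling
kubectl uncordon <node-name>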

HighErrorRate

Alert: Service <service> has error rate >10% for 5 minutes.

Investigation Steps:

  1. Open Grafana Service Health dashboard

  2. Identify the service with high errors

  3. Check recent deployments:

    kubectl rollout history deployment/<service> -n <namespace>
    
  4. View error logs:

    {namespace="<namespace>", service="<service>", level="error"}
    
  5. Common causes:

    • Recent deployment introduced bug
    • Downstream service failure (database, LLM API)
    • Configuration change
  6. Resolution:

    • If recent deployment: Rollback
      kubectl rollout undo deployment/<service> -n <namespace>
      
    • If downstream failure: Check dependent services
    • If config issue: Fix ConfigMap/Secret

ServiceDown

Alert: Service <service> is unreachable for >2 minutes.

Investigation Steps:

  1. Check pod status:

    kubectl get pods -n <namespace> -l app=<service>
    
  2. Check service endpoints:

    kubectl get endpoints <service> -n <namespace>
    
  3. Check recent events:

    kubectl get events -n <namespace> --sort-by='.lastTimestamp'
    
  4. Resolution:

    • If no pods running: Check deployment spec, resource quotas
    • If pods running but unhealthy: Check liveness/readiness probes
    • If service misconfigured: Fix service selector
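
To check for a selector mismatch, compare the service's selector with the labels on the pods it is supposed to target; a selector that matches no pods leaves the endpoints list empty:

# Print the service selector, then the pod labels it should match
kubectl get svc <service> -n <namespace> -o jsonpath='{.spec.selector}{"\n"}'
kubectl get pods -n <namespace> --show-labels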

DatabaseConnectionPoolExhausted

Alert: Database connection pool >95% utilization for 5 minutes.

Investigation Steps:

  1. Check active connections in Grafana

  2. Identify which service is using most connections

  3. Check for connection leaks:

    • Are connections being properly closed?
    • Are there long-running queries?
  4. View slow queries (PostgreSQL):

    SELECT pid, now() - query_start AS duration, query
    FROM pg_stat_activity
    WHERE state = 'active'
    ORDER BY duration DESC;
    
  5. Resolution:

    • Kill slow/stuck queries
    • Increase connection pool size (temporary)
    • Fix connection leak in code
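
Stuck queries found with the pg_stat_activity query above can be stopped with standard PostgreSQL functions; cancel first, and terminate only if the cancel has no effect:

-- Cancel the running query for a given backend PID
SELECT pg_cancel_backend(<pid>);

-- Terminate the backend entirely if it does not respond to the cancel
SELECT pg_terminate_backend(<pid>);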

Warning Alerts

HighNodeCPUUsage

Alert: Node CPU usage >80% for 10 minutes.

Investigation Steps:

  1. Identify resource-hungry pods:

    kubectl top pods -n <namespace> --sort-by=cpu
    
  2. Check for CPU throttling:

    rate(container_cpu_cfs_throttled_seconds_total{namespace="<namespace>"}[5m])
    
  3. Resolution:

    • Scale down non-critical workloads
    • Increase CPU limits for pods
    • Scale out the workload (HorizontalPodAutoscaler, see the sketch below) or add cluster nodes (Cluster Autoscaler)
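
A quick way to let a CPU-bound service scale out is a CPU-based HorizontalPodAutoscaler; a sketch with illustrative thresholds and replica counts:

# Create a CPU-based HorizontalPodAutoscaler for the hot service (thresholds are illustrative)
kubectl autoscale deployment/<service> -n <namespace> --cpu-percent=70 --min=2 --max=10

# Inspect autoscaler status
kubectl get hpa -n <namespace>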

HighNodeMemoryUsage

Alert: Node memory usage >85% for 10 minutes.

Investigation Steps:

  1. Identify memory-hungry pods:

    kubectl top pods -n <namespace> --sort-by=memory
    
  2. Check for memory leaks:

    • Review application logs for OOM warnings
    • Check memory usage trend (gradual increase = leak)
  3. Resolution:

    • Restart pods with memory leaks
    • Increase memory limits
    • Add more cluster nodes
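
A leaking deployment can be restarted without downtime using a rolling restart; a sketch:

# Trigger a rolling restart of the leaking deployment and wait for it to complete
kubectl rollout restart deployment/<service> -n <namespace>
kubectl rollout status deployment/<service> -n <namespace>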

Common Troubleshooting Scenarios

Scenario 1: Sudden Spike in Latency

Symptoms:

  • P99 latency increased from 5s to 30s
  • No increase in error rate
  • Request rate unchanged

Investigation:

  1. Check Grafana Service Health dashboard
    • Identify which service has high latency
  2. Open Jaeger, find slow traces
    • Identify bottleneck span (database query, LLM call, etc.)
  3. Check database performance:
    rate(db_query_duration_seconds_sum[5m]) / rate(db_query_duration_seconds_count[5m])
    
  4. Check LLM API latency:
    {namespace="octollm-prod"} | json | llm_duration_seconds > 10
    

Resolution:

  • If database slow: Check for missing indexes and slow queries (see the sketch below)
  • If LLM slow: Check provider status, implement caching
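
If the database is the bottleneck, tables that are sequentially scanned far more often than index-scanned are common candidates for missing indexes; a sketch using standard PostgreSQL statistics views:

-- Tables with high seq_scan relative to idx_scan are candidates for new indexes
SELECT relname, seq_scan, idx_scan, n_live_tup
FROM pg_stat_user_tables
ORDER BY seq_scan DESC
LIMIT 10;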

Scenario 2: Service Keeps Restarting

Symptoms:

  • Pod restart count increasing
  • No obvious errors in logs
  • Service health checks failing

Investigation:

  1. Check pod events:

    kubectl describe pod <pod-name> -n <namespace>
    
  2. Check for OOMKilled:

    • Look for Reason: OOMKilled in pod status
    • Memory limit too low
  3. Check liveness probe:

    • Is probe misconfigured (timeout too short)?
    • Is health endpoint actually healthy?
  4. View logs from previous container:

    kubectl logs <pod-name> -n <namespace> --previous
    

Resolution:

  • If OOMKilled: Increase memory limit
  • If liveness probe: Adjust probe settings or fix the health endpoint (see the sketch below)
  • If application crash: Fix code bug
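
When the liveness probe is the culprit, loosening its timing usually stops the restart loop. A hedged sketch of the relevant deployment fields; the path, port, and timings are illustrative, not the values used by OctoLLM:

# Illustrative liveness probe settings; adjust path, port, and timings to the actual service
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 30   # give the application time to start before the first probe
  periodSeconds: 10
  timeoutSeconds: 5         # avoid marking slow-but-healthy responses as failures
  failureThreshold: 3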

Scenario 3: Certificate Expiration

Symptoms:

  • Alert: Certificate expiring in <7 days
  • HTTPS services may be affected

Investigation:

  1. Check certificate expiration:

    kubectl get certificate -n <namespace>
    
  2. Check cert-manager logs:

    kubectl logs -n cert-manager deployment/cert-manager
    
  3. Check certificate renewal attempts:

    kubectl describe certificate <cert-name> -n <namespace>
    

Resolution:

  • If cert-manager renewal failed: Check DNS, ACME challenge logs
  • If manual renewal needed:
    kubectl delete certificate <cert-name> -n <namespace>
    # cert-manager will automatically create a new certificate
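
To confirm the actual expiry date of the certificate currently being served, the TLS secret can be inspected with openssl; a sketch (the secret name is a placeholder):

# Decode the served certificate from its TLS secret and print the expiry date
kubectl get secret <cert-secret-name> -n <namespace> \
  -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate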
    

Escalation Procedures

When to Escalate

Escalate to the next level if:

  1. Critical alert not resolved within 15 minutes
  2. Multiple critical alerts firing simultaneously
  3. Data loss or security incident suspected
  4. Root cause unclear after 30 minutes of investigation
  5. Infrastructure issue beyond application scope (GCP outage, network failure)

Escalation Contacts

| Level | Contact | Response Time | Scope |
|---|---|---|---|
| L1 | On-Call Engineer | < 15 min | Application-level issues |
| L2 | Senior SRE | < 30 min | Complex infrastructure issues |
| L3 | Platform Lead | < 1 hour | Critical system-wide incidents |
| L4 | CTO | < 2 hours | Business-critical outages |

Escalation Process

  1. Gather information:

    • Alert name and severity
    • Time alert started
    • Services affected
    • Investigation steps taken so far
    • Current hypothesis
  2. Contact next level:

    • PagerDuty (for critical alerts)
    • Slack #incidents channel
    • Phone (for P0/P1 incidents)
  3. Provide context:

    • Share Grafana dashboard links
    • Share relevant logs/traces
    • Describe impact (users affected, data loss risk)
  4. Continue investigation while waiting for response

  5. Update incident channel with progress


Appendix

Useful kubectl Commands

# Get all pods in namespace
kubectl get pods -n octollm-prod

# Describe pod (detailed info)
kubectl describe pod <pod-name> -n octollm-prod

# View pod logs
kubectl logs <pod-name> -n octollm-prod

# View logs from previous container (if restarted)
kubectl logs <pod-name> -n octollm-prod --previous

# Follow logs in real-time
kubectl logs -f <pod-name> -n octollm-prod

# Execute command in pod
kubectl exec -it <pod-name> -n octollm-prod -- /bin/bash

# Port-forward to pod
kubectl port-forward -n octollm-prod <pod-name> 8000:8000

# Get events in namespace
kubectl get events -n octollm-prod --sort-by='.lastTimestamp'

# Get top pods by CPU/memory
kubectl top pods -n octollm-prod --sort-by=cpu
kubectl top pods -n octollm-prod --sort-by=memory

# Rollback deployment
kubectl rollout undo deployment/<service> -n octollm-prod

# Scale deployment
kubectl scale deployment/<service> -n octollm-prod --replicas=5

# Delete pod (will be recreated by deployment)
kubectl delete pod <pod-name> -n octollm-prod

Useful PromQL Aggregations

# Sum
sum(metric_name) by (label)

# Average
avg(metric_name) by (label)

# Count
count(metric_name) by (label)

# Min/Max
min(metric_name) by (label)
max(metric_name) by (label)

# Top K
topk(10, metric_name)

# Bottom K
bottomk(10, metric_name)

# Rate (per-second)
rate(metric_name[5m])

# Increase (total over time)
increase(metric_name[1h])

# Histogram quantile (P95, P99)
histogram_quantile(0.95, rate(metric_bucket[5m]))

Useful LogQL Patterns

# Stream selector
{label="value"}

# Multiple labels
{label1="value1", label2="value2"}

# Regex match
{label=~"regex"}

# Negative regex
{label!~"regex"}

# Contains text
{label="value"} |= "search text"

# Doesn't contain text
{label="value"} != "exclude text"

# Regex filter
{label="value"} |~ "regex"

# JSON parsing
{label="value"} | json

# Rate (logs per second)
rate({label="value"}[1m])

# Count over time
count_over_time({label="value"}[1h])

# Aggregations
sum(count_over_time({label="value"}[1h])) by (service)

GCP Commands

# List GKE clusters
gcloud container clusters list

# Get cluster credentials
gcloud container clusters get-credentials octollm-prod --region us-central1

# List nodes
gcloud compute instances list

# SSH to node
gcloud compute ssh <node-name>

# View GCS buckets (for Loki logs)
gsutil ls gs://octollm-loki-logs

# View bucket contents
gsutil ls -r gs://octollm-loki-logs

# Check Cloud SQL instances
gcloud sql instances list

# Check Redis instances
gcloud redis instances list --region us-central1

End of Runbook

For additional assistance, contact:

  • Slack: #octollm-sre
  • PagerDuty: octollm-oncall
  • Email: sre@octollm.dev