ADR-002: Communication Patterns
Status: Accepted Date: 2025-11-10 Decision Makers: Architecture Team Consulted: Engineering Team
Context
OctoLLM has multiple components that need to communicate:
- Reflex Layer → Orchestrator (request preprocessing)
- Orchestrator → Arms (task execution)
- Arms → Arms (collaborative tasks)
- Arms → Memory Systems (knowledge retrieval/storage)
- Components → External Services (LLM APIs, webhooks)
Communication patterns must support:
- Synchronous request-response for task execution
- Asynchronous event notifications
- Low latency (<100ms for internal calls)
- Reliability and fault tolerance
- Observability and tracing
- Flexible routing and load balancing
Decision
We will use the following communication patterns:
1. HTTP/REST for Synchronous Operations
Use For:
- Reflex Layer → Orchestrator
- Orchestrator → Arms
- Arms → Memory Systems
- External API integrations
Protocol: HTTP/1.1 or HTTP/2 Format: JSON Authentication: JWT tokens with capability scopes
Example:
# Orchestrator calling Coder Arm
async def execute_code_task(task: TaskContract) -> str:
async with httpx.AsyncClient() as client:
response = await client.post(
"http://coder-arm:8102/execute",
json=task.dict(),
headers={
"Authorization": f"Bearer {capability_token}",
"X-Request-ID": request_id
},
timeout=30.0
)
return response.json()["output"]
Reasons:
- Universal protocol, widely understood
- Excellent debugging tools
- Native HTTP client libraries
- OpenAPI documentation support
- Load balancer integration
- Request/response tracing
2. Redis Pub/Sub for Event Notifications
Use For:
- Task completion events
- System health events
- Audit log events
- Cache invalidation signals
Pattern: Publish-subscribe Channels: Topic-based routing
Example:
# Publisher (Orchestrator)
await redis.publish(
"events:task:completed",
json.dumps({
"task_id": task.task_id,
"status": "completed",
"timestamp": datetime.utcnow().isoformat()
})
)
# Subscriber (Monitoring Service)
pubsub = redis.pubsub()
pubsub.subscribe("events:task:*")
async for message in pubsub.listen():
if message["type"] == "message":
event = json.loads(message["data"])
handle_task_event(event)
Reasons:
- Decoupled producers and consumers
- No blocking on publisher side
- Multiple subscribers supported
- Built into existing Redis infrastructure
- Low latency (<5ms)
- Simple implementation
3. Direct HTTP for Arm-to-Arm Communication
Use For:
- Coder Arm → Judge Arm (code validation)
- Planner Arm → Executor Arm (plan execution)
- Retriever Arm → other Arms (knowledge lookup)
Pattern: Direct service-to-service HTTP calls Discovery: Kubernetes DNS or service registry
Example:
# Coder Arm requesting validation from Judge Arm
async def validate_code(code: str) -> bool:
async with httpx.AsyncClient() as client:
response = await client.post(
"http://judge-arm:8103/validate",
json={"code": code, "language": "python"},
headers={"Authorization": f"Bearer {token}"}
)
return response.json()["is_valid"]
Reasons:
- Simple and direct
- Low latency
- Easy to trace with request IDs
- No message broker overhead
- Kubernetes service discovery
4. WebSocket for Real-Time Updates
Use For:
- Live task progress updates to clients
- Streaming LLM responses
- Real-time dashboard data
Protocol: WebSocket over HTTP Format: JSON messages
Example:
# Server
@app.websocket("/ws/tasks/{task_id}")
async def task_updates(websocket: WebSocket, task_id: str):
await websocket.accept()
try:
while True:
update = await get_task_update(task_id)
await websocket.send_json(update)
await asyncio.sleep(1)
except WebSocketDisconnect:
logger.info("Client disconnected", task_id=task_id)
# Client
async with websocket_connect(f"ws://localhost:8000/ws/tasks/{task_id}") as ws:
async for message in ws:
update = json.loads(message)
print(f"Task progress: {update['progress']}%")
Reasons:
- Bi-directional communication
- Lower overhead than polling
- Native browser support
- Streaming responses
- Real-time updates
Consequences
Positive
-
Simplicity:
- HTTP/REST is familiar to all developers
- No complex message broker to manage
- Standard debugging tools work
- Easy to test and mock
-
Performance:
- HTTP/2 multiplexing reduces overhead
- Direct calls minimize latency
- Redis pub/sub is very fast
- Connection pooling improves efficiency
-
Observability:
- HTTP requests easily traced
- Standard headers for correlation
- OpenTelemetry integration
- Request/response logging
-
Flexibility:
- Can add message broker later if needed
- Easy to switch between sync and async
- Support for multiple communication styles
- Cloud-native patterns
-
Reliability:
- HTTP retries well-understood
- Circuit breakers easy to implement
- Timeout handling straightforward
- Failure modes are clear
Negative
-
No Native Message Queue:
- No guaranteed delivery
- No persistent queuing
- Manual retry logic needed
- No dead letter queue
-
Pub/Sub Limitations:
- Messages not persisted
- No acknowledgment mechanism
- Subscribers must be online
- No ordering guarantees
-
Service Discovery:
- Requires DNS or service registry
- Hard-coded URLs in development
- More complex in multi-cluster setup
- Need health checks
-
Scalability Concerns:
- HTTP connection overhead at very high scale
- May need connection pooling tuning
- Pub/sub doesn't scale horizontally well
- Load balancing configuration required
Mitigation Strategies
-
Reliability:
- Implement retry logic with exponential backoff
- Use circuit breakers for external calls
- Add request timeouts
- Idempotent operations where possible
-
Message Durability:
- Use database for critical events
- Add audit log for important operations
- Implement task queue for background jobs
- Consider Kafka for high-volume events (future)
-
Service Discovery:
- Use Kubernetes DNS for production
- Environment variables for URLs
- Service mesh for advanced routing (future)
- Health checks and readiness probes
-
Performance:
- HTTP/2 for multiplexing
- Connection pooling
- Response compression
- Caching where appropriate
Alternatives Considered
1. gRPC for All Communication
Pros:
- Better performance than REST
- Strong typing with protobuf
- Bi-directional streaming
- Code generation
Cons:
- More complex than HTTP/REST
- Requires protobuf definitions
- Harder to debug
- Less universal tooling
- Steeper learning curve
Why Rejected: HTTP/REST simplicity outweighs gRPC performance benefits for our use case.
2. Message Broker (RabbitMQ/Kafka)
Pros:
- Guaranteed delivery
- Persistent queuing
- Complex routing
- Horizontal scaling
- Decoupling
Cons:
- Another component to manage
- More operational complexity
- Higher latency
- Resource overhead
- Overkill for current scale
Why Rejected: HTTP/REST with Redis pub/sub sufficient for current needs. Can add later if needed.
3. Service Mesh (Istio/Linkerd)
Pros:
- Advanced routing
- Automatic retries
- Circuit breakers
- mTLS security
- Observability
Cons:
- Complex to setup
- Resource overhead
- Steep learning curve
- Operational burden
- Overkill for current scale
Why Rejected: Too complex for initial deployment. May consider for larger deployments.
4. GraphQL for All APIs
Pros:
- Flexible queries
- Single endpoint
- Strong typing
- Batch requests
Cons:
- More complex than REST
- Caching harder
- N+1 query problem
- Learning curve
- Less suitable for internal APIs
Why Rejected: REST is simpler and sufficient for our internal APIs.
Implementation Guidelines
HTTP Best Practices
-
Use standard status codes:
- 200 OK: Success
- 201 Created: Resource created
- 400 Bad Request: Validation error
- 401 Unauthorized: Authentication required
- 403 Forbidden: Authorization failed
- 404 Not Found: Resource doesn't exist
- 429 Too Many Requests: Rate limit
- 500 Internal Server Error: Server error
- 503 Service Unavailable: Service down
-
Include correlation headers:
headers = { "X-Request-ID": request_id, "X-Correlation-ID": correlation_id, "Authorization": f"Bearer {token}" } -
Set appropriate timeouts:
timeout = httpx.Timeout( connect=5.0, # Connection timeout read=30.0, # Read timeout write=10.0, # Write timeout pool=5.0 # Pool timeout ) -
Use connection pooling:
client = httpx.AsyncClient( limits=httpx.Limits( max_keepalive_connections=20, max_connections=100 ) )
Event Publishing
-
Event schema:
{ "event_type": "task.completed", "timestamp": "2025-11-10T10:30:00Z", "source": "orchestrator", "data": { "task_id": "task-123", "status": "completed", "duration_ms": 1234 } } -
Channel naming:
- Format:
<domain>:<entity>:<action> - Examples:
events:task:completed,events:arm:registered
- Format:
References
- HTTP/2 Specification
- REST API Best Practices
- Redis Pub/Sub Documentation
- WebSocket Protocol
- OpenTelemetry
Last Review: 2025-11-10 Next Review: 2026-05-10 (6 months) Related ADRs: ADR-001, ADR-004, ADR-005