Phase 1: Success Criteria & Acceptance Metrics
Version: 1.0 Date: 2025-11-12 Phase: Phase 1 - Proof of Concept Sign-Off Required: Tech Lead, QA Lead, Security Engineer, CTO
Executive Summary
Phase 1 is considered COMPLETE when ALL criteria in this document are met. No partial completion - all acceptance criteria must pass.
Categories:
- Functional: Do the components work?
- Performance: Do they meet latency/throughput targets?
- Quality: Are they well-tested and documented?
- Security: Are they secure against known attacks?
- Cost: Are we within budget and cost-efficient?
- Operational: Can we deploy and monitor them?
Pass Threshold: 95% of criteria must pass (allowance for 5% non-critical items to be deferred to Phase 2)
Functional Criteria (FC)
FC-001: Reflex Layer Operational
Priority: CRITICAL
Measurement: Health check returns 200 OK
Acceptance: ✅ GET /health returns {"status": "healthy", "redis": "connected"}
Verification Steps:
- Start Reflex Layer:
docker-compose up reflex-layer - Wait 10 seconds
- Test:
curl http://localhost:8001/health - Verify JSON response with status=healthy
Owner: Rust Engineer
FC-002: Reflex Layer Processes Requests
Priority: CRITICAL Measurement: POST /api/v1/reflex/process returns valid response Acceptance: ✅ Request with text succeeds, returns detection results
Test Case:
curl -X POST http://localhost:8001/api/v1/reflex/process \
-H "Content-Type: application/json" \
-d '{
"text": "My SSN is 123-45-6789 and email is test@example.com",
"check_pii": true,
"check_injection": true
}'
# Expected Response:
{
"safe": false,
"pii_detected": [
{"type": "ssn", "value": "***-**-****", "confidence": 0.98}
],
"injections": [],
"cached": false,
"latency_ms": 5.2
}
Owner: Rust Engineer
FC-003: Orchestrator Accepts Tasks
Priority: CRITICAL Measurement: POST /api/v1/tasks returns task_id Acceptance: ✅ Task submitted successfully, task_id (UUID4) returned
Test Case:
curl -X POST http://localhost:8000/api/v1/tasks \
-H "Content-Type: application/json" \
-d '{
"goal": "Echo hello world",
"constraints": ["Complete in <30 seconds"],
"context": {},
"acceptance_criteria": ["Output contains 'hello world'"],
"budget": {
"max_tokens": 5000,
"max_cost_usd": 0.10,
"max_time_seconds": 60
}
}'
# Expected Response:
{
"task_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "pending",
"message": "Task accepted and queued for execution"
}
Owner: Python Engineer (Senior)
FC-004: Orchestrator Returns Task Status
Priority: CRITICAL Measurement: GET /api/v1/tasks/{task_id} returns current status Acceptance: ✅ Status endpoint returns task state (pending/in_progress/completed/failed)
Test Case:
# After submitting task above
curl http://localhost:8000/api/v1/tasks/550e8400-e29b-41d4-a716-446655440000
# Expected Response (if complete):
{
"task_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "completed",
"goal": "Echo hello world",
"result": {
"output": "hello world",
"metadata": {
"steps_executed": 2,
"total_duration_ms": 3420,
"cost_usd": 0.002
}
},
"created_at": "2025-11-12T10:00:00Z",
"updated_at": "2025-11-12T10:00:04Z"
}
Owner: Python Engineer (Senior)
FC-005: Planner Generates Valid Plans
Priority: CRITICAL Measurement: POST /api/v1/plan returns plan with 3-7 steps Acceptance: ✅ Plan has 3-7 steps, dependencies valid (DAG)
Test Case:
curl -X POST http://localhost:8002/api/v1/plan \
-H "Content-Type: application/json" \
-d '{
"goal": "List files in /tmp and count them",
"constraints": ["Use only allowed commands"],
"context": {}
}'
# Expected Response:
{
"plan": [
{
"step": 1,
"action": "List files in /tmp directory",
"required_arm": "executor",
"acceptance_criteria": ["Output shows file list"],
"depends_on": [],
"estimated_cost_tier": 1,
"estimated_duration_seconds": 5
},
{
"step": 2,
"action": "Count number of files",
"required_arm": "executor",
"acceptance_criteria": ["Output shows numeric count"],
"depends_on": [1],
"estimated_cost_tier": 1,
"estimated_duration_seconds": 5
}
],
"rationale": "Two-step plan: list files, then count them",
"confidence": 0.92,
"total_estimated_duration": 10,
"complexity_score": 0.2
}
Owner: Python Engineer (Senior)
FC-006: Executor Runs Allowed Commands
Priority: CRITICAL Measurement: POST /api/v1/execute runs echo/ls/grep commands successfully Acceptance: ✅ Command executes in sandbox, returns output and provenance
Test Case:
curl -X POST http://localhost:8003/api/v1/execute \
-H "Content-Type: application/json" \
-d '{
"action_type": "shell",
"command": "echo",
"args": ["Hello from Executor"],
"timeout_seconds": 10
}'
# Expected Response:
{
"success": true,
"output": "Hello from Executor\n",
"error": null,
"provenance": {
"command_hash": "a1b2c3d4e5f6...",
"timestamp": "2025-11-12T10:05:00Z",
"executor_version": "1.0.0",
"execution_duration_ms": 120,
"exit_code": 0,
"resource_usage": {
"cpu_time_ms": 5,
"max_memory_bytes": 1048576
}
}
}
Owner: Rust Engineer
FC-007: Executor Blocks Disallowed Commands
Priority: CRITICAL
Measurement: POST /api/v1/execute rejects rm, sudo, nc
Acceptance: ✅ Returns HTTP 403 Forbidden with clear error message
Test Case:
curl -X POST http://localhost:8003/api/v1/execute \
-H "Content-Type: application/json" \
-d '{
"action_type": "shell",
"command": "rm",
"args": ["-rf", "/"],
"timeout_seconds": 10
}'
# Expected Response (403 Forbidden):
{
"success": false,
"error": "Command 'rm' is not in the allowlist. Allowed commands: echo, cat, ls, grep, curl, wget, python3",
"output": null,
"provenance": null
}
Owner: Rust Engineer
FC-008: End-to-End Task Execution
Priority: CRITICAL Measurement: Submit task to Orchestrator, receive result Acceptance: ✅ Task flows through Reflex → Orchestrator → Planner → Executor → Result
Test Case:
# Submit task
TASK_ID=$(curl -s -X POST http://localhost:8000/api/v1/tasks \
-H "Content-Type: application/json" \
-d '{
"goal": "Echo the current date",
"constraints": ["Complete in <30 seconds"],
"context": {},
"acceptance_criteria": ["Output contains date"],
"budget": {"max_tokens": 5000, "max_cost_usd": 0.10, "max_time_seconds": 60}
}' | jq -r '.task_id')
# Wait for completion
sleep 10
# Check status
curl http://localhost:8000/api/v1/tasks/$TASK_ID | jq '.status'
# Expected: "completed"
curl http://localhost:8000/api/v1/tasks/$TASK_ID | jq '.result.output'
# Expected: Contains current date (e.g., "Tue Nov 12 10:15:00 UTC 2025")
Owner: QA Engineer
Performance Criteria (PC)
PC-001: Reflex Layer Throughput
Priority: HIGH Measurement: k6 load test achieves >10,000 req/sec sustained Acceptance: ✅ 10k req/sec for 60 seconds without errors
Test Script (tests/performance/k6-reflex.js):
import http from 'k6/http';
import { check } from 'k6';
export let options = {
vus: 100, // 100 virtual users
duration: '60s',
};
export default function() {
const payload = JSON.stringify({
text: 'Test message',
check_pii: true,
check_injection: true
});
const res = http.post('http://localhost:8001/api/v1/reflex/process', payload, {
headers: { 'Content-Type': 'application/json' },
});
check(res, {
'status is 200': (r) => r.status === 200,
'latency < 10ms': (r) => r.timings.duration < 10,
});
}
Expected Output:
scenarios: (100.00%) 1 scenario, 100 max VUs, 1m30s max duration
data_received..................: 15 MB 250 kB/s
data_sent......................: 12 MB 200 kB/s
http_req_duration..............: avg=8.2ms p(95)=9.8ms p(99)=9.95ms
http_reqs......................: 610000 10166/s
vus............................: 100 min=100 max=100
Pass Criteria: http_reqs ≥ 10,000/s, p(95) latency < 10ms
Owner: Rust Engineer + QA Engineer
PC-002: Orchestrator Latency (P99)
Priority: HIGH Measurement: P99 latency <30s for 2-step tasks Acceptance: ✅ 99% of tasks complete in <30s
Test: Submit 100 simple 2-step tasks, measure completion time
Test Script:
import asyncio
import time
import httpx
async def submit_task(client, task_num):
start = time.time()
response = await client.post('http://localhost:8000/api/v1/tasks', json={
'goal': f'Echo task {task_num}',
'constraints': [],
'context': {},
'acceptance_criteria': [],
'budget': {'max_tokens': 5000, 'max_cost_usd': 0.10, 'max_time_seconds': 60}
})
task_id = response.json()['task_id']
# Poll for completion
while True:
status_response = await client.get(f'http://localhost:8000/api/v1/tasks/{task_id}')
status = status_response.json()['status']
if status in ['completed', 'failed']:
return time.time() - start
await asyncio.sleep(0.5)
async def main():
async with httpx.AsyncClient() as client:
tasks = [submit_task(client, i) for i in range(100)]
durations = await asyncio.gather(*tasks)
durations.sort()
p50 = durations[49]
p95 = durations[94]
p99 = durations[98]
print(f'P50: {p50:.2f}s, P95: {p95:.2f}s, P99: {p99:.2f}s')
assert p99 < 30.0, f"P99 latency {p99:.2f}s exceeds 30s target"
asyncio.run(main())
Pass Criteria: P50 <10s, P95 <25s, P99 <30s
Owner: QA Engineer
PC-003: Planner Success Rate
Priority: HIGH Measurement: 90%+ of 30 test tasks produce valid plans Acceptance: ✅ ≥27/30 test scenarios pass
Test Dataset: 30 diverse tasks in tests/planner/test_scenarios.json
- 10 simple (1-2 steps)
- 10 medium (3-5 steps)
- 10 complex (5-7 steps)
Test Script:
import pytest
@pytest.mark.parametrize('scenario', load_test_scenarios())
def test_planner_scenario(scenario):
response = requests.post('http://localhost:8002/api/v1/plan', json=scenario)
assert response.status_code == 200
plan = response.json()
assert 3 <= len(plan['plan']) <= 7
assert validate_dependencies(plan['plan']) # DAG check
assert plan['confidence'] >= 0.5
Pass Criteria: ≥90% test pass rate (27/30)
Owner: Python Engineer (Senior)
Quality Criteria (QC)
QC-001: Unit Test Coverage (Python)
Priority: HIGH Measurement: pytest-cov shows >85% coverage Acceptance: ✅ All Python services have >85% line coverage
Test Command:
# Orchestrator
cd services/orchestrator
pytest --cov=app --cov-report=term --cov-report=html tests/
# Planner Arm
cd services/arms/planner
pytest --cov=app --cov-report=term --cov-report=html tests/
# Expected Output:
# Name Stmts Miss Cover
# ----------------------------------------
# app/__init__.py 10 0 100%
# app/main.py 150 15 90%
# app/models.py 80 5 94%
# app/services/*.py 200 20 90%
# ----------------------------------------
# TOTAL 440 40 91%
Pass Criteria: TOTAL coverage ≥85% for each service
Owner: Python Engineer (Senior) + QA Engineer
QC-002: Unit Test Coverage (Rust)
Priority: HIGH Measurement: cargo tarpaulin shows >80% coverage Acceptance: ✅ All Rust services have >80% line coverage
Test Command:
# Reflex Layer
cd services/reflex-layer
cargo tarpaulin --out Xml --out Html --timeout 300
# Executor Arm
cd services/arms/executor
cargo tarpaulin --out Xml --out Html --timeout 300
# Expected Output:
# || Tested/Total Lines:
# || services/reflex-layer/src/main.rs: 45/50
# || services/reflex-layer/src/pii.rs: 120/140
# || services/reflex-layer/src/injection.rs: 80/95
# || services/reflex-layer/src/cache.rs: 60/70
# ||
# || 82.14% coverage, 305/355 lines covered
Pass Criteria: ≥80% line coverage for each service
Owner: Rust Engineer + QA Engineer
QC-003: All Health Checks Pass
Priority: CRITICAL
Measurement: docker-compose health checks show all services healthy
Acceptance: ✅ 6/6 services show healthy state
Test Command:
docker-compose up -d
sleep 30 # Wait for startup
docker-compose ps
# Expected Output:
# NAME STATUS PORTS
# postgres Up 30 seconds (healthy) 5432/tcp
# redis Up 30 seconds (healthy) 6379/tcp
# reflex-layer Up 30 seconds (healthy) 8001/tcp
# orchestrator Up 30 seconds (healthy) 8000/tcp
# planner-arm Up 30 seconds (healthy) 8002/tcp
# executor-arm Up 30 seconds (healthy) 8003/tcp
Pass Criteria: All 6 services show "(healthy)" status
Owner: DevOps Engineer
QC-004: Documentation Complete
Priority: MEDIUM Measurement: All README files exist and are >200 lines Acceptance: ✅ Each service has comprehensive README
Checklist:
-
services/reflex-layer/README.md(setup, config, examples) -
services/orchestrator/README.md(architecture, API, troubleshooting) -
services/arms/planner/README.md(system prompt, testing) -
services/arms/executor/README.md(security model, allowlist) -
infrastructure/docker-compose/README.md(quickstart, env vars) -
docs/guides/quickstart.md(15-minute getting started)
Owner: All engineers (each responsible for their service)
Security Criteria (SC)
SC-001: No Container Escapes
Priority: CRITICAL Measurement: Penetration test attempts to escape fail Acceptance: ✅ 0/10 escape attempts succeed
Penetration Test Suite (tests/security/container-escape-tests.sh):
#!/bin/bash
# Test 1: Mount host filesystem
attempt_escape "mount -t proc proc /proc"
# Test 2: Access Docker socket
attempt_escape "curl --unix-socket /var/run/docker.sock http://localhost/containers/json"
# Test 3: Privilege escalation
attempt_escape "sudo su"
# Test 4: Network access to unauthorized host
attempt_escape "curl http://internal-admin.example.com"
# Test 5-10: Additional escape vectors...
# Expected: All return 403 Forbidden or command rejected
Pass Criteria: 10/10 tests fail gracefully (no escapes)
Owner: Security Engineer
SC-002: No SQL Injection
Priority: HIGH Measurement: SQL injection tests fail Acceptance: ✅ Parameterized queries used, no injection possible
Test Case:
# Attempt SQL injection in task goal
curl -X POST http://localhost:8000/api/v1/tasks \
-H "Content-Type": application/json" \
-d '{
"goal": "Echo'; DROP TABLE tasks; --",
...
}'
# Expected: Task accepted, goal sanitized, no database impact
# Verify: Database 'tasks' table still exists
Pass Criteria: Database unaffected, task goal escaped
Owner: Python Engineer (Senior)
SC-003: Seccomp Profile Active
Priority: HIGH Measurement: Executor container has seccomp profile applied Acceptance: ✅ Restricted syscalls blocked
Test Command:
# Inspect executor container
docker inspect executor-arm | jq '.[0].HostConfig.SecurityOpt'
# Expected:
# [
# "seccomp=/path/to/octollm-seccomp.json"
# ]
# Test syscall blocking
docker exec executor-arm syscall-test
# Expected: Blocked syscalls (socket, mount, etc.) fail with EPERM
Pass Criteria: Seccomp profile active, dangerous syscalls blocked
Owner: Security Engineer
Cost Criteria (CC)
CC-001: LLM API Costs <$100
Priority: MEDIUM Measurement: Track token usage, calculate cost Acceptance: ✅ Phase 1 total LLM cost <$100
Tracking:
# Prometheus metric
llm_tokens_used_total{model="gpt-3.5-turbo",service="planner"}
# Cost calculation
gpt_35_input_tokens * $0.0015 / 1000 + gpt_35_output_tokens * $0.002 / 1000
gpt_4_input_tokens * $0.03 / 1000 + gpt_4_output_tokens * $0.06 / 1000
Target:
- GPT-3.5: 1.5M tokens × $0.002/1k = $3
- GPT-4: 1M tokens × $0.04/1k = $40
- Claude: 300k tokens × $0.015/1k = $4.50
- Total: ~$47.50 (well under $100)
Owner: Python Engineer (Senior)
CC-002: Cost per Task <50% of Direct GPT-4
Priority: HIGH Measurement: Average cost per task vs baseline Acceptance: ✅ OctoLLM <50% cost of direct GPT-4 call
Calculation:
Direct GPT-4:
- 2k input tokens × $0.03/1k = $0.06
- 500 output tokens × $0.06/1k = $0.03
- Total: $0.09 per task
OctoLLM (with GPT-3.5 planner + caching):
- Planner: 1.5k tokens × $0.002/1k = $0.003
- Executor: 0 LLM tokens (shell command)
- Cache hit (40%): $0.00
- Average: ~$0.025 per task
Savings: 72% reduction vs direct GPT-4
Pass Criteria: Average cost <$0.045 per task (50% of $0.09)
Owner: Python Engineer (Senior)
Operational Criteria (OC)
OC-001: Docker Compose Starts Cleanly
Priority: CRITICAL
Measurement: docker-compose up succeeds without errors
Acceptance: ✅ All 6 services start in <60 seconds
Test Command:
cd infrastructure/docker-compose
docker-compose down -v # Clean slate
time docker-compose up -d
# Expected:
# Creating network "octollm_default" ... done
# Creating volume "octollm_postgres_data" ... done
# Creating volume "octollm_redis_data" ... done
# Creating octollm_postgres_1 ... done
# Creating octollm_redis_1 ... done
# Creating octollm_reflex-layer_1 ... done
# Creating octollm_orchestrator_1 ... done
# Creating octollm_planner-arm_1 ... done
# Creating octollm_executor-arm_1 ... done
#
# real 0m45.321s
Pass Criteria: All services start in <60s, no errors
Owner: DevOps Engineer
OC-002: Metrics Exposed
Priority: MEDIUM Measurement: All services expose /metrics endpoint Acceptance: ✅ Prometheus can scrape all 4 components
Test Command:
curl http://localhost:8001/metrics | grep -c "^# HELP" # Reflex
curl http://localhost:8000/metrics | grep -c "^# HELP" # Orchestrator
curl http://localhost:8002/metrics | grep -c "^# HELP" # Planner
curl http://localhost:8003/metrics | grep -c "^# HELP" # Executor
# Expected: Each returns >10 metric definitions
Pass Criteria: All endpoints return Prometheus-formatted metrics
Owner: All engineers (each service)
OC-003: Demo Video Published
Priority: LOW Measurement: 5-minute demo video uploaded Acceptance: ✅ Video accessible, shows successful task execution
Content Checklist:
- (0:00-0:30) Architecture overview (diagram)
-
(0:30-1:00)
docker-compose updemo - (1:00-3:30) Submit 3 tasks (simple, medium, complex)
- (3:30-4:30) Show Grafana dashboard, logs
- (4:30-5:00) Phase 2 preview
Platform: YouTube (unlisted link) or Vimeo (password-protected)
Owner: DevOps Engineer
Final Sign-Off Checklist
Before declaring Phase 1 COMPLETE, verify:
Sprint Completion
- Sprint 1.1: Reflex Layer complete (26/26 subtasks)
- Sprint 1.2: Orchestrator MVP complete (32/32 subtasks)
- Sprint 1.3: Planner Arm complete (18/18 subtasks)
- Sprint 1.4: Executor Arm complete (28/28 subtasks)
- Sprint 1.5: Integration complete (15/15 subtasks)
Criteria Summary
- Functional Criteria: 8/8 passing (100%)
- Performance Criteria: 3/3 passing (100%)
- Quality Criteria: 4/4 passing (100%)
- Security Criteria: 3/3 passing (100%)
- Cost Criteria: 2/2 passing (100%)
- Operational Criteria: 3/3 passing (100%)
Total: 23/23 criteria passing (100%)
Stakeholder Sign-Off
- Tech Lead: Confirms all technical criteria met
- QA Lead: Confirms all test criteria met
- Security Engineer: Confirms all security criteria met
- CTO: Approves Phase 1 completion, authorizes Phase 2 start
Documentation
- All README files complete
- CHANGELOG.md updated with Phase 1 release notes
- Phase 1 retrospective held
- Phase 2 planning meeting scheduled
Phase 1 Success Declaration
Date: [To be filled] Declared By: [Tech Lead Name] Verified By: [QA Lead Name], [Security Engineer Name] Approved By: [CTO Name]
Phase 1 of OctoLLM is hereby declared COMPLETE and SUCCESSFUL. All acceptance criteria have been met or exceeded. The system is ready for Phase 2 development.
Key Achievements:
- 4 production-ready components (Reflex, Orchestrator, Planner, Executor)
- 119 subtasks completed across 5 sprints
- 340 hours of engineering effort
- <$100 LLM API costs
- 0 critical security vulnerabilities
-
90% test coverage
- Docker Compose deployment operational
- Demo video published
Phase 2 Authorization: APPROVED, start date [To be filled]
Document Version: 1.0 Last Updated: 2025-11-12 Next Review: Phase 1 Final Review Meeting Owner: Tech Lead Sign-Off Required: Tech Lead, QA Lead, Security Engineer, CTO