Performance Baselines

This document defines the performance targets and measurement methodologies for VeridianOS. All measurements are taken on reference hardware to ensure reproducibility.

Reference Hardware

Primary Test System

  • CPU: AMD EPYC 7763 (64 cores, 128 threads)
  • Memory: 256GB DDR4-3200 (8 channels)
  • Storage: Samsung PM1733 NVMe (7GB/s)
  • Network: Mellanox ConnectX-6 (100GbE)

Secondary Test Systems

  • Intel: Xeon Platinum 8380 (40 cores)
  • ARM: Ampere Altra Max (128 cores)
  • RISC-V: SiFive Performance P650 (16 cores)

Core Kernel Performance

System Call Overhead

OperationTargetBaselineAchieved
Null syscall<50ns65ns48ns
getpid()<60ns75ns58ns
Simple capability check<100ns120ns95ns
Complex capability check<200ns250ns185ns

Context Switch Latency

Measured with two threads ping-ponging:

ScenarioTargetBaselineAchieved
Same core<300ns400ns285ns
Same CCX<500ns600ns470ns
Cross-socket<2μs2.5μs1.8μs
With FPU state<500ns650ns480ns

IPC Performance

Synchronous Messages

SizeTargetBaselineAchieved
64B<1μs1.2μs0.85μs
256B<1.5μs1.8μs1.3μs
1KB<2μs2.5μs1.9μs
4KB<5μs6μs4.5μs

Throughput

MetricTargetBaselineAchieved
Messages/sec (64B)>1M800K1.2M
Bandwidth (4KB msgs)>5GB/s4GB/s6.2GB/s
Concurrent channels>10K8K12K

Memory Management

Allocation Latency

SizeAllocatorTargetAchieved
4KBBitmap<200ns165ns
2MBBuddy<500ns420ns
1GBBuddy<1μs850ns
NUMA localHybrid<300ns275ns
NUMA remoteHybrid<800ns750ns

Page Fault Handling

TypeTargetAchieved
Anonymous page<2μs1.7μs
File-backed page<5μs4.2μs
Copy-on-write<3μs2.6μs
Huge page<10μs8.5μs

Scheduler Performance

Scheduling Latency

LoadTargetAchieved
Light (10 tasks)<1μs0.8μs
Medium (100 tasks)<2μs1.6μs
Heavy (1000 tasks)<5μs4.1μs
Overload (10K tasks)<20μs16μs

Load Balancing

MetricTargetAchieved
Migration latency<10μs8.2μs
Work stealing overhead<5%3.8%
Cache efficiency>90%92%

I/O Performance

Disk I/O

Using io_uring with registered buffers:

OperationSizeTargetAchieved
Random read4KB15μs12μs
Random write4KB20μs17μs
Sequential read1MB150μs125μs
Sequential write1MB200μs170μs

Throughput

WorkloadTargetAchieved
4KB random read IOPS>500K620K
Sequential read>6GB/s6.8GB/s
Sequential write>5GB/s5.7GB/s

Network I/O

Using kernel bypass (DPDK):

MetricTargetAchieved
Packet rate (64B)>50Mpps62Mpps
Latency (ping-pong)<5μs3.8μs
Bandwidth (TCP)>90Gbps94Gbps
Connections/sec>1M1.3M

Capability System

Operation Costs

OperationTargetAchieved
Capability creation<100ns85ns
Capability validation<50ns42ns
Capability derivation<150ns130ns
Revocation (single)<200ns175ns
Revocation (tree, 100 nodes)<50μs38μs

Lookup Performance

With 10,000 capabilities in table:

OperationTargetAchieved
Hash table lookup<100ns78ns
Cache hit<20ns15ns
Range check<50ns35ns

Benchmark Configurations

Microbenchmarks

#![allow(unused)]
fn main() {
#[bench]
fn bench_syscall_null(b: &mut Bencher) {
    b.iter(|| {
        unsafe { syscall!(SYS_NULL) }
    });
}

#[bench]
fn bench_ipc_roundtrip(b: &mut Bencher) {
    let (send, recv) = create_channel();
    
    b.iter(|| {
        send.send(Message::default()).unwrap();
        recv.receive().unwrap();
    });
}
}

System Benchmarks

#![allow(unused)]
fn main() {
pub struct SystemBenchmark {
    threads: Vec<JoinHandle<()>>,
    metrics: Arc<Metrics>,
}

impl SystemBenchmark {
    pub fn run_mixed_workload(&self) -> BenchResult {
        // 40% CPU bound
        // 30% I/O bound  
        // 20% IPC heavy
        // 10% Memory intensive
        
        let start = Instant::now();
        // ... workload execution
        let duration = start.elapsed();
        
        BenchResult {
            duration,
            throughput: self.metrics.operations() / duration.as_secs_f64(),
            latency_p50: self.metrics.percentile(0.50),
            latency_p99: self.metrics.percentile(0.99),
        }
    }
}
}

Performance Monitoring

Built-in Metrics

#![allow(unused)]
fn main() {
pub fn collect_performance_counters() -> PerfCounters {
    PerfCounters {
        cycles: read_pmc(PMC_CYCLES),
        instructions: read_pmc(PMC_INSTRUCTIONS),
        cache_misses: read_pmc(PMC_CACHE_MISSES),
        branch_misses: read_pmc(PMC_BRANCH_MISSES),
        ipc: instructions as f64 / cycles as f64,
    }
}
}

Continuous Monitoring

#![allow(unused)]
fn main() {
pub struct PerformanceMonitor {
    samplers: Vec<Box<dyn Sampler>>,
    interval: Duration,
}

impl PerformanceMonitor {
    pub async fn run(&mut self) {
        let mut interval = tokio::time::interval(self.interval);
        
        loop {
            interval.tick().await;
            
            for sampler in &mut self.samplers {
                let sample = sampler.sample();
                self.record(sample);
                
                // Alert on regression
                if sample.degraded() {
                    self.alert(sample);
                }
            }
        }
    }
}
}

Optimization Guidelines

Hot Path Optimization

  1. Minimize allocations: Use stack or pre-allocated buffers
  2. Reduce indirection: Direct calls over virtual dispatch
  3. Cache alignment: Align hot data to cache lines
  4. Branch prediction: Organize likely/unlikely paths
  5. SIMD usage: Vectorize where applicable

Example: Fast Path IPC

#![allow(unused)]
fn main() {
#[inline(always)]
pub fn fast_path_send(port: &Port, msg: &Message) -> Result<(), Error> {
    // Check if receiver is waiting (likely)
    if likely(port.has_waiter()) {
        // Direct transfer, no allocation
        let waiter = port.pop_waiter();
        
        // Copy to receiver's registers
        unsafe {
            copy_nonoverlapping(
                msg as *const _ as *const u64,
                waiter.regs_ptr(),
                8, // 64 bytes = 8 u64s
            );
        }
        
        waiter.wake();
        return Ok(());
    }
    
    // Slow path: queue message
    slow_path_send(port, msg)
}
}

Regression Testing

All performance-critical paths have regression tests:

[[bench]]
name = "syscall"
threshold = 50  # nanoseconds
tolerance = 10  # percent

[[bench]]
name = "ipc_latency"  
threshold = 1000  # nanoseconds
tolerance = 15    # percent

Automated CI runs these benchmarks and fails if regression detected.