Performance Baselines

This document defines the performance targets and measurement methodologies for VeridianOS. All measurements are taken on reference hardware to ensure reproducibility.

Reference Hardware

Primary Test System

CPU: AMD EPYC 7763 (64 cores, 128 threads)
Memory: 256GB DDR4-3200 (8 channels)
Storage: Samsung PM1733 NVMe (7GB/s)
Network: Mellanox ConnectX-6 (100GbE)

Secondary Test Systems

Intel: Xeon Platinum 8380 (40 cores)
ARM: Ampere Altra Max (128 cores)
RISC-V: SiFive Performance P650 (16 cores)

Core Kernel Performance

System Call Overhead

Operation	Target	Baseline	Achieved
Null syscall	<50ns	65ns	48ns
getpid()	<60ns	75ns	58ns
Simple capability check	<100ns	120ns	95ns
Complex capability check	<200ns	250ns	185ns

Context Switch Latency

Measured with two threads ping-ponging:

Scenario	Target	Baseline	Achieved
Same core	<300ns	400ns	285ns
Same CCX	<500ns	600ns	470ns
Cross-socket	<2μs	2.5μs	1.8μs
With FPU state	<500ns	650ns	480ns

IPC Performance

Synchronous Messages

Size	Target	Baseline	Achieved
64B	<1μs	1.2μs	0.85μs
256B	<1.5μs	1.8μs	1.3μs
1KB	<2μs	2.5μs	1.9μs
4KB	<5μs	6μs	4.5μs

Throughput

Metric	Target	Baseline	Achieved
Messages/sec (64B)	>1M	800K	1.2M
Bandwidth (4KB msgs)	>5GB/s	4GB/s	6.2GB/s
Concurrent channels	>10K	8K	12K

Memory Management

Allocation Latency

Size	Allocator	Target	Achieved
4KB	Bitmap	<200ns	165ns
2MB	Buddy	<500ns	420ns
1GB	Buddy	<1μs	850ns
NUMA local	Hybrid	<300ns	275ns
NUMA remote	Hybrid	<800ns	750ns

Page Fault Handling

Type	Target	Achieved
Anonymous page	<2μs	1.7μs
File-backed page	<5μs	4.2μs
Copy-on-write	<3μs	2.6μs
Huge page	<10μs	8.5μs

Scheduler Performance

Scheduling Latency

Load	Target	Achieved
Light (10 tasks)	<1μs	0.8μs
Medium (100 tasks)	<2μs	1.6μs
Heavy (1000 tasks)	<5μs	4.1μs
Overload (10K tasks)	<20μs	16μs

Load Balancing

Metric	Target	Achieved
Migration latency	<10μs	8.2μs
Work stealing overhead	<5%	3.8%
Cache efficiency	>90%	92%

I/O Performance

Disk I/O

Using io_uring with registered buffers:

Operation	Size	Target	Achieved
Random read	4KB	15μs	12μs
Random write	4KB	20μs	17μs
Sequential read	1MB	150μs	125μs
Sequential write	1MB	200μs	170μs

Throughput

Workload	Target	Achieved
4KB random read IOPS	>500K	620K
Sequential read	>6GB/s	6.8GB/s
Sequential write	>5GB/s	5.7GB/s

Network I/O

Using kernel bypass (DPDK):

Metric	Target	Achieved
Packet rate (64B)	>50Mpps	62Mpps
Latency (ping-pong)	<5μs	3.8μs
Bandwidth (TCP)	>90Gbps	94Gbps
Connections/sec	>1M	1.3M

Capability System

Operation Costs

Operation	Target	Achieved
Capability creation	<100ns	85ns
Capability validation	<50ns	42ns
Capability derivation	<150ns	130ns
Revocation (single)	<200ns	175ns
Revocation (tree, 100 nodes)	<50μs	38μs

Lookup Performance

With 10,000 capabilities in table:

Operation	Target	Achieved
Hash table lookup	<100ns	78ns
Cache hit	<20ns	15ns
Range check	<50ns	35ns

Benchmark Configurations

Microbenchmarks

#![allow(unused)]
fn main() {
#[bench]
fn bench_syscall_null(b: &mut Bencher) {
    b.iter(|| {
        unsafe { syscall!(SYS_NULL) }
    });
}

#[bench]
fn bench_ipc_roundtrip(b: &mut Bencher) {
    let (send, recv) = create_channel();
    
    b.iter(|| {
        send.send(Message::default()).unwrap();
        recv.receive().unwrap();
    });
}
}

System Benchmarks

#![allow(unused)]
fn main() {
pub struct SystemBenchmark {
    threads: Vec<JoinHandle<()>>,
    metrics: Arc<Metrics>,
}

impl SystemBenchmark {
    pub fn run_mixed_workload(&self) -> BenchResult {
        // 40% CPU bound
        // 30% I/O bound  
        // 20% IPC heavy
        // 10% Memory intensive
        
        let start = Instant::now();
        // ... workload execution
        let duration = start.elapsed();
        
        BenchResult {
            duration,
            throughput: self.metrics.operations() / duration.as_secs_f64(),
            latency_p50: self.metrics.percentile(0.50),
            latency_p99: self.metrics.percentile(0.99),
        }
    }
}
}

Performance Monitoring

Built-in Metrics

#![allow(unused)]
fn main() {
pub fn collect_performance_counters() -> PerfCounters {
    PerfCounters {
        cycles: read_pmc(PMC_CYCLES),
        instructions: read_pmc(PMC_INSTRUCTIONS),
        cache_misses: read_pmc(PMC_CACHE_MISSES),
        branch_misses: read_pmc(PMC_BRANCH_MISSES),
        ipc: instructions as f64 / cycles as f64,
    }
}
}

Continuous Monitoring

#![allow(unused)]
fn main() {
pub struct PerformanceMonitor {
    samplers: Vec<Box<dyn Sampler>>,
    interval: Duration,
}

impl PerformanceMonitor {
    pub async fn run(&mut self) {
        let mut interval = tokio::time::interval(self.interval);
        
        loop {
            interval.tick().await;
            
            for sampler in &mut self.samplers {
                let sample = sampler.sample();
                self.record(sample);
                
                // Alert on regression
                if sample.degraded() {
                    self.alert(sample);
                }
            }
        }
    }
}
}

Optimization Guidelines

Hot Path Optimization

Minimize allocations: Use stack or pre-allocated buffers
Reduce indirection: Direct calls over virtual dispatch
Cache alignment: Align hot data to cache lines
Branch prediction: Organize likely/unlikely paths
SIMD usage: Vectorize where applicable

Example: Fast Path IPC

#![allow(unused)]
fn main() {
#[inline(always)]
pub fn fast_path_send(port: &Port, msg: &Message) -> Result<(), Error> {
    // Check if receiver is waiting (likely)
    if likely(port.has_waiter()) {
        // Direct transfer, no allocation
        let waiter = port.pop_waiter();
        
        // Copy to receiver's registers
        unsafe {
            copy_nonoverlapping(
                msg as *const _ as *const u64,
                waiter.regs_ptr(),
                8, // 64 bytes = 8 u64s
            );
        }
        
        waiter.wake();
        return Ok(());
    }
    
    // Slow path: queue message
    slow_path_send(port, msg)
}
}

Regression Testing

All performance-critical paths have regression tests:

[[bench]]
name = "syscall"
threshold = 50  # nanoseconds
tolerance = 10  # percent

[[bench]]
name = "ipc_latency"  
threshold = 1000  # nanoseconds
tolerance = 15    # percent

Automated CI runs these benchmarks and fails if regression detected.

VeridianOS Developer Guide