# Benchmarking This document describes how to run, interpret, and compare sandbox performance benchmarks for Greywall. ## Quick Start ```bash # Install dependencies brew install hyperfine # macOS # apt install hyperfine # Linux go install golang.org/x/perf/cmd/benchstat@latest # Run CLI benchmarks ./scripts/benchmark.sh # Run Go microbenchmarks go test -run=^$ -bench=. -benchmem ./internal/sandbox/... ``` ## Goals 1. Quantify sandbox overhead on each platform (`sandboxed / unsandboxed` ratio) 2. Compare macOS (Seatbelt) vs Linux (bwrap+Landlock) overhead fairly 3. Attribute overhead to specific components (proxy startup, bridge setup, wrap generation) 4. Track regressions over time ## Benchmark Types ### Layer 1: CLI Benchmarks (`scripts/benchmark.sh`) **What it measures**: Real-world agent cost - full `greywall` invocation including proxy startup, socat bridges (Linux), and sandbox-exec/bwrap setup. This is the most realistic benchmark for understanding the cost of running agent commands through Greywall. ```bash # Full benchmark suite ./scripts/benchmark.sh # Quick mode (fewer runs) ./scripts/benchmark.sh -q # Custom output directory ./scripts/benchmark.sh -o ./my-results # Include network benchmarks (requires local server) ./scripts/benchmark.sh --network ``` #### Options | Option | Description | |--------|-------------| | `-b, --binary PATH` | Path to greywall binary (default: ./greywall) | | `-o, --output DIR` | Output directory (default: ./benchmarks) | | `-n, --runs N` | Minimum runs per benchmark (default: 30) | | `-q, --quick` | Quick mode: fewer runs, skip slow benchmarks | | `--network` | Include network benchmarks | ### Layer 2: Go Microbenchmarks (`internal/sandbox/benchmark_test.go`) **What it measures**: Component-level overhead - isolates Manager initialization, WrapCommand generation, and execution. ```bash # Run all benchmarks go test -run=^$ -bench=. -benchmem ./internal/sandbox/... # Run specific benchmark go test -run=^$ -bench=BenchmarkWarmSandbox -benchmem ./internal/sandbox/... # Multiple runs for statistical analysis go test -run=^$ -bench=. -benchmem -count=10 ./internal/sandbox/... > bench.txt benchstat bench.txt ``` #### Available Benchmarks | Benchmark | Description | |-----------|-------------| | `BenchmarkBaseline_*` | Unsandboxed command execution | | `BenchmarkManagerInitialize` | Cold initialization (proxies + bridges) | | `BenchmarkWrapCommand` | Command string construction only | | `BenchmarkColdSandbox_*` | Full init + wrap + exec per iteration | | `BenchmarkWarmSandbox_*` | Pre-initialized manager, just exec | | `BenchmarkOverhead` | Grouped comparison of baseline vs sandbox | ### Layer 3: OS-Level Profiling **What it measures**: Kernel/system overhead - context switches, syscalls, page faults. #### Linux ```bash # Quick syscall cost breakdown strace -f -c ./greywall -- true # Context switches, page faults perf stat -- ./greywall -- true # Full profiling (flamegraph-ready) perf record -F 99 -g -- ./greywall -- git status perf report ``` #### macOS ```bash # Time Profiler via Instruments xcrun xctrace record --template 'Time Profiler' --launch -- ./greywall -- true # Quick call-stack snapshot ./greywall -- sleep 5 & sample $! 5 -file sample.txt ``` ## Interpreting Results ### Key Metric: Overhead Factor ```text Overhead Factor = time(sandboxed) / time(unsandboxed) ``` Compare overhead factors across platforms, not absolute times, because hardware differences swamp absolute timings. ### Example Output ```text Benchmark Unsandboxed Sandboxed Overhead true 1.2 ms 45 ms 37.5x git status 15 ms 62 ms 4.1x python -c 'pass' 25 ms 73 ms 2.9x ``` ### What to Expect | Workload | Linux Overhead | macOS Overhead | Notes | |----------|----------------|----------------|-------| | `true` | 180-360x | 8-10x | Dominated by cold start | | `echo` | 150-300x | 6-8x | Similar to true | | `python3 -c 'pass'` | 10-12x | 2-3x | Interpreter startup dominates | | `git status` | 50-60x | 4-5x | Real I/O helps amortize | | `rg` | 40-50x | 3-4x | Search I/O helps amortize | The overhead factor decreases as the actual workload increases (because sandbox setup is fixed cost). Linux overhead is significantly higher due to bwrap/socat setup. ## Cross-Platform Comparison ### Fair Comparison Approach 1. Run benchmarks on each platform independently 2. Compare overhead factors, not absolute times 3. Use the same greywall version and workloads ```bash # On macOS go test -run=^$ -bench=. -count=10 ./internal/sandbox/... > bench_macos.txt # On Linux go test -run=^$ -bench=. -count=10 ./internal/sandbox/... > bench_linux.txt # Compare benchstat bench_macos.txt bench_linux.txt ``` ### Caveats - macOS uses Seatbelt (sandbox-exec) - built-in, lightweight kernel sandbox - Linux uses bwrap + Landlock, this creates socat bridges for network, incurring significant setup cost - Linux cold start is ~10x slower than macOS due to bwrap/socat bridge setup - Linux warm path is still ~5x slower than macOS - bwrap execution itself has overhead - For long-running agents, this difference is negligible (one-time startup cost) > [!TIP] > Running Linux benchmarks inside a VM (Colima, Docker Desktop, etc.) inflates overhead due to virtualization. Use native Linux (bare metal or CI) for fair cross-platform comparison. ## GitHub Actions Benchmarks can be run in CI via the workflow at `.github/workflows/benchmark.yml`: ```bash # Trigger manually from GitHub UI: Actions > Benchmarks > Run workflow # Or via gh CLI gh workflow run benchmark.yml ``` Results are uploaded as artifacts and summarized in the workflow summary. ## Tips ### Reducing Variance - Run with `--min-runs 50` or higher - Close other applications - Pin CPU frequency if possible (Linux: `cpupower frequency-set --governor performance`) - Run multiple times and use benchstat for statistical analysis ### Profiling Hotspots ```bash # CPU profile go test -run=^$ -bench=BenchmarkWarmSandbox -cpuprofile=cpu.out ./internal/sandbox/... go tool pprof -http=:8080 cpu.out # Memory profile go test -run=^$ -bench=BenchmarkWarmSandbox -memprofile=mem.out ./internal/sandbox/... go tool pprof -http=:8080 mem.out ``` ### Tracking Regressions 1. Run benchmarks before and after changes 2. Save results to files 3. Compare with benchstat ```bash # Before go test -run=^$ -bench=. -count=10 ./internal/sandbox/... > before.txt # Make changes... # After go test -run=^$ -bench=. -count=10 ./internal/sandbox/... > after.txt # Compare benchstat before.txt after.txt ``` ## Workload Categories | Category | Commands | What it Stresses | |----------|----------|------------------| | **Spawn-only** | `true`, `echo` | Process spawn, wrapper overhead | | **Interpreter** | `python3 -c`, `node -e` | Runtime startup under sandbox | | **FS-heavy** | file creation, `rg` | Landlock/Seatbelt FS rules | | **Network (local)** | `curl localhost` | Proxy forwarding overhead | | **Real tools** | `git status` | Practical agent workloads | ## Benchmark Findings (12/28/2025) Results from GitHub Actions CI runners (Linux: AMD EPYC 7763, macOS: Apple M1 Virtual). ### Manager Initialization | Platform | `Manager.Initialize()` | |----------|------------------------| | Linux | 101.9 ms | | macOS | 27.5 µs | Linux initialization is ~3,700x slower because it must: - Start HTTP + SOCKS proxies - Create Unix socket bridges for socat - Set up bwrap namespace configuration macOS only generates a Seatbelt profile string (very cheap). ### Cold Start Overhead (one `greywall` invocation per command) | Workload | Linux | macOS | |----------|-------|-------| | `true` | 215 ms | 22 ms | | Python | 124 ms | 33 ms | | Git status | 114 ms | 25 ms | This is the realistic cost for scripts running `greywall -c "command"` repeatedly. ### Warm Path Overhead (pre-initialized manager) | Workload | Linux | macOS | |----------|-------|-------| | `true` | 112 ms | 20 ms | | Python | 124 ms | 33 ms | | Git status | 114 ms | 25 ms | Even with proxies already running, Linux bwrap execution adds ~110ms overhead per command. ### Overhead Factors | Workload | Linux Overhead | macOS Overhead | |----------|----------------|----------------| | `true` (cold) | ~360x | ~10x | | `true` (warm) | ~187x | ~8x | | Python (warm) | ~11x | ~2x | | Git status (warm) | ~54x | ~4x | Overhead decreases as the actual workload increases (sandbox setup is fixed cost). ## Impact on Agent Usage ### Long-Running Agents (`greywall claude`, `greywall codex`) For agents that run as a child process under greywall: | Phase | Cost | |-------|------| | Startup (once) | Linux: ~215ms, macOS: ~22ms | | Per tool call | Negligible (baseline fork+exec only) | Child processes inherit the sandbox - no re-initialization, no WrapCommand overhead. The per-command cost is just normal process spawning: | Command | Linux | macOS | |---------|-------|-------| | `true` | 0.6 ms | 2.3 ms | | `git status` | 2.1 ms | 5.9 ms | | Python script | 11 ms | 15 ms | **Bottom line**: For `greywall ` usage, sandbox overhead is a one-time startup cost. Tool calls inside the agent run at native speed. ### Per-Command Invocation (`greywall -c "command"`) For scripts or CI running greywall per command: | Session | Linux Cost | macOS Cost | |---------|------------|------------| | 1 command | 215 ms | 22 ms | | 10 commands | 2.15 s | 220 ms | | 50 commands | 10.75 s | 1.1 s | Consider keeping the manager alive (daemon mode) or batching commands to reduce overhead. ## Additional Notes - `Manager.Initialize()` starts HTTP + SOCKS proxies; on Linux also creates socat bridges - Cold start includes all initialization; hot path is just `WrapCommand + exec` - `-m` (monitor mode) spawns additional monitoring processes, so we'll have to benchmark separately - Keep workloads under the repo - avoid `/tmp` since Linux bwrap does `--tmpfs /tmp` - `debug` mode changes logging, so always benchmark with debug off