# Benchmarking
This document describes how to run, interpret, and compare sandbox performance benchmarks for Greywall.
## Quick Start
```bash
# Install dependencies
brew install hyperfine # macOS
# apt install hyperfine # Linux
go install golang.org/x/perf/cmd/benchstat@latest
# Run CLI benchmarks
./scripts/benchmark.sh
# Run Go microbenchmarks
go test -run=^$ -bench=. -benchmem ./internal/sandbox/...
```
## Goals
1. Quantify sandbox overhead on each platform (`sandboxed / unsandboxed` ratio)
2. Compare macOS (Seatbelt) vs Linux (bwrap+Landlock) overhead fairly
3. Attribute overhead to specific components (proxy startup, bridge setup, wrap generation)
4. Track regressions over time
## Benchmark Types
### Layer 1: CLI Benchmarks (`scripts/benchmark.sh`)
**What it measures**: the real-world agent cost of a full `greywall` invocation, including proxy startup, socat bridges (Linux), and sandbox-exec/bwrap setup.
This is the most realistic benchmark for understanding the cost of running agent commands through Greywall.
```bash
# Full benchmark suite
./scripts/benchmark.sh
# Quick mode (fewer runs)
./scripts/benchmark.sh -q
# Custom output directory
./scripts/benchmark.sh -o ./my-results
# Include network benchmarks (requires local server)
./scripts/benchmark.sh --network
```
#### Options
| Option | Description |
|--------|-------------|
| `-b, --binary PATH` | Path to greywall binary (default: ./greywall) |
| `-o, --output DIR` | Output directory (default: ./benchmarks) |
| `-n, --runs N` | Minimum runs per benchmark (default: 30) |
| `-q, --quick` | Quick mode: fewer runs, skip slow benchmarks |
| `--network` | Include network benchmarks |
### Layer 2: Go Microbenchmarks (`internal/sandbox/benchmark_test.go`)
**What it measures**: component-level overhead, isolating Manager initialization, WrapCommand generation, and command execution.
```bash
# Run all benchmarks
go test -run=^$ -bench=. -benchmem ./internal/sandbox/...
# Run specific benchmark
go test -run=^$ -bench=BenchmarkWarmSandbox -benchmem ./internal/sandbox/...
# Multiple runs for statistical analysis
go test -run=^$ -bench=. -benchmem -count=10 ./internal/sandbox/... > bench.txt
benchstat bench.txt
```
#### Available Benchmarks
| Benchmark | Description |
|-----------|-------------|
| `BenchmarkBaseline_*` | Unsandboxed command execution |
| `BenchmarkManagerInitialize` | Cold initialization (proxies + bridges) |
| `BenchmarkWrapCommand` | Command string construction only |
| `BenchmarkColdSandbox_*` | Full init + wrap + exec per iteration |
| `BenchmarkWarmSandbox_*` | Pre-initialized manager, just exec |
| `BenchmarkOverhead` | Grouped comparison of baseline vs sandbox |
### Layer 3: OS-Level Profiling
**What it measures**: kernel/system overhead, such as context switches, syscalls, and page faults.
#### Linux
```bash
# Quick syscall cost breakdown
strace -f -c ./greywall -- true
# Context switches, page faults
perf stat -- ./greywall -- true
# Full profiling (flamegraph-ready)
perf record -F 99 -g -- ./greywall -- git status
perf report
```
#### macOS
```bash
# Time Profiler via Instruments
xcrun xctrace record --template 'Time Profiler' --launch -- ./greywall -- true
# Quick call-stack snapshot
./greywall -- sleep 5 &
sample $! 5 -file sample.txt
```
## Interpreting Results
### Key Metric: Overhead Factor
```text
Overhead Factor = time(sandboxed) / time(unsandboxed)
```
Compare overhead factors across platforms, not absolute times, because hardware differences swamp absolute timings.
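As a quick sanity check, the factor can be computed directly from two measured times; the values here are the illustrative `true` timings from the example output (1.2 ms unsandboxed, 45 ms sandboxed), not fresh measurements:

```bash
# Illustrative timings in milliseconds
unsandboxed_ms=1.2
sandboxed_ms=45
awk -v s="$sandboxed_ms" -v u="$unsandboxed_ms" \
    'BEGIN { printf "overhead factor: %.1fx\n", s / u }'
# → overhead factor: 37.5x
```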
### Example Output
```text
Benchmark          Unsandboxed  Sandboxed  Overhead
true               1.2 ms       45 ms      37.5x
git status         15 ms        62 ms      4.1x
python -c 'pass'   25 ms        73 ms      2.9x
```
### What to Expect
| Workload | Linux Overhead | macOS Overhead | Notes |
|----------|----------------|----------------|-------|
| `true` | 180-360x | 8-10x | Dominated by cold start |
| `echo` | 150-300x | 6-8x | Similar to true |
| `python3 -c 'pass'` | 10-12x | 2-3x | Interpreter startup dominates |
| `git status` | 50-60x | 4-5x | Real I/O helps amortize |
| `rg` | 40-50x | 3-4x | Search I/O helps amortize |
The overhead factor decreases as the actual workload grows, because sandbox setup is a fixed cost. Linux overhead is significantly higher due to bwrap/socat setup.
## Cross-Platform Comparison
### Fair Comparison Approach
1. Run benchmarks on each platform independently
2. Compare overhead factors, not absolute times
3. Use the same greywall version and workloads
```bash
# On macOS
go test -run=^$ -bench=. -count=10 ./internal/sandbox/... > bench_macos.txt
# On Linux
go test -run=^$ -bench=. -count=10 ./internal/sandbox/... > bench_linux.txt
# Compare
benchstat bench_macos.txt bench_linux.txt
```
### Caveats
- macOS uses Seatbelt (`sandbox-exec`), a built-in, lightweight kernel sandbox
- Linux uses bwrap + Landlock, which requires socat bridges for network access and therefore incurs significant setup cost
- Linux cold start is ~10x slower than macOS due to bwrap/socat bridge setup
- Linux warm path is still ~5x slower than macOS - bwrap execution itself has overhead
- For long-running agents, this difference is negligible (one-time startup cost)
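The cross-platform ratios above follow directly from the cold/warm timings reported in the findings section (Linux 215 ms vs macOS 22 ms cold, 112 ms vs 20 ms warm); a quick check with awk:

```bash
# Ratios derived from the benchmark findings tables in this document
awk 'BEGIN {
  printf "cold-start ratio: %.1fx\n", 215 / 22   # → cold-start ratio: 9.8x
  printf "warm-path ratio:  %.1fx\n", 112 / 20   # → warm-path ratio:  5.6x
}'
```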
> [!TIP]
> Running Linux benchmarks inside a VM (Colima, Docker Desktop, etc.) inflates overhead due to virtualization. Use native Linux (bare metal or CI) for fair cross-platform comparison.
## GitHub Actions
Benchmarks can be run in CI via the workflow at `.github/workflows/benchmark.yml`:
```bash
# Trigger manually from GitHub UI: Actions > Benchmarks > Run workflow
# Or via gh CLI
gh workflow run benchmark.yml
```
Results are uploaded as artifacts and summarized in the workflow summary.
## Tips
### Reducing Variance
- Run with `--min-runs 50` or higher
- Close other applications
- Pin CPU frequency if possible (Linux: `cpupower frequency-set --governor performance`)
- Run multiple times and use benchstat for statistical analysis
### Profiling Hotspots
```bash
# CPU profile
go test -run=^$ -bench=BenchmarkWarmSandbox -cpuprofile=cpu.out ./internal/sandbox/...
go tool pprof -http=:8080 cpu.out
# Memory profile
go test -run=^$ -bench=BenchmarkWarmSandbox -memprofile=mem.out ./internal/sandbox/...
go tool pprof -http=:8080 mem.out
```
### Tracking Regressions
1. Run benchmarks before and after changes
2. Save results to files
3. Compare with benchstat
```bash
# Before
go test -run=^$ -bench=. -count=10 ./internal/sandbox/... > before.txt
# Make changes...
# After
go test -run=^$ -bench=. -count=10 ./internal/sandbox/... > after.txt
# Compare
benchstat before.txt after.txt
```
## Workload Categories
| Category | Commands | What it Stresses |
|----------|----------|------------------|
| **Spawn-only** | `true`, `echo` | Process spawn, wrapper overhead |
| **Interpreter** | `python3 -c`, `node -e` | Runtime startup under sandbox |
| **FS-heavy** | file creation, `rg` | Landlock/Seatbelt FS rules |
| **Network (local)** | `curl localhost` | Proxy forwarding overhead |
| **Real tools** | `git status` | Practical agent workloads |
## Benchmark Findings (2025-12-28)
Results from GitHub Actions CI runners (Linux: AMD EPYC 7763, macOS: Apple M1 Virtual).
### Manager Initialization
| Platform | `Manager.Initialize()` |
|----------|------------------------|
| Linux | 101.9 ms |
| macOS | 27.5 µs |
Linux initialization is ~3,700x slower because it must:
- Start HTTP + SOCKS proxies
- Create Unix socket bridges for socat
- Set up bwrap namespace configuration
macOS only generates a Seatbelt profile string (very cheap).
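The ~3,700x figure can be reproduced from the table above; note the unit mismatch (ms vs µs) must be normalized first:

```bash
# Manager.Initialize(): Linux 101.9 ms vs macOS 27.5 µs
# Convert ms → µs before dividing
awk 'BEGIN { printf "init ratio: %.0fx\n", (101.9 * 1000) / 27.5 }'
# → init ratio: 3705x
```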
### Cold Start Overhead (one `greywall` invocation per command)
| Workload | Linux | macOS |
|----------|-------|-------|
| `true` | 215 ms | 22 ms |
| Python | 124 ms | 33 ms |
| Git status | 114 ms | 25 ms |
This is the realistic cost for scripts running `greywall -c "command"` repeatedly.
### Warm Path Overhead (pre-initialized manager)
| Workload | Linux | macOS |
|----------|-------|-------|
| `true` | 112 ms | 20 ms |
| Python | 124 ms | 33 ms |
| Git status | 114 ms | 25 ms |
Even with proxies already running, Linux bwrap execution adds ~110ms overhead per command.
### Overhead Factors
| Workload | Linux Overhead | macOS Overhead |
|----------|----------------|----------------|
| `true` (cold) | ~360x | ~10x |
| `true` (warm) | ~187x | ~8x |
| Python (warm) | ~11x | ~2x |
| Git status (warm) | ~54x | ~4x |
Overhead decreases as the actual workload increases (sandbox setup is a fixed cost).
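The warm Linux factors can be reproduced from the warm-path timings above and the native per-command baselines listed under "Impact on Agent Usage" (0.6 ms for `true`, 2.1 ms for `git status`, 11 ms for Python on Linux):

```bash
# Warm Linux timings divided by native Linux baselines
awk 'BEGIN {
  printf "true:       %.0fx\n", 112 / 0.6   # ~187x
  printf "python:     %.0fx\n", 124 / 11    # ~11x
  printf "git status: %.0fx\n", 114 / 2.1   # ~54x
}'
```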
## Impact on Agent Usage
### Long-Running Agents (`greywall claude`, `greywall codex`)
For agents that run as a child process under greywall:
| Phase | Cost |
|-------|------|
| Startup (once) | Linux: ~215ms, macOS: ~22ms |
| Per tool call | Negligible (baseline fork+exec only) |
Child processes inherit the sandbox: no re-initialization, no WrapCommand overhead. The per-command cost is just normal process spawning:
| Command | Linux | macOS |
|---------|-------|-------|
| `true` | 0.6 ms | 2.3 ms |
| `git status` | 2.1 ms | 5.9 ms |
| Python script | 11 ms | 15 ms |
**Bottom line**: For `greywall <agent>` usage, sandbox overhead is a one-time startup cost. Tool calls inside the agent run at native speed.
### Per-Command Invocation (`greywall -c "command"`)
For scripts or CI running greywall per command:
| Session | Linux Cost | macOS Cost |
|---------|------------|------------|
| 1 command | 215 ms | 22 ms |
| 10 commands | 2.15 s | 220 ms |
| 50 commands | 10.75 s | 1.1 s |
Consider keeping the manager alive (daemon mode) or batching commands to reduce overhead.
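The session costs above scale linearly from the single-invocation cold-start cost; a one-liner to reproduce the table (assuming 215 ms per invocation on Linux and 22 ms on macOS):

```bash
# Cumulative cold-start cost for N per-command invocations, in milliseconds
for n in 1 10 50; do
  echo "$n commands: Linux $((n * 215)) ms, macOS $((n * 22)) ms"
done
```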
## Additional Notes
- `Manager.Initialize()` starts HTTP + SOCKS proxies; on Linux also creates socat bridges
- Cold start includes all initialization; hot path is just `WrapCommand + exec`
- `-m` (monitor mode) spawns additional monitoring processes, so benchmark it separately
- Keep benchmark workloads inside the repository; avoid `/tmp`, since Linux bwrap mounts `--tmpfs /tmp`
- `debug` mode changes logging, so always benchmark with debug off