# Benchmarking
This document describes how to run, interpret, and compare sandbox performance benchmarks for Greywall.
## Quick Start
```bash
# Install dependencies
brew install hyperfine # macOS
# apt install hyperfine # Linux
go install golang.org/x/perf/cmd/benchstat@latest
# Run CLI benchmarks
./scripts/benchmark.sh
# Run Go microbenchmarks
go test -run=^$ -bench=. -benchmem ./internal/sandbox/...
```
## Goals
1. Quantify sandbox overhead on each platform (`sandboxed / unsandboxed` ratio)
2. Compare macOS (Seatbelt) vs Linux (bwrap+Landlock) overhead fairly
3. Attribute overhead to specific components (proxy startup, bridge setup, wrap generation)
4. Track regressions over time
## Benchmark Types
### Layer 1: CLI Benchmarks (`scripts/benchmark.sh`)
**What it measures**: the real-world agent cost of a full `greywall` invocation, including proxy startup, socat bridges (Linux), and sandbox-exec/bwrap setup.
This is the most realistic benchmark for understanding the cost of running agent commands through Greywall.
```bash
# Full benchmark suite
./scripts/benchmark.sh
# Quick mode (fewer runs)
./scripts/benchmark.sh -q
# Custom output directory
./scripts/benchmark.sh -o ./my-results
# Include network benchmarks (requires local server)
./scripts/benchmark.sh --network
```
#### Options
| Option | Description |
|--------|-------------|
| `-b, --binary PATH` | Path to greywall binary (default: ./greywall) |
| `-o, --output DIR` | Output directory (default: ./benchmarks) |
| `-n, --runs N` | Minimum runs per benchmark (default: 30) |
| `-q, --quick` | Quick mode: fewer runs, skip slow benchmarks |
| `--network` | Include network benchmarks |
### Layer 2: Go Microbenchmarks (`internal/sandbox/benchmark_test.go`)
**What it measures**: component-level overhead, isolating Manager initialization, WrapCommand generation, and command execution.
```bash
# Run all benchmarks
go test -run=^$ -bench=. -benchmem ./internal/sandbox/...
# Run specific benchmark
go test -run=^$ -bench=BenchmarkWarmSandbox -benchmem ./internal/sandbox/...
# Multiple runs for statistical analysis
go test -run=^$ -bench=. -benchmem -count=10 ./internal/sandbox/... > bench.txt
benchstat bench.txt
```
#### Available Benchmarks
| Benchmark | Description |
|-----------|-------------|
| `BenchmarkBaseline_*` | Unsandboxed command execution |
| `BenchmarkManagerInitialize` | Cold initialization (proxies + bridges) |
| `BenchmarkWrapCommand` | Command string construction only |
| `BenchmarkColdSandbox_*` | Full init + wrap + exec per iteration |
| `BenchmarkWarmSandbox_*` | Pre-initialized manager, just exec |
| `BenchmarkOverhead` | Grouped comparison of baseline vs sandbox |
### Layer 3: OS-Level Profiling
**What it measures**: kernel/system overhead, such as context switches, syscalls, and page faults.
#### Linux
```bash
# Quick syscall cost breakdown
strace -f -c ./greywall -- true
# Context switches, page faults
perf stat -- ./greywall -- true
# Full profiling (flamegraph-ready)
perf record -F 99 -g -- ./greywall -- git status
perf report
```
#### macOS
```bash
# Time Profiler via Instruments
xcrun xctrace record --template 'Time Profiler' --launch -- ./greywall -- true
# Quick call-stack snapshot
./greywall -- sleep 5 &
sample $! 5 -file sample.txt
```
## Interpreting Results
### Key Metric: Overhead Factor
```text
Overhead Factor = time(sandboxed) / time(unsandboxed)
```
Compare overhead factors across platforms, not absolute times, because hardware differences swamp absolute timings.
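As a quick sanity check, the factor can be computed directly from two measured times; the values here are the illustrative `true` timings from the example output (1.2 ms unsandboxed, 45 ms sandboxed), not fresh measurements:

```bash
# Illustrative timings in milliseconds
unsandboxed_ms=1.2
sandboxed_ms=45
awk -v s="$sandboxed_ms" -v u="$unsandboxed_ms" \
    'BEGIN { printf "overhead factor: %.1fx\n", s / u }'
# → overhead factor: 37.5x
```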
### Example Output
```text
Benchmark          Unsandboxed  Sandboxed  Overhead
true               1.2 ms       45 ms      37.5x
git status         15 ms        62 ms      4.1x
python -c 'pass'   25 ms        73 ms      2.9x
```
### What to Expect
| Workload | Linux Overhead | macOS Overhead | Notes |
|----------|----------------|----------------|-------|
| `true` | 180-360x | 8-10x | Dominated by cold start |
| `echo` | 150-300x | 6-8x | Similar to true |
| `python3 -c 'pass'` | 10-12x | 2-3x | Interpreter startup dominates |
| `git status` | 50-60x | 4-5x | Real I/O helps amortize |
| `rg` | 40-50x | 3-4x | Search I/O helps amortize |
The overhead factor decreases as the actual workload grows, because sandbox setup is a fixed cost. Linux overhead is significantly higher due to bwrap/socat setup.
## Cross-Platform Comparison
### Fair Comparison Approach
1. Run benchmarks on each platform independently
2. Compare overhead factors, not absolute times
3. Use the same greywall version and workloads
```bash
# On macOS
go test -run=^$ -bench=. -count=10 ./internal/sandbox/... > bench_macos.txt
# On Linux
go test -run=^$ -bench=. -count=10 ./internal/sandbox/... > bench_linux.txt
# Compare
benchstat bench_macos.txt bench_linux.txt
```
### Caveats
- macOS uses Seatbelt (`sandbox-exec`), a built-in, lightweight kernel sandbox
- Linux uses bwrap + Landlock, which requires socat bridges for network access and therefore incurs significant setup cost
- Linux cold start is ~10x slower than macOS due to bwrap/socat bridge setup
- Linux warm path is still ~5x slower than macOS - bwrap execution itself has overhead
- For long-running agents, this difference is negligible (one-time startup cost)
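The cross-platform ratios above follow directly from the cold/warm timings reported in the findings section (Linux 215 ms vs macOS 22 ms cold, 112 ms vs 20 ms warm); a quick check with awk:

```bash
# Ratios derived from the benchmark findings tables in this document
awk 'BEGIN {
  printf "cold-start ratio: %.1fx\n", 215 / 22   # → cold-start ratio: 9.8x
  printf "warm-path ratio:  %.1fx\n", 112 / 20   # → warm-path ratio:  5.6x
}'
```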
> [!TIP]
> Running Linux benchmarks inside a VM (Colima, Docker Desktop, etc.) inflates overhead due to virtualization. Use native Linux (bare metal or CI) for fair cross-platform comparison.
## GitHub Actions
Benchmarks can be run in CI via the workflow at `.github/workflows/benchmark.yml`:
```bash
# Trigger manually from GitHub UI: Actions > Benchmarks > Run workflow
# Or via gh CLI
gh workflow run benchmark.yml
```
Results are uploaded as artifacts and summarized in the workflow summary.
## Tips
### Reducing Variance
- Run with `--min-runs 50` or higher
- Close other applications
- Pin CPU frequency if possible (Linux: `cpupower frequency-set --governor performance`)
- Run multiple times and use benchstat for statistical analysis
### Profiling Hotspots
```bash
# CPU profile
go test -run=^$ -bench=BenchmarkWarmSandbox -cpuprofile=cpu.out ./internal/sandbox/...
go tool pprof -http=:8080 cpu.out
# Memory profile
go test -run=^$ -bench=BenchmarkWarmSandbox -memprofile=mem.out ./internal/sandbox/...
go tool pprof -http=:8080 mem.out
```
### Tracking Regressions
1. Run benchmarks before and after changes
2. Save results to files
3. Compare with benchstat
```bash
# Before
go test -run=^$ -bench=. -count=10 ./internal/sandbox/... > before.txt
# Make changes...
# After
go test -run=^$ -bench=. -count=10 ./internal/sandbox/... > after.txt
# Compare
benchstat before.txt after.txt
```
## Workload Categories
| Category | Commands | What it Stresses |
|----------|----------|------------------|
| **Spawn-only** | `true`, `echo` | Process spawn, wrapper overhead |
| **Interpreter** | `python3 -c`, `node -e` | Runtime startup under sandbox |
| **FS-heavy** | file creation, `rg` | Landlock/Seatbelt FS rules |
| **Network (local)** | `curl localhost` | Proxy forwarding overhead |
| **Real tools** | `git status` | Practical agent workloads |
## Benchmark Findings (2025-12-28)
Results from GitHub Actions CI runners (Linux: AMD EPYC 7763, macOS: Apple M1 Virtual).
### Manager Initialization
| Platform | `Manager.Initialize()` |
|----------|------------------------|
| Linux | 101.9 ms |
| macOS | 27.5 µs |
Linux initialization is ~3,700x slower because it must:
- Start HTTP + SOCKS proxies
- Create Unix socket bridges for socat
- Set up bwrap namespace configuration
macOS only generates a Seatbelt profile string (very cheap).
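The ~3,700x figure can be reproduced from the table above; note the unit mismatch (ms vs µs) must be normalized first:

```bash
# Manager.Initialize(): Linux 101.9 ms vs macOS 27.5 µs
# Convert ms → µs before dividing
awk 'BEGIN { printf "init ratio: %.0fx\n", (101.9 * 1000) / 27.5 }'
# → init ratio: 3705x
```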
### Cold Start Overhead (one `greywall` invocation per command)
| Workload | Linux | macOS |
|----------|-------|-------|
| `true` | 215 ms | 22 ms |
| Python | 124 ms | 33 ms |
| Git status | 114 ms | 25 ms |
This is the realistic cost for scripts running `greywall -c "command"` repeatedly.
### Warm Path Overhead (pre-initialized manager)
| Workload | Linux | macOS |
|----------|-------|-------|
| `true` | 112 ms | 20 ms |
| Python | 124 ms | 33 ms |
| Git status | 114 ms | 25 ms |
Even with proxies already running, Linux bwrap execution adds ~110ms overhead per command.
### Overhead Factors
| Workload | Linux Overhead | macOS Overhead |
|----------|----------------|----------------|
| `true` (cold) | ~360x | ~10x |
| `true` (warm) | ~187x | ~8x |
| Python (warm) | ~11x | ~2x |
| Git status (warm) | ~54x | ~4x |
Overhead decreases as the actual workload increases (sandbox setup is a fixed cost).
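The warm Linux factors can be reproduced from the warm-path timings above and the native per-command baselines listed under "Impact on Agent Usage" (0.6 ms for `true`, 2.1 ms for `git status`, 11 ms for Python on Linux):

```bash
# Warm Linux timings divided by native Linux baselines
awk 'BEGIN {
  printf "true:       %.0fx\n", 112 / 0.6   # ~187x
  printf "python:     %.0fx\n", 124 / 11    # ~11x
  printf "git status: %.0fx\n", 114 / 2.1   # ~54x
}'
```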
## Impact on Agent Usage
### Long-Running Agents (`greywall claude`, `greywall codex`)
For agents that run as a child process under greywall:
| Phase | Cost |
|-------|------|
| Startup (once) | Linux: ~215ms, macOS: ~22ms |
| Per tool call | Negligible (baseline fork+exec only) |
Child processes inherit the sandbox: no re-initialization, no WrapCommand overhead. The per-command cost is just normal process spawning:
| Command | Linux | macOS |
|---------|-------|-------|
| `true` | 0.6 ms | 2.3 ms |
| `git status` | 2.1 ms | 5.9 ms |
| Python script | 11 ms | 15 ms |
**Bottom line**: For `greywall <agent>` usage, sandbox overhead is a one-time startup cost. Tool calls inside the agent run at native speed.
### Per-Command Invocation (`greywall -c "command"`)
For scripts or CI running greywall per command:
| Session | Linux Cost | macOS Cost |
|---------|------------|------------|
| 1 command | 215 ms | 22 ms |
| 10 commands | 2.15 s | 220 ms |
| 50 commands | 10.75 s | 1.1 s |
Consider keeping the manager alive (daemon mode) or batching commands to reduce overhead.
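The session costs above scale linearly from the single-invocation cold-start cost; a one-liner to reproduce the table (assuming 215 ms per invocation on Linux and 22 ms on macOS):

```bash
# Cumulative cold-start cost for N per-command invocations, in milliseconds
for n in 1 10 50; do
  echo "$n commands: Linux $((n * 215)) ms, macOS $((n * 22)) ms"
done
```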
## Additional Notes
- `Manager.Initialize()` starts HTTP + SOCKS proxies; on Linux also creates socat bridges
- Cold start includes all initialization; hot path is just `WrapCommand + exec`
- `-m` (monitor mode) spawns additional monitoring processes, so benchmark it separately
- Keep benchmark workloads inside the repository; avoid `/tmp`, since Linux bwrap mounts `--tmpfs /tmp`
- `debug` mode changes logging, so always benchmark with debug off