# Benchmark harness
CodeLoop ships its own benchmark harness so anyone can reproduce the comparison numbers we publish. The harness lives at `benchmarks/buggy-commits-50/` and is the public version of the runbook at `docs/BENCHMARK_RUNBOOK.md`.
## What's in the box
- 50 fixture slots across 5 categories (logic errors, UI regressions, API contract violations, state management bugs, security vulnerabilities). Each fixture is a repo state with a deliberate bug introduced by `bug.patch` and a known-good fix in `fix.patch`.
- 3 tools under test: `codeloop` (deterministic), `bugbot` (LLM-as-judge, subscription required), and `vanilla` (no-op control).
- `setup.ts` / `runner.ts` / `report.ts` — provision fixtures, dispatch each tool's command per fixture × tool, and aggregate per-tool pass rate, mean confidence, p50/p95 duration, and false-positive/false-negative counts.
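A fixture slot can be pictured as a small manifest entry. The field names below are assumptions for illustration, not the harness's actual schema:

```typescript
// Hypothetical shape of one fixture entry; the real manifest may differ.
interface Fixture {
  id: string;                 // e.g. "logic-07" (illustrative naming)
  category:
    | "logic-errors"
    | "ui-regressions"
    | "api-contract-violations"
    | "state-management-bugs"
    | "security-vulnerabilities";
  bugPatch: string;           // path to bug.patch
  fixPatch: string;           // path to fix.patch
  skipped: boolean;           // true until the slot holds a real bug
}

// A runner would iterate fixtures × tools, skipping unfinished slots.
function runnableFixtures(manifest: Fixture[]): Fixture[] {
  return manifest.filter((f) => !f.skipped);
}
```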
## Local quickstart
```shell
cd benchmarks/buggy-commits-50
npm install
npm run setup
npx vitest run   # manifest sanity + Test 41.7 LLM-purity
node --experimental-strip-types runner.ts --tool codeloop --seed 42
node --experimental-strip-types report.ts
```

The runner writes one row per fixture × tool to `results/run-<ts>/results.jsonl`. The report writer picks the most recent run unless `--run results/run-<ts>` is passed.
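Because the output is plain JSONL, the aggregation step is easy to sketch. The row field names below (`detected`, `confidence`, `durationMs`) are assumptions about the schema, not the harness's actual format:

```typescript
// Assumed shape of one results.jsonl row (one per fixture × tool).
interface ResultRow {
  fixture: string;
  tool: string;
  detected: boolean;
  confidence: number;   // 0..1
  durationMs: number;
}

interface ToolStats {
  passRate: number;
  meanConfidence: number;
  p50Ms: number;
  p95Ms: number;
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, Math.min(sorted.length - 1, idx))];
}

// Group rows by tool and compute the per-tool summary the report prints.
function aggregate(rows: ResultRow[]): Map<string, ToolStats> {
  const byTool = new Map<string, ResultRow[]>();
  for (const row of rows) {
    const bucket = byTool.get(row.tool);
    if (bucket) bucket.push(row);
    else byTool.set(row.tool, [row]);
  }
  const stats = new Map<string, ToolStats>();
  for (const [tool, toolRows] of byTool) {
    const durations = toolRows.map((r) => r.durationMs).sort((a, b) => a - b);
    stats.set(tool, {
      passRate: toolRows.filter((r) => r.detected).length / toolRows.length,
      meanConfidence:
        toolRows.reduce((sum, r) => sum + r.confidence, 0) / toolRows.length,
      p50Ms: percentile(durations, 50),
      p95Ms: percentile(durations, 95),
    });
  }
  return stats;
}
```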
## CI gating
The workflow at `.github/workflows/benchmark.yml` only runs when the repository variable `BENCHMARK_ENABLED=true` OR when it is dispatched manually with `run_anyway=true`. This keeps PR feedback fast and avoids burning CI minutes until the fixtures contain real bugs. Schedule: weekly, Sunday 06:00 UTC. Results are uploaded as a 30-day artifact named `buggy-commits-50-results-<run-id>`.
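The gate described above might be expressed as a job-level `if:` like the following. This is a sketch under assumptions — the actual workflow's trigger names and job layout may differ:

```yaml
# Assumed skeleton of .github/workflows/benchmark.yml's gating logic.
on:
  workflow_dispatch:
    inputs:
      run_anyway:
        type: boolean
        default: false
  schedule:
    - cron: "0 6 * * 0"   # weekly, Sunday 06:00 UTC

jobs:
  benchmark:
    # Runs only when the repo variable is set, or on an explicit manual override.
    if: vars.BENCHMARK_ENABLED == 'true' || inputs.run_anyway == true
    runs-on: ubuntu-latest
```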
## Network-purity guarantee
A vitest case in `benchmarks/buggy-commits-50/__tests__/harness.test.ts` walks every file in the benchmark tree and asserts that no file matches `/openai|anthropic/i`. This is the deterministic, repository-internal version of `rg 'openai|anthropic' benchmarks/`, and it's enforced on every PR via the harness regression suite. CodeLoop's verification path stays LLM-free, by construction.
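The core of such a purity check is a recursive walk plus a regex test. A minimal sketch, independent of the actual test file's structure:

```typescript
import { readdirSync, readFileSync, statSync } from "node:fs";
import { join } from "node:path";

// Recursively collect every file path under a directory.
function walk(dir: string): string[] {
  const files: string[] = [];
  for (const name of readdirSync(dir)) {
    const path = join(dir, name);
    if (statSync(path).isDirectory()) files.push(...walk(path));
    else files.push(path);
  }
  return files;
}

// Return the files whose contents match the banned pattern.
// An empty result means the tree is free of LLM-provider references.
function impureFiles(root: string, banned = /openai|anthropic/i): string[] {
  return walk(root).filter((f) => banned.test(readFileSync(f, "utf8")));
}
```

A vitest case would then assert `impureFiles(benchmarkRoot)` is empty.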
## Bugbot subscription — flipping it on
The harness already speaks to `cursor-bugbot scan --json`. When the project owner acquires a Bugbot subscription, the steps are: set the `CURSOR_BUGBOT_TOKEN` repo secret, dispatch the workflow with `tool=all`, and inspect the side-by-side report. See `docs/BENCHMARK_RUNBOOK.md` for the full procedure.
## Acceptance for §43
The §43 long-form benchmark post is gated by: (1) all 50 fixtures moved out of `skipped: true` with a real `bug.patch`, and (2) a manual workflow dispatch with `tool=all` producing a report with ≥ 80% detected on `codeloop`, ≤ 10% false positives, and p95 < 2 minutes per fixture. Until then, the harness is the public proof that the comparison will be reproducible.
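The numeric half of that gate is mechanical to check. A sketch, assuming a per-tool report shape (the field names are hypothetical):

```typescript
// Assumed summary fields for one tool; thresholds come from the §43 gate.
interface ToolReport {
  detectedRate: number;      // fraction of fixtures where the bug was found
  falsePositiveRate: number; // fraction of clean states wrongly flagged
  p95Ms: number;             // 95th-percentile duration per fixture, in ms
}

function meetsAcceptance(r: ToolReport): boolean {
  return (
    r.detectedRate >= 0.8 &&      // ≥ 80% detected
    r.falsePositiveRate <= 0.1 && // ≤ 10% false positives
    r.p95Ms < 2 * 60_000          // p95 under 2 minutes
  );
}
```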