# Benchmark harness
CodeLoop ships its own benchmark harness so anyone can reproduce the comparison numbers we publish. The harness lives at `benchmarks/buggy-commits-50/` and is the public version of the runbook at `docs/BENCHMARK_RUNBOOK.md`.
## What's in the box
- 50 fixture slots across 5 categories (logic errors, UI regressions, API contract violations, state management bugs, security vulnerabilities). Each fixture is a repo state with a deliberate bug introduced by `bug.patch` and a known-good fix in `fix.patch`.
- 3 tools under test: `codeloop` (deterministic), `bugbot` (LLM-as-judge, subscription required), and `vanilla` (no-op control).
- `setup.ts` / `runner.ts` / `report.ts` — provision fixtures, dispatch each tool's command per fixture × tool, and aggregate per-tool pass rate, mean confidence, p50/p95 duration, and false-positive/false-negative counts.
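A fixture slot can be pictured as a small manifest entry. The field names below are assumptions for illustration, not the harness's actual schema:

```typescript
// Hypothetical shape of one fixture entry; the real manifest may differ.
interface Fixture {
  id: string;                 // e.g. "logic-07" (illustrative naming)
  category:
    | "logic-errors"
    | "ui-regressions"
    | "api-contract-violations"
    | "state-management-bugs"
    | "security-vulnerabilities";
  bugPatch: string;           // path to bug.patch
  fixPatch: string;           // path to fix.patch
  skipped: boolean;           // true until the slot holds a real bug
}

// A runner would iterate fixtures × tools, skipping unfinished slots.
function runnableFixtures(manifest: Fixture[]): Fixture[] {
  return manifest.filter((f) => !f.skipped);
}
```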
## Local quickstart
```shell
cd benchmarks/buggy-commits-50
npm install
npm run setup
npx vitest run   # manifest sanity + Test 41.7 LLM-purity
node --experimental-strip-types runner.ts --tool codeloop --seed 42
node --experimental-strip-types report.ts
```

The runner writes one row per fixture × tool to `results/run-<ts>/results.jsonl`. The report writer picks the most recent run unless `--run results/run-<ts>` is passed.
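Because the output is plain JSONL, the aggregation step is easy to sketch. The row field names below (`detected`, `confidence`, `durationMs`) are assumptions about the schema, not the harness's actual format:

```typescript
// Assumed shape of one results.jsonl row (one per fixture × tool).
interface ResultRow {
  fixture: string;
  tool: string;
  detected: boolean;
  confidence: number;   // 0..1
  durationMs: number;
}

interface ToolStats {
  passRate: number;
  meanConfidence: number;
  p50Ms: number;
  p95Ms: number;
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, Math.min(sorted.length - 1, idx))];
}

// Group rows by tool and compute the per-tool summary the report prints.
function aggregate(rows: ResultRow[]): Map<string, ToolStats> {
  const byTool = new Map<string, ResultRow[]>();
  for (const row of rows) {
    const bucket = byTool.get(row.tool);
    if (bucket) bucket.push(row);
    else byTool.set(row.tool, [row]);
  }
  const stats = new Map<string, ToolStats>();
  for (const [tool, toolRows] of byTool) {
    const durations = toolRows.map((r) => r.durationMs).sort((a, b) => a - b);
    stats.set(tool, {
      passRate: toolRows.filter((r) => r.detected).length / toolRows.length,
      meanConfidence:
        toolRows.reduce((sum, r) => sum + r.confidence, 0) / toolRows.length,
      p50Ms: percentile(durations, 50),
      p95Ms: percentile(durations, 95),
    });
  }
  return stats;
}
```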
## CI gating
The workflow at `.github/workflows/benchmark.yml` only runs when the repository variable `BENCHMARK_ENABLED=true` OR when it is dispatched manually with `run_anyway=true`. This keeps PR feedback fast and avoids burning CI minutes until the fixtures contain real bugs. Schedule: weekly, Sunday 06:00 UTC. Results are uploaded as a 30-day artifact named `buggy-commits-50-results-<run-id>`.
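The gate described above might be expressed as a job-level `if:` like the following. This is a sketch under assumptions — the actual workflow's trigger names and job layout may differ:

```yaml
# Assumed skeleton of .github/workflows/benchmark.yml's gating logic.
on:
  workflow_dispatch:
    inputs:
      run_anyway:
        type: boolean
        default: false
  schedule:
    - cron: "0 6 * * 0"   # weekly, Sunday 06:00 UTC

jobs:
  benchmark:
    # Runs only when the repo variable is set, or on an explicit manual override.
    if: vars.BENCHMARK_ENABLED == 'true' || inputs.run_anyway == true
    runs-on: ubuntu-latest
```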
## Network-purity guarantee
A vitest case in `benchmarks/buggy-commits-50/__tests__/harness.test.ts` walks every file in the benchmark tree and asserts that no file matches `/openai|anthropic/i`. This is the deterministic, repository-internal version of `rg 'openai|anthropic' benchmarks/`, and it's enforced on every PR via the harness regression suite. CodeLoop's verification path stays LLM-free, by construction.
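The core of such a purity check is a recursive walk plus a regex test. A minimal sketch, independent of the actual test file's structure:

```typescript
import { readdirSync, readFileSync, statSync } from "node:fs";
import { join } from "node:path";

// Recursively collect every file path under a directory.
function walk(dir: string): string[] {
  const files: string[] = [];
  for (const name of readdirSync(dir)) {
    const path = join(dir, name);
    if (statSync(path).isDirectory()) files.push(...walk(path));
    else files.push(path);
  }
  return files;
}

// Return the files whose contents match the banned pattern.
// An empty result means the tree is free of LLM-provider references.
function impureFiles(root: string, banned = /openai|anthropic/i): string[] {
  return walk(root).filter((f) => banned.test(readFileSync(f, "utf8")));
}
```

A vitest case would then assert `impureFiles(benchmarkRoot)` is empty.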
## Bugbot subscription — flipping it on
The harness already speaks to `cursor-bugbot scan --json`. When the project owner acquires a Bugbot subscription, the steps are: set the `CURSOR_BUGBOT_TOKEN` repo secret, dispatch the workflow with `tool=all`, and inspect the side-by-side report. See `docs/BENCHMARK_RUNBOOK.md` for the full procedure.
## Acceptance for §43
The §43 long-form benchmark post is gated by: (1) all 50 fixtures moved out of `skipped: true` with a real `bug.patch`, and (2) a manual workflow dispatch with `tool=all` producing a report with ≥ 80% detected on `codeloop`, ≤ 10% false positives, and p95 < 2 minutes per fixture. Until then, the harness is the public proof that the comparison will be reproducible.
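The numeric half of that gate is mechanical to check. A sketch, assuming a per-tool report shape (the field names are hypothetical):

```typescript
// Assumed summary fields for one tool; thresholds come from the §43 gate.
interface ToolReport {
  detectedRate: number;      // fraction of fixtures where the bug was found
  falsePositiveRate: number; // fraction of clean states wrongly flagged
  p95Ms: number;             // 95th-percentile duration per fixture, in ms
}

function meetsAcceptance(r: ToolReport): boolean {
  return (
    r.detectedRate >= 0.8 &&      // ≥ 80% detected
    r.falsePositiveRate <= 0.1 && // ≤ 10% false positives
    r.p95Ms < 2 * 60_000          // p95 under 2 minutes
  );
}
```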