benchmark agents on the work you actually do.

capture a real session once. replay it against every agent you use. get one honest number back.

$ curl -fsSL https://benchy.run/install | sh

Public benchmarks tell you which model is best at their tests. benchy measures the only thing that matters: how each agent does your real work.

//three steps

capture once, replay against anything

01

capture

automatic

Finished Claude Code sessions become evals automatically: a portability filter keeps only replayable tasks, and clean ones auto-promote into your set. Or capture by hand with /add-to-benchy.

02

replay

$ benchy run

Replay those prompts against any installed agent on the exact repo state: repo-backed against a real diff, or scratch in a fresh workspace.

03

score

$ benchy results

A blind judge, which never sees the model identity, grades each diff into one composite score from 0 to 100. Those scores then aggregate into an all-time leaderboard.

//the single number

every run collapses to one composite

composite = judge_overall × gate_factor

The judge sees the task, the rubric, and the work, never the model identity. Then gates apply: a pretty diff that doesn't compile can't beat an ugly one that does.

91.2   judge 92 × gate 1.00
0.0    build failed · capped 0.30

judge_overall · rubric weights

task completion      0.40
correctness          0.30
feedback adherence   0.20
scope discipline     0.10
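
As a rough sketch of how these weights could combine: assuming the judge scores each rubric dimension on 0–100 (the page states only the weights, so the scale is an assumption), judge_overall would be a plain weighted sum. The function name, dictionary keys, and example scores below are hypothetical, not benchy's actual API or data.

```python
# Rubric weights as listed above. The 0-100 per-dimension scale is an assumption.
RUBRIC_WEIGHTS = {
    "task_completion": 0.40,
    "correctness": 0.30,
    "feedback_adherence": 0.20,
    "scope_discipline": 0.10,
}


def judge_overall(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores (hypothetical helper)."""
    return sum(weight * scores[name] for name, weight in RUBRIC_WEIGHTS.items())


# Made-up scores for illustration only.
example = {
    "task_completion": 95,
    "correctness": 90,
    "feedback_adherence": 90,
    "scope_discipline": 90,
}
print(round(judge_overall(example), 1))
```

Because the weights sum to 1.0, the result stays on the same scale as the inputs.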

gate_factor · multipliers

build failure   caps at 0.30
test failure    × 0.50
lint failure    − 0.10
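
A minimal sketch of the gates, under stated assumptions: the factor starts at 1.00, a test failure halves it, a lint failure subtracts 0.10, and a build failure caps the result at 0.30. The page does not say how the gates interact when more than one fails, so the ordering here is a guess, and both function names are hypothetical.

```python
def gate_factor(build_ok: bool, tests_ok: bool, lint_ok: bool) -> float:
    """Combine the three gates into one multiplier (assumed ordering)."""
    factor = 1.0
    if not tests_ok:
        factor *= 0.50              # test failure: x 0.50
    if not lint_ok:
        factor -= 0.10              # lint failure: - 0.10
    if not build_ok:
        factor = min(factor, 0.30)  # build failure: caps at 0.30
    return max(factor, 0.0)


def composite(judge_overall: float, factor: float) -> float:
    """composite = judge_overall x gate_factor, as in the formula above."""
    return judge_overall * factor


# Same judge score, different gate outcomes.
clean = composite(92, gate_factor(build_ok=True, tests_ok=True, lint_ok=True))
broken = composite(92, gate_factor(build_ok=False, tests_ok=True, lint_ok=True))
print(clean > broken)
```

Under these assumptions a diff that fails to build keeps at most 30% of its judge score, which is how an ugly diff that compiles can outrank a pretty one that does not.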

//benchy results

your leaderboard — and the community's

By default it's all yours: a model's number is the mean composite across the evals you ran. Opt in with benchy submit and your runs fold into a shared leaderboard aggregated by model — so the community can see which agent is genuinely best overall.
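
The per-model number can be sketched as a simple mean. The run records below are hypothetical (model, composite) pairs with placeholder model names. benchy's real storage format is not described here, so treat this as an illustration of the arithmetic only.

```python
from collections import defaultdict
from statistics import mean


def leaderboard(runs: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Mean composite per model, best first (hypothetical helper)."""
    by_model: dict[str, list[float]] = defaultdict(list)
    for model, composite in runs:
        by_model[model].append(composite)
    rows = [(model, mean(scores)) for model, scores in by_model.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Placeholder runs, not real results.
runs = [("agent-a", 90.0), ("agent-b", 40.0), ("agent-a", 70.0)]
for model, score in leaderboard(runs):
    print(f"{model}: {score:.1f}")
```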

Every run is stamped with its config version, so the pooled numbers stay comparable across everyone who submits.

repo-backed · scratch · oneshot · sequential · blind judge · opt-in community


//bring your own logins

every agent authenticates as itself

benchy stores zero API keys. Each supported agent logs in with its own account. benchy setup walks you through it.

claude-code

codex

cursor-agent

opencode

find the model that's best at being your pair.

Stop guessing from leaderboards you didn't write. One command to start:

$ curl -fsSL https://benchy.run/install | sh