benchmark agents on the work you actually do.

capture a real session once. replay it against every agent you use. get one honest number back.

$ curl -fsSL https://benchy.run/install | sh

Public benchmarks tell you which model is best at their tests. benchy measures the only thing that matters: how each agent does your real work.

//three steps

capture once, replay against anything

01

capture

/add-to-benchy

Turn the Claude Code session you just had into an eval: its prompts, repo, and commit, saved as one markdown file under ~/.config/benchy/snapshots.

02

replay

$ benchy run

Replay those prompts against any installed agent on the exact repo state: repo-backed against a real diff, or scratch in a fresh workspace.

03

score

$ benchy results

A judge that never sees the model identity grades each diff blind. Every run becomes one composite score from 0–100, and the scores aggregate into an all-time leaderboard.

//the single number

every run collapses to one composite

composite = judge_overall × gate_factor

The judge sees the task, the rubric, and the work, never the model identity. Then gates apply: a pretty diff that doesn't compile can't beat an ugly one that does.

91.2 · judge 92 × gate 1.00
0.0 · build failed · capped 0.30
judge_overall · rubric weights

task completion      0.40
correctness          0.30
feedback adherence   0.20
scope discipline     0.10

gate_factor · multipliers

build failure   caps at 0.30
test failure    × 0.50
lint failure    − 0.10
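
The scoring rule above can be sketched in a few lines of Python. This is a minimal sketch, not benchy's actual implementation: the function names, the 0-100 scale for each rubric criterion, and the order in which the gates combine (multiply for a test failure, subtract for a lint failure, then cap for a build failure) are all assumptions made for illustration.

```python
# Rubric weights as listed above; they sum to 1.0.
RUBRIC_WEIGHTS = {
    "task_completion": 0.40,
    "correctness": 0.30,
    "feedback_adherence": 0.20,
    "scope_discipline": 0.10,
}


def judge_overall(scores):
    """Weighted sum of per-criterion scores (assumed to be on a 0-100 scale)."""
    return sum(weight * scores[name] for name, weight in RUBRIC_WEIGHTS.items())


def gate_factor(build_ok=True, tests_ok=True, lint_ok=True):
    """Combine the gates into one multiplier.

    Assumed order: test failure multiplies, lint failure subtracts,
    build failure caps the result. The source does not specify this.
    """
    factor = 1.0
    if not tests_ok:
        factor *= 0.50
    if not lint_ok:
        factor -= 0.10
    if not build_ok:
        factor = min(factor, 0.30)
    return max(factor, 0.0)


def composite(scores, build_ok=True, tests_ok=True, lint_ok=True):
    """composite = judge_overall x gate_factor."""
    return judge_overall(scores) * gate_factor(build_ok, tests_ok, lint_ok)


# Hypothetical rubric scores for one run.
example = {
    "task_completion": 95,
    "correctness": 90,
    "feedback_adherence": 90,
    "scope_discipline": 85,
}
clean = composite(example)
broken_build = composite(example, build_ok=False)
```

Under these assumptions a run that fails its build can never score above 30% of its judge score, which is how an ugly diff that compiles outranks a pretty one that does not.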
//benchy results

your leaderboard — and the community's

By default it's all yours: a model's number is the mean composite across the evals you ran. Opt in with benchy submit and your runs fold into a shared leaderboard aggregated by model — so the community can see which agent is genuinely best overall.

Every run is stamped with its config version, so the pooled numbers stay comparable across everyone who submits.
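
The per-model number described above, the mean composite across the evals you ran, can be sketched as follows. This is an illustration only: the `leaderboard` function, the `(model, composite)` input shape, and the model names are hypothetical, and benchy's real storage format is not shown here.

```python
from collections import defaultdict
from statistics import mean


def leaderboard(runs):
    """Return [(model, mean_composite)] sorted best first.

    `runs` is an iterable of (model, composite) pairs, an assumed shape.
    """
    by_model = defaultdict(list)
    for model, score in runs:
        by_model[model].append(score)
    rows = [(model, mean(scores)) for model, scores in by_model.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical runs: two evals for one agent, one for another.
runs = [("agent-a", 80.0), ("agent-a", 90.0), ("agent-b", 70.0)]
board = leaderboard(runs)
```

A pooled community leaderboard would presumably apply the same averaging to submitted runs, restricted to runs that share a config version so the numbers stay comparable.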

repo-backed · scratch · oneshot · sequential · blind judge · opt-in community
//bring your own logins

every agent authenticates as itself

benchy stores zero API keys. Each supported agent logs in with its own account. benchy setup walks you through it.

claude-code
codex
cursor-agent
opencode

find the model that's best at being your pair.

Stop guessing from leaderboards you didn't write. One command to start:

$ curl -fsSL https://benchy.run/install | sh