Capture a real session once. Replay it against every agent you use. Get one honest number back.
Public benchmarks tell you which model is best at their tests. benchy measures the only thing that matters: how each agent does your real work.
Turn the Claude Code session you just had into an eval: its prompts, repo, and commit, saved as one markdown file under ~/.config/benchy/snapshots.
Replay those prompts against any installed agent on the exact repo state: repo-backed against a real diff, or scratch in a fresh workspace.
A judge that never sees the model identity grades each diff blind, producing one composite score from 0 to 100. Those scores then feed an all-time leaderboard.
The judge sees the task, the rubric, and the work, never the model identity. Then gates apply: a pretty diff that doesn't compile can't beat an ugly one that does.
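One way to picture the gate rule is as a sort key where the gate result outranks the score. This is only a sketch under assumptions: the `GradedRun` type, the `compiles` field, and the choice of "does it build" as the single gate are all hypothetical, and benchy's actual gates and scoring are not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedRun:
    agent: str        # hidden from the judge; used only when reporting
    composite: float  # blind judge score, 0-100
    compiles: bool    # hypothetical gate: did the diff build?


def rank_runs(runs: list[GradedRun]) -> list[GradedRun]:
    """Order runs best-first. The gate is compared before the composite,
    so a run that fails the gate can never outrank one that passes."""
    return sorted(runs, key=lambda r: (r.compiles, r.composite), reverse=True)


runs = [
    GradedRun("agent-a", composite=95.0, compiles=False),  # pretty, broken
    GradedRun("agent-b", composite=40.0, compiles=True),   # ugly, works
]
print([r.agent for r in rank_runs(runs)])  # ['agent-b', 'agent-a']
```

Putting the gate first in the key, rather than subtracting a penalty from the score, means no composite is high enough to buy back a failed build.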
By default it's all yours: a model's number is the mean composite across the evals you ran. Opt in with benchy submit and your runs fold into a shared leaderboard aggregated by model, so the community can see which agent is genuinely best overall.
Every run is stamped with its config version, so the pooled numbers stay comparable across everyone who submits.
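The aggregation described above (mean composite per model, with runs kept comparable by config version) could look roughly like this. It is a minimal sketch, not benchy's implementation: the tuple layout, the `leaderboard` function, and the decision to bucket by config version rather than mix versions are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def leaderboard(runs):
    """Aggregate runs into (config_version, model, mean_composite) rows.

    `runs` is an iterable of (model, config_version, composite) tuples;
    this layout is hypothetical.
    """
    buckets = defaultdict(list)
    for model, config_version, composite in runs:
        # Only runs stamped with the same config version are pooled.
        buckets[(config_version, model)].append(composite)

    rows = [(cv, model, mean(scores)) for (cv, model), scores in buckets.items()]
    # Group by config version, best mean first within each version.
    return sorted(rows, key=lambda row: (row[0], -row[2]))


runs = [
    ("model-x", "v1", 80),
    ("model-x", "v1", 60),
    ("model-y", "v1", 90),
    ("model-x", "v2", 50),
]
for row in leaderboard(runs):
    print(row)
```

Keying the buckets on the config version means a change to the rubric or gates starts a fresh set of means instead of silently blending with older runs.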
benchy stores zero API keys. Each supported agent logs in with its own account. benchy setup walks you through it.
Stop guessing from leaderboards you didn't write. One command to start: