
I tested 7 AI coding agents in one afternoon. Only 2 finished.

Same React refactor. Same Sonnet 4.6 endpoint. Claude Code and Cursor one-shot it. Devin spent 87 minutes and broke unrelated tests. Here are the real results, by the run, not by the benchmark.

Tony Stark
Contributor · 3 min · 2 days ago
Chapters
  • 01 Real-world test, not a benchmark: refactor a 340-line React component, 7 agents, one shot each.
  • 02 Claude Code and Cursor both passed on the first try. Codex, Cline, and Aider were partial.
  • 03 Devin and Continue failed. Devin spent 87 minutes and broke unrelated tests.
  • 04 The agents that won did less. The agents that lost did more.
  • 05 SWE-bench scores predicted almost nothing about real-repo behavior.

I gave seven AI coding agents the exact same chore on a Tuesday afternoon: refactor a 340-line React component out of one big Context provider into three useReducer-backed slices. Same repo, same prompt, and the same Claude Sonnet 4.6 endpoint wherever the agent let me bring my own model.

Two of them shipped working code on the first try. Three shipped something that compiled but needed human fixes before I could trust it. Two shipped something that broke the tests in a way that would have taken me longer to diagnose than rewriting it myself.

This is the part of the AI coding wars no benchmark captures. Not the SWE-bench number. Not the marketing screenshot. The actual middle-of-the-day, did-the-thing-work test.

2 / 7 · Agents that one-shot the refactor
87 min · Longest agent total time, end-to-end
$11.42 · Total API spend for all seven runs combined

The setup, no kidding-around

The repo: a real internal-tools Next.js 15 app I’ve been tending for six months. The component: a single file at src/components/DashboardProvider.tsx, 340 lines, four contexts inside one provider, with selectors that re-rendered every consumer on every change. Classic 2023 sin, well-documented in my own pull request history.

The prompt: “Split DashboardProvider into three useReducer-backed contexts: FiltersContext, SelectionContext, and PreferencesContext. Preserve the existing public API. Update consumers. Keep tests green.”
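
For readers who have not split a provider this way, here is a minimal sketch of the reducer half of one slice. The state shape and action names are assumptions made up for illustration; the article does not show the real DashboardProvider state. Only the reducer is shown, so the snippet runs without React installed.

```typescript
// Hypothetical shape for the filters slice (not taken from the real repo).
type FiltersState = { query: string; tags: string[] };

// Hypothetical actions the slice might accept.
type FiltersAction =
  | { type: "setQuery"; query: string }
  | { type: "toggleTag"; tag: string }
  | { type: "reset" };

const initialFilters: FiltersState = { query: "", tags: [] };

// Pure reducer: the function a FiltersContext provider would pass to
// React's useReducer(filtersReducer, initialFilters).
function filtersReducer(state: FiltersState, action: FiltersAction): FiltersState {
  switch (action.type) {
    case "setQuery":
      return { ...state, query: action.query };
    case "toggleTag": {
      const hasTag = state.tags.indexOf(action.tag) !== -1;
      const tags = hasTag
        ? state.tags.filter((t) => t !== action.tag)
        : [...state.tags, action.tag];
      return { ...state, tags };
    }
    case "reset":
      return initialFilters;
  }
}
```

SelectionContext and PreferencesContext would each get a reducer of the same shape, so a change to filters no longer forces consumers of the other two slices to re-render.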

The scoring rules I held myself to:

  • Pass: every existing test green, no TypeScript errors, no behavioral regression I could spot in 90 seconds of click-testing.
  • Partial: shipped something that compiled but needed one or two human edits to pass tests.
  • Fail: shipped something I couldn’t fix in under 10 minutes of human time.
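
The rubric above can be written down as a small function. This is one possible reading of the rules, not the exact procedure used that afternoon; the field names are illustrative, and a run that needs more than two edits is treated as a Fail here, which the rules do not spell out.

```typescript
type Verdict = "Pass" | "Partial" | "Fail";

// Observations recorded for one agent run. Field names are hypothetical.
interface RunObservation {
  testsGreen: boolean;
  typeErrors: number;
  regressionSpotted: boolean; // found during roughly 90 seconds of click-testing
  humanEditsNeeded: number;   // edits required to reach a working state
  humanFixMinutes: number;    // time spent making those edits
}

function scoreRun(run: RunObservation): Verdict {
  const clean = run.testsGreen && run.typeErrors === 0 && !run.regressionSpotted;
  // Pass: everything green with no human edits at all.
  if (clean && run.humanEditsNeeded === 0) return "Pass";
  // Partial: one or two human edits, fixed in under ten minutes.
  if (run.humanEditsNeeded <= 2 && run.humanFixMinutes < 10) return "Partial";
  // Fail: anything that could not be rescued within that budget.
  return "Fail";
}
```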

I ran each agent once. No second tries. No “regenerate.” The whole point was to see what they do when you treat them like a coworker who only gets one shot, not a slot machine you keep pulling.

[Photo: side view of multiple laptops showing code editors and terminals. Seven agents, one component, one afternoon. The kind of bake-off no benchmark publishes. · Pexels]

The ranked verdict

My ranking after one afternoon
Nothing cherry-picked. Listed in ranked order, best result first.

1. Claude Code (Sonnet 4.6) — Pass

Read every relevant file before touching anything. Asked me one clarifying question about whether PreferencesContext should persist to localStorage (yes). Shipped three new files, a refactor of the consumers, and a passing test suite in about 14 minutes. Cost: $2.31 in API spend. The diff was the cleanest of the seven and the only one that bothered to add a brief comment explaining why each split was justified.

This is the boring answer. It is also the right answer in my test.

2. Cursor (Agent mode, Sonnet 4.6 forced) — Pass

Faster than Claude Code (11 minutes). The reads were aggressive — it pulled in twelve files where Claude Code pulled in four. Diff was about 30% noisier with reorganization the prompt didn’t ask for. But it worked. Tests green. I had to manually accept one diff where it tried to rename a prop, which is annoying but not fatal.

If you’re the kind of engineer who likes a chatty intern, this is your tool. If you’re not, the noise will grate.

3. Codex CLI (GPT-5.5 default) — Partial

Strong start. Codex actually surfaced a side issue I’d ignored (the existing tests didn’t cover one of the contexts I wanted to split) and offered to write the missing test before doing the refactor. I declined (out of scope). It shipped the refactor in 18 minutes, but it forgot to update one consumer in a deeply nested route file. The tests still passed because no test exercised that route. I caught it when I clicked through.

Codex is, in my test, the agent most likely to tell you what you don’t want to hear. That is genuinely valuable. It cost me 22 minutes of human review to catch the miss.

4. Cline (Sonnet 4.6, BYOK) — Partial

The open-source dark horse. Cline’s plan-then-act mode is genuinely excellent — it produced a written plan, asked me to approve it, then executed cleanly. The refactor was correct. The problem was the test setup. Cline didn’t realize I was using Vitest with a custom setupTests.ts file, and tried to patch a Jest-style mock that doesn’t exist in my repo. Five minutes to fix.

If you’re running a stock CRA repo, Cline probably one-shots this. On anything bespoke, you’re a coworker, not a customer.

CLAUDE CODE · 14 min · $2.31 · One-shot pass, clean diff, asked a clarifying question
VS
DEVIN · 87 min · $4.10 + ACUs · Refactored three files, broke two unrelated tests, never recovered

5. Aider (Sonnet 4.6, BYOK) — Partial

Aider is the git-native old soul of this list. The diff format is gorgeous. The commits are atomic. The agent is also the most likely to ask you to manually add files to the chat — and if you skip a file it needs, the refactor will be technically wrong in a way that looks technically right.

I forgot to add my types.ts to chat. Aider shipped a refactor that compiled because the inferred types happened to match. It also silently dropped two optional props that lived only in the type definition. Tests passed. Behavior changed. (This is the kind of bug you only catch in code review or in production. Guess which one I caught it in.)
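
A minimal sketch of how this kind of regression gets past the compiler. The prop names and the two functions are hypothetical stand-ins, not code from my repo; the point is only that TypeScript's structural typing accepts a wider object wherever a narrower parameter type is declared.

```typescript
// Hypothetical prop type, standing in for what lived only in types.ts.
interface WidgetProps {
  title: string;
  compact?: boolean;
  subtitle?: string;
}

// Before: the component honors the optional props.
function describeBefore(props: WidgetProps): string {
  const base = props.compact ? props.title.toUpperCase() : props.title;
  return props.subtitle ? `${base} (${props.subtitle})` : base;
}

// After: rewritten without seeing WidgetProps, so the parameter type is
// narrower. The optional props are gone, yet callers still type-check.
function describeAfter(props: { title: string }): string {
  return props.title;
}

const widget: WidgetProps = { title: "Revenue", compact: true, subtitle: "Q3" };

const renderedBefore = describeBefore(widget);
// Compiles: WidgetProps is assignable to { title: string }.
const renderedAfter = describeAfter(widget);
```

Here renderedBefore is "REVENUE (Q3)" and renderedAfter is "Revenue". No type error, no failing test unless one asserts on the optional props, and a visible behavior change.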

6. Continue (open-source, GPT-4.1) — Fail

I like Continue. I host my own models with it. For this test I let it run with GPT-4.1 and the result was a refactor that introduced a circular import between two of the new context files. Tests broke. The error was clear. The fix Continue suggested when I pasted the error back made the circular import worse. After two more rounds I gave up.

This is probably more about GPT-4.1’s structural reasoning than about Continue itself, but in my test it failed and I can’t pretend otherwise.
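
If you want to catch this class of failure before pasting errors back into an agent, a small cycle check over the import graph is enough. The sketch below assumes you already have a map of module to imported modules (how you extract it is up to you), and the module names are hypothetical.

```typescript
// Map from module path to the modules it imports. Paths are hypothetical.
type ImportGraph = Record<string, string[]>;

// Depth-first search that returns the first import cycle found, or null.
function findImportCycle(graph: ImportGraph): string[] | null {
  const visiting: string[] = []; // modules on the current DFS path
  const done: string[] = [];     // modules already fully explored

  function visit(mod: string): string[] | null {
    const at = visiting.indexOf(mod);
    // Seeing a module that is already on the path means we closed a loop.
    if (at !== -1) return visiting.slice(at).concat(mod);
    if (done.indexOf(mod) !== -1) return null;
    visiting.push(mod);
    for (const dep of graph[mod] || []) {
      const cycle = visit(dep);
      if (cycle) return cycle;
    }
    visiting.pop();
    done.push(mod);
    return null;
  }

  for (const mod of Object.keys(graph)) {
    const cycle = visit(mod);
    if (cycle) return cycle;
  }
  return null;
}

// Two new context files importing each other, as in the failed run.
const cyclicGraph: ImportGraph = {
  "FiltersContext.tsx": ["SelectionContext.tsx"],
  "SelectionContext.tsx": ["FiltersContext.tsx"],
  "PreferencesContext.tsx": [],
};
const foundCycle = findImportCycle(cyclicGraph);
```

For the graph above, foundCycle lists FiltersContext.tsx, then SelectionContext.tsx, then FiltersContext.tsx again.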

7. Devin — Fail

I really wanted Devin to be good. The marketing is so confident. The video so smooth.

What I got was an 87-minute session where Devin spun up a sandboxed environment, ran npm install twice, decided my tsconfig.json needed updating (it didn’t), modified five files unrelated to the task, and produced a final PR that touched DashboardProvider almost as an afterthought. The refactor itself was 60% there. The unrelated changes broke two existing tests. The total ACU spend was substantial enough that I’m reluctant to publish it without double-checking the meter.

The marketing positions Devin as autonomous. In my test, it was autonomous in the way a labrador puppy is autonomous. Lots of motion. Different result than asked for.
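
One cheap guard against this behavior is to compare the files an agent changed against the paths the task was supposed to touch. This is a sketch of the idea rather than a tool I ran during the test; the function name, the allowed prefix, and every file path except tsconfig.json and DashboardProvider.tsx are made up for illustration.

```typescript
// Returns the changed files that fall outside the allowed path prefixes.
function outOfScopeFiles(changed: string[], allowedPrefixes: string[]): string[] {
  return changed.filter(
    (file) => !allowedPrefixes.some((prefix) => file.indexOf(prefix) === 0)
  );
}

// Hypothetical file list from an agent's PR.
const changedFiles = [
  "src/components/DashboardProvider.tsx",
  "src/components/FiltersContext.tsx",
  "tsconfig.json",
  "src/utils/date.ts",
];

// Hypothetical scope: the refactor should stay inside src/components/.
const strayFiles = outOfScopeFiles(changedFiles, ["src/components/"]);
```

With these inputs strayFiles contains tsconfig.json and src/utils/date.ts, which is the signal to stop and review before merging anything.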

What the SWE-bench numbers don’t tell you

Claude Opus 4.7 is reportedly at 87.6% on SWE-Bench Verified as of this month. GPT-5.5 is somewhere around 82%. Devin reports a 67% PR merge rate on “defined tasks.” Read those numbers and you’d assume all three are basically interchangeable for a refactor like mine.

They are not interchangeable. The benchmarks measure whether the model can solve a problem in a curated, well-scoped environment with a known test suite. My test measured whether the agent can solve a problem when the test suite is incomplete, the file structure is messy, and the person at the keyboard is busy reading email.

PUBLIC BENCHMARK CONTEXT
Claude Opus 4.7: 87.6% on SWE-Bench Verified, leading the leaderboard. GPT-5.5: ~82%. Devin: ~67% on defined tasks. None of those numbers tell you which agent will not break unrelated tests in your specific repo.
Sources: localaimaster.com, vals.ai/benchmarks/swebench, lowcode.agency · May 2026

The benchmark gap between the top three is small. The real-world gap, for my one task, was enormous.

The pattern I noticed

The agents that won did less. The agents that lost did more. That matches what every senior engineer I’ve worked with for fifteen years would say about junior engineers. The good ones know when to stop. The bad ones treat every ticket like an opportunity to refactor the universe.

The pricing also matters. Claude Code’s $2.31 for a 14-minute pass-on-first-try beats Devin’s $4-plus-ACUs for an 87-minute fail. The economics of “good and fast” massively undercut the economics of “autonomous and unsupervised” once you count the human cleanup time.

What I’m probably wrong about

One test, one component, one engineer. This is not a benchmark. It is one afternoon. The results would absolutely shift if:

  • The task was greenfield. (Devin and Cursor probably climb. Aider stays the same.)
  • The codebase was huge. (Aider and Cline both have better long-context strategies than I gave them credit for.)
  • The model was Opus 4.7 instead of Sonnet 4.6. (Possibly closes the Devin gap.)
  • The user was less experienced. (Cursor’s chattiness becomes a feature, not a bug.)

I also didn’t test Windsurf, Zed’s new agent, or Amp. The selection was mine. The omissions are mine. I’ll get to them next month.

[Photo: two developers reviewing code on a single laptop screen. The next test should include Windsurf and Amp. Comments open. Receipts welcome. · Pexels]

The take I’d bet on

For the kind of task most senior engineers do most days — focused refactors, contained bug fixes, small features in well-tested code — the boring tools beat the autonomous tools. Claude Code or Cursor or Cline, used like a fast collaborator, will outperform Devin or any “give me a ticket and walk away” agent for the next two release cycles at least.

The autonomous-agent thesis isn’t wrong. It is early. The models that will eventually make Devin work also exist on Claude Code and Cursor and Cline right now, under your hands, with a feedback loop. Use them.

I’m running this test again in a month, on a different repo, with a different stack. Drop a reply if there’s an agent you want me to add. I’ll include them and link the results back here.

● Editor's takeaways
2 / 7 · Agents that one-shot the refactor
87 min · Longest agent total session time
$2.31 · Claude Code total cost
$11.42 · Combined API spend across all seven runs
Autonomy without discipline is just expensive entropy.