I gave seven AI coding agents the exact same chore on a Tuesday afternoon. Refactor a 340-line React component out of one big Context provider into three useReducer-backed slices. Same repo, same prompt, same Claude Sonnet 4.6 endpoint wherever the agent allowed me to bring my own model.

Two of them shipped working code on the first try. Three shipped something that compiled but needed human fixes before I could trust it. Two shipped something that would have taken me longer to repair than just rewriting it myself.

This is the part of the AI coding wars no benchmark captures. Not the SWE-bench number. Not the marketing screenshot. The actual middle-of-the-day, did-the-thing-work test.

2 / 7 · Agents that one-shot the refactor

87 min · Longest agent total time, end-to-end

$11.42 · Total API spend for all seven runs combined

The setup, no kidding-around

The repo: a real internal-tools Next.js 15 app I’ve been tending for six months. The component: a single file at src/components/DashboardProvider.tsx, 340 lines, four contexts inside one provider, with selectors that re-rendered every consumer on every change. Classic 2023 sin, well-documented in my own pull request history.
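For readers who have not hit this pattern, here is a minimal sketch of why one combined context value hurts. The state shape and field names are hypothetical, not the real DashboardProvider. The only React behavior it leans on is that a context provider notifies its consumers whenever its `value` changes identity.

```typescript
// Hypothetical combined state, standing in for the single provider value.
type DashboardState = {
  filters: { query: string };
  selection: { ids: string[] };
  preferences: { dense: boolean };
};

const prev: DashboardState = {
  filters: { query: "" },
  selection: { ids: [] },
  preferences: { dense: false },
};

// Selecting a row only touches `selection`, but the top-level object is new.
const next: DashboardState = { ...prev, selection: { ids: ["row-1"] } };

// What the provider effectively compares (React uses Object.is, which agrees
// with !== for object references): the whole value changed identity...
const providerValueChanged = prev !== next;
// ...even though the slice a filters-only consumer reads did not.
const filtersSliceChanged = prev.filters !== next.filters;

console.log({ providerValueChanged, filtersSliceChanged });
```

This logs `providerValueChanged: true` and `filtersSliceChanged: false`. With one provider, the filters-only consumer is re-rendered anyway; with a separate context per slice, it would not be.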

The prompt: “Split DashboardProvider into three useReducer-backed contexts: FiltersContext, SelectionContext, and PreferencesContext. Preserve the existing public API. Update consumers. Keep tests green.”
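To make the target concrete, here is roughly what one useReducer-backed slice looks like with the React wiring left out. This is a sketch under assumptions: the state fields and action names are invented for illustration, and the post does not show what the real FiltersContext holds.

```typescript
// Hypothetical shape for one of the three slices; the real fields are not shown in the post.
type FiltersState = { query: string; tags: string[] };

type FiltersAction =
  | { type: "setQuery"; query: string }
  | { type: "toggleTag"; tag: string }
  | { type: "reset" };

const initialFilters: FiltersState = { query: "", tags: [] };

// Pure reducer: the function you would pass to useReducer(filtersReducer, initialFilters).
function filtersReducer(state: FiltersState, action: FiltersAction): FiltersState {
  switch (action.type) {
    case "setQuery":
      // Return the same reference when nothing changes, so no update is triggered.
      return action.query === state.query ? state : { ...state, query: action.query };
    case "toggleTag": {
      const has = state.tags.indexOf(action.tag) !== -1;
      const tags = has
        ? state.tags.filter((t) => t !== action.tag)
        : [...state.tags, action.tag];
      return { ...state, tags };
    }
    case "reset":
      return initialFilters;
  }
}

// Replaying actions by hand, the way a unit test for the slice might.
const actions: FiltersAction[] = [
  { type: "setQuery", query: "overdue" },
  { type: "toggleTag", tag: "billing" },
  { type: "toggleTag", tag: "billing" },
];
const finalFilters = actions.reduce(filtersReducer, initialFilters);
console.log(finalFilters);
```

Because the reducer is a plain function, it can be tested without rendering anything. That is part of why "keep tests green" is a reasonable ask for this kind of refactor.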

The scoring rules I held myself to:

Pass: every existing test green, no TypeScript errors, no behavioral regression I could spot in 90 seconds of click-testing.
Partial: shipped something that compiled but needed one or two human edits to pass tests.
Fail: shipped something I couldn’t fix in under 10 minutes of human time.

I ran each agent once. No second tries. No “regenerate.” The whole point was to see what they do when you treat them like a coworker who only gets one shot, not a slot machine you keep pulling.

Image: software engineer at a multi-screen workstation. Devin’s “first AI software engineer” pitch is what every other agent on this list now gets measured against. · Photo: Getty Images via TechCrunch, “Goldman Sachs is testing viral AI agent Devin as a ‘new employee’”

The ranked verdict

My ranking after one afternoon

Cherry-picked nothing. Listed in the order they placed, best first.

1. Claude Code (Sonnet 4.6) — Pass

Read every relevant file before touching anything. Asked me one clarifying question about whether PreferencesContext should persist to localStorage (yes). Shipped three new files, a refactor of the consumers, and a passing test suite in about 14 minutes. Cost: $2.31 in API spend. The diff was the cleanest of the seven and the only one that bothered to add a brief comment explaining why each split was justified.

This is the boring answer. It is also the right answer in my test.

2. Cursor (Agent mode, Sonnet 4.6 forced) — Pass

Faster than Claude Code (11 minutes). The reads were aggressive — it pulled in twelve files where Claude Code pulled in four. Diff was about 30% noisier with reorganization the prompt didn’t ask for. But it worked. Tests green. I had to manually accept one diff where it tried to rename a prop, which is annoying but not fatal.

If you’re the kind of engineer who likes a chatty intern, this is your tool. If you’re not, the noise will grate.

3. Codex CLI (GPT-5.5 default) — Partial

Strong start. Codex actually surfaced a side issue I’d ignored — the existing tests didn’t cover one of the contexts I wanted to split — and offered to write the missing test before doing the refactor. I declined (out of scope). It shipped the refactor in 18 minutes, but it forgot to update one consumer in a deeply-nested route file. The tests still passed because no test exercised that route. Caught it when I clicked through.

Codex is, in my test, the agent most likely to tell you what you don’t want to hear. That is genuinely valuable. The missed consumer, though, cost me 22 minutes of human review to catch.

4. Cline (Sonnet 4.6, BYOK) — Partial

The open-source dark horse. Cline’s plan-then-act mode is genuinely excellent — it produced a written plan, asked me to approve it, then executed cleanly. The refactor was correct. The problem was the test setup. Cline didn’t realize I was using Vitest with a custom setupTests.ts file, and tried to patch a Jest-style mock that doesn’t exist in my repo. Five minutes to fix.

If you’re running a stock CRA repo, Cline probably one-shots this. On anything bespoke, you’re a coworker, not a customer.

CLAUDE CODE · 14 min · $2.31 · One-shot pass, clean diff, asked a clarifying question

DEVIN · 87 min · $4.10 + ACUs · Refactored three files, broke two unrelated ones, never recovered

5. Aider (Sonnet 4.6, BYOK) — Partial

Aider is the git-native old soul of this list. The diff format is gorgeous. The commits are atomic. The agent is also the most likely to ask you to manually add files to the chat — and if you skip a file it needs, the refactor will be technically wrong in a way that looks technically right.

I forgot to add my types.ts to chat. Aider shipped a refactor that compiled because the inferred types happened to match. It also silently dropped two optional props that lived only in the type definition. Tests passed. Behavior changed. (This is the kind of bug you only catch in code review or in production. Guess which one I caught it in.)
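The failure mode is easy to reproduce in isolation. In this sketch every name is hypothetical, since the real types.ts is not shown, and it is not a reconstruction of what Aider actually wrote. It only shows why the compiler stays quiet: a rewritten function that rebuilds state from the fields it knows about still satisfies the type, because optional props are allowed to be absent.

```typescript
// Hypothetical type standing in for the definition that lived in types.ts.
type Preferences = {
  theme: "light" | "dark";
  pageSize: number;
  pinnedWidget?: string; // optional: omitting it is still a valid Preferences
  locale?: string; // optional
};

// Original behavior: the spread keeps every field, including the optional ones.
function setThemeBefore(prefs: Preferences, theme: Preferences["theme"]): Preferences {
  return { ...prefs, theme };
}

// A rewrite that never saw the type rebuilds the object from the fields it
// could infer. This still type-checks, because optional props may be absent.
function setThemeAfter(prefs: Preferences, theme: Preferences["theme"]): Preferences {
  return { theme, pageSize: prefs.pageSize };
}

const saved: Preferences = {
  theme: "light",
  pageSize: 25,
  pinnedWidget: "revenue",
  locale: "en-GB",
};

const before = setThemeBefore(saved, "dark");
const after = setThemeAfter(saved, "dark");

console.log(before.pinnedWidget, after.pinnedWidget);
```

This prints `revenue undefined`. A test that only asserts on `theme` passes for both versions, so nothing fails until someone notices the pinned widget is gone.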

Essay

I tested 7 AI coding agents in one afternoon. Only 2 finished.

Same React refactor. Same Sonnet 4.6 endpoint. Claude Code and Cursor one-shot it. Devin spent 87 minutes and broke unrelated tests. Here are the real results, by the run, not by the benchmark.

Holt

Contributor · 1w ago

01 Real-world test, not a benchmark: refactor a 340-line React component, 7 agents, one shot each.
02 Claude Code and Cursor both passed on the first try. Codex, Cline, and Aider were partial.
03 Devin and Continue failed. Devin spent 87 minutes and broke unrelated tests.
04 The agents that won did less. The agents that lost did more.
05 SWE-bench scores predicted almost nothing about real-repo behavior.
