I gave seven AI coding agents the exact same chore on a Tuesday afternoon. Refactor a 340-line React component out of one big Context provider into three useReducer-backed slices. Same repo, same prompt, same Claude Sonnet 4.6 endpoint wherever the agent allowed me to bring my own model.
Two of them shipped working code on the first try. One of them shipped working code on the third. Four of them shipped something that compiled and broke the tests in a way that took me longer to diagnose than just rewriting it myself.
This is the part of the AI coding wars no benchmark captures. Not the SWE-bench number. Not the marketing screenshot. The actual middle-of-the-day, did-the-thing-work test.
The setup, no kidding around
The repo: a real internal-tools Next.js 15 app I’ve been tending for six months. The component: a single file at src/components/DashboardProvider.tsx, 340 lines, four contexts inside one provider, with selectors that re-rendered every consumer on every change. Classic 2023 sin, well documented in my own pull request history.
The prompt: “Split DashboardProvider into three useReducer-backed contexts: FiltersContext, SelectionContext, and PreferencesContext. Preserve the existing public API. Update consumers. Keep tests green.”
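For readers who haven't done this kind of split, here is a minimal sketch of what one "useReducer-backed slice" might look like. The state fields and action names are hypothetical (my real FiltersContext has different ones), and I've left React out so the reducer runs on its own.

```typescript
// Hypothetical state and action shapes, for illustration only.
type FiltersState = {
  query: string;
  tags: string[];
};

type FiltersAction =
  | { type: "setQuery"; query: string }
  | { type: "toggleTag"; tag: string }
  | { type: "reset" };

const initialFilters: FiltersState = { query: "", tags: [] };

// Pure reducer: takes the current state and an action, returns the next state.
function filtersReducer(state: FiltersState, action: FiltersAction): FiltersState {
  switch (action.type) {
    case "setQuery":
      // Returning the same object when nothing changed lets React
      // bail out of the re-render.
      return action.query === state.query ? state : { ...state, query: action.query };
    case "toggleTag":
      return {
        ...state,
        tags: state.tags.includes(action.tag)
          ? state.tags.filter((t) => t !== action.tag)
          : [...state.tags, action.tag],
      };
    case "reset":
      return initialFilters;
  }
}

// Inside the provider this would be wired up roughly as:
//   const [state, dispatch] = useReducer(filtersReducer, initialFilters);
// with state and dispatch exposed through FiltersContext.
```

The point of the split is that a filter change only touches this slice, so components that read only selection or preferences have no reason to re-render.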
The scoring rules I held myself to:
- Pass: every existing test green, no TypeScript errors, no behavioral regression I could spot in 90 seconds of click-testing.
- Partial: shipped something that compiled but needed one or two human edits to pass tests.
- Fail: shipped something I couldn’t fix in under 10 minutes of human time.
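If it helps to see the rubric as logic, here is a rough sketch. It is my own simplification, not a tool I actually ran: the field names are made up, and "one or two human edits" is approximated by the same 10-minute fix budget that defines a Fail.

```typescript
// Hypothetical record of one agent run; field names are illustrative.
type Outcome = {
  testsGreen: boolean;
  typeErrors: number;
  regressionSpotted: boolean; // found during the 90 seconds of click-testing
  humanFixMinutes: number;    // time spent fixing, not reviewing; 0 if no fix was needed
};

type Verdict = "Pass" | "Partial" | "Fail";

function score(o: Outcome): Verdict {
  // Pass needs all three conditions at once.
  if (o.testsGreen && o.typeErrors === 0 && !o.regressionSpotted) return "Pass";
  // Anything else is Partial if I could fix it inside the budget, Fail otherwise.
  return o.humanFixMinutes < 10 ? "Partial" : "Fail";
}
```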
I ran each agent once. No second tries. No “regenerate.” The whole point was to see what they do when you treat them like a coworker who only gets one shot, not a slot machine you keep pulling.

The ranked verdict
1. Claude Code (Sonnet 4.6) — Pass
Read every relevant file before touching anything. Asked me one clarifying question about whether PreferencesContext should persist to localStorage (yes). Shipped three new files, a refactor of the consumers, and a passing test suite in about 14 minutes. Cost: $2.31 in API spend. The diff was the cleanest of the seven and the only one that bothered to add a brief comment explaining why each split was justified.
This is the boring answer. It is also the right answer in my test.
2. Cursor (Agent mode, Sonnet 4.6 forced) — Pass
Faster than Claude Code (11 minutes). The reads were aggressive — it pulled in twelve files where Claude Code pulled in four. Diff was about 30% noisier with reorganization the prompt didn’t ask for. But it worked. Tests green. I had to manually accept one diff where it tried to rename a prop, which is annoying but not fatal.
If you’re the kind of engineer who likes a chatty intern, this is your tool. If you’re not, the noise will grate.
3. Codex CLI (GPT-5.5 default) — Partial
Strong start. Codex actually surfaced a side issue I’d ignored (the existing tests didn’t cover one of the contexts I wanted to split) and offered to write the missing test before doing the refactor. I declined (out of scope). It shipped the refactor in 18 minutes, but it forgot to update one consumer in a deeply nested route file. The tests still passed because no test exercised that route. Caught it when I clicked through.
Codex is, in my test, the agent most likely to tell you what you don’t want to hear. That is genuinely valuable. The missed consumer, though, cost me 22 minutes of human review to catch.
4. Cline (Sonnet 4.6, BYOK) — Partial
The open-source dark horse. Cline’s plan-then-act mode is genuinely excellent — it produced a written plan, asked me to approve it, then executed cleanly. The refactor was correct. The problem was the test setup. Cline didn’t realize I was using Vitest with a custom setupTests.ts file, and tried to patch a Jest-style mock that doesn’t exist in my repo. Five minutes to fix.
If you’re running a stock CRA repo, Cline probably one-shots this. On anything bespoke, you’re a coworker, not a customer.
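For context on the Vitest mismatch, a setup like mine typically looks something like the sketch below. This is not my actual config: the environment and the setup file path are assumptions for illustration.

```typescript
// vitest.config.ts (hypothetical sketch, not the real file)
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    // Assumed: component tests want a DOM-like environment.
    environment: "jsdom",
    // Runs before each test file. The path is an assumption.
    setupFiles: ["./setupTests.ts"],
  },
});

// In test files, mocks go through Vitest's `vi` object, for example:
//   import { vi } from "vitest";
//   vi.mock("./some-module");
// A `jest.mock(...)` call has nothing to bind to unless Jest globals are shimmed.
```

An agent that reads the config before writing test code has what it needs to pick the right mocking API.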
5. Aider (Sonnet 4.6, BYOK) — Partial
Aider is the git-native old soul of this list. The diff format is gorgeous. The commits are atomic. The agent is also the most likely to ask you to manually add files to the chat — and if you skip a file it needs, the refactor will be technically wrong in a way that looks technically right.
I forgot to add my types.ts to chat. Aider shipped a refactor that compiled because the inferred types happened to match. It also silently dropped two optional props that lived only in the type definition. Tests passed. Behavior changed. (This is the kind of bug you only catch in code review or in production. Guess which one I caught it in.)
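Here is a small sketch of how that kind of drop can slip past the compiler. The interface and function names are hypothetical stand-ins for what lived in my types.ts, not the real code.

```typescript
// Hypothetical stand-in for the type definition the agent never saw.
interface Preferences {
  theme: "light" | "dark";
  density?: "compact" | "comfortable";
  locale?: string;
}

// Before: every field is carried through.
function prefsBefore(input: Preferences): Preferences {
  return { ...input };
}

// After: a refactor that only knows about `theme`. It still type-checks
// against Preferences because the two missing fields are optional.
function prefsAfter(input: { theme: "light" | "dark" }): Preferences {
  return { theme: input.theme };
}

// One possible guard: Record over keyof makes every key required here,
// so adding a field to Preferences without listing it is a compile error.
const PREFERENCE_KEYS: Record<keyof Preferences, true> = {
  theme: true,
  density: true,
  locale: true,
};

const full: Preferences = { theme: "dark", density: "compact", locale: "en-GB" };
const kept = prefsBefore(full);
const dropped = prefsAfter(full);
```

In this sketch `kept` still carries `density` and `locale`, while `dropped` has lost both without a single type error. A key list like `PREFERENCE_KEYS` gives tests or copy helpers something to iterate over, so an optional field can't vanish unnoticed.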



