Research

Reasoning is cheap. Awareness isn't.

Model tiers buy reasoning depth. The context layer buys sight. We measured what happens when a small model finally gets to see the whole change.

Rhei Team5 min read

The question

Frontier reasoning is getting cheaper and more automated by the month. So the interesting question for a coding agent is no longer how smart the model is. It is what the model can see. Can a cheap model do most of the work on a real codebase if the context layer gives it sight? We wanted a real answer, so we ran the work in our own monorepo. Port a Rust module to TypeScript with parity tests. Adopt the ported module in a caller path. Consolidate key formats that had drifted apart. Prove that hashes match across the two languages. These tasks have ground truth. The work either lands or it does not.

What we measured first

We already had one public number. Against RepoPrompt on 15 replayed review sessions, Rhei matched quality on 15 of 15 while using 51.6 percent fewer tokens, 89.6 percent fewer tool calls, and finishing 33.5 percent faster.

That number says the context engine is efficient. It does not say whether efficiency is the same thing as sight. So we ran the harder question: what actually moves quality on a small model.

Sight, not reasoning

The lever turned out to be reading, not thinking. A cheap model at low reasoning effort sat at 0.64 mean patch quality. We told it to read every file it would change or cite, enumerate the call sites, and never edit a file it had not read, and the same model jumped to 0.92. The gain came from scope, not from more reasoning.

That points straight at where small models fall down. They are not worse at knowing what to do with the right files in front of them. They are worse at deciding which files matter. Hand them that decision and most of the gap to a larger model closes. Each task cost about 0.52 dollars, against 0.65 for shell brute force, at one third of the tool calls.

Mean patch quality by configuration
0.64narrow scope0.92full scope1.0shell brute force3x tool calls

Shell brute force reaches 1.0, but at three times the tool calls of the full-scope run.

Plans are ceilings

We also tested the orchestration everyone assumes: a strong planner writes the plan, a cheap executor follows it. In 12 paired runs, a planner at high reasoning effort handing a plan to a cheap executor never beat the executor working alone.

The failure mode was deference. When the handoff was binding, as in “follow the plan”, the executor scored 0.55. When the handoff was advisory, as in “the plan is a floor, verify beyond it”, the executor scored 0.92. The same plan helped or hurt depending on whether the executor was allowed to look past it. A plan from a smarter model becomes a ceiling the moment it is treated as an instruction.

Patch quality by handoff mode, 12 paired runs
0.55binding handoff0.92advisory handoff

On complex work, coverage is quality

Scope fixes how a model uses what it can see. Structure decides what it can see at all, and that is where the gap widens. On one ambitious migration across six ground truth files, a frontier model at high reasoning effort with a free shell found three of the six, and never knew it had missed half the surface. The run ended confident. The structured run found all six.

That gap does not show up as a low score. It ships as a change that is quietly half done, and the rework costs more than any token it saved. On complex work you are not buying quality, you are buying coverage, and coverage is quality. The honest unit is cost per complete outcome, not cost per run.

Ground truth files found, one migration, six files
3 of 6free shell6 of 6structured

Not knowing what it did not find is the failure mode no amount of model quality self-detects. Context gathering is solved for greppable, single-surface tasks. On wide, ambiguous surfaces it is not. That gap is the work.

What it means

Model tiers buy reasoning depth. The context layer buys sight. A small model that sees the right files behaves like a bigger one, and a big model that cannot see the edges of a change is still guessing.

This is why we think awareness, not raw reasoning, is becoming the bottleneck. As models take over more of the execution, depth of thought gets cheaper and more automated, and the constraint moves to what the agent is aware of: the files it must touch, the call sites it must not miss, the prior work it should reuse. Reasoning is cheap. Awareness isn’t.

Every claim above is measured: which files were read, which calls ran, what each phase cost. That is the standard we hold ourselves to, and the standard we think agent tooling should be held to.

Benchmarks are single repetition runs on our own repository unless stated. The RepoPrompt comparison is a replayed slice of 15 pairs. We publish the measurements, not vibes.

<- All posts