The Reasoning Sandwich | Explainers

Read this as Which phase deserves the expensive thinking budget?

Failure Trap: Setting max reasoning globally and paying for long thoughts during tactical file edits.
Decision Rule: Allocate max to plan and verify, mid to implementation, then validate on a held-out task suite.

1 / ?

Agent work has phases

The harness can label calls as plan, implement, or verify. That phase label is the allocation surface.

Planning decides the route.
Implementation executes many narrow steps.
Verification decides whether to ship.

Max everywhere underperformed

The source benchmark reports max-tier reasoning on every phase scoring 53.9 percent on Terminal Bench 2.0.

Same model and task suite.
Only the allocation changed.
More thinking was not better.

The middle phase pays the tax

Implementation has many short, tactical calls. Max reasoning there can cause timeouts, over-thinking, and trace bloat.

Timeouts spend time thinking instead of testing.
Obvious edits get second-guessed.
Long traces crowd the context.

The sandwich spends reasoning at the edges

The winning shape keeps max reasoning for planning and verification but uses a mid tier while executing.

Plan is high-asymmetry.
Implementation is high-volume.
Verify is high-asymmetry again.

The measured score rose to 66.5 percent

On the cited run, the sandwich scored 66.5 percent, a 12.6 percentage-point lift over max-everywhere.

Same model.
Same task suite.
The harness changed the allocation.

The rule is asymmetry times volume

Spend thinking where wrong decisions compound; shrink it where many small actions are already externally verified.

Do not let the model pick the budget.
Map semantic tiers to provider literals.
Tune per task class with evals.