The Reasoning Sandwich

See why max reasoning belongs at plan and verify, while implementation often works better at a mid tier.

Read this as: which phase deserves the expensive thinking budget?
Failure Trap
Setting max reasoning globally and paying for long thoughts during tactical file edits.
Decision Rule
Allocate max to plan and verify, mid to implementation, then validate on a held-out task suite.
[Figure: The Reasoning Sandwich. Panels: 3 phases (plan, implement, verify); Max all (xhigh plan, xhigh impl, xhigh verify; 53.9%); Failure modes (timeouts, overthinking, trace bloat); Sandwich (max plan, mid impl, max verify; 66.5%); Receipt (53.9 to 66.5, +12.6 pp); Rule (asymmetry, step volume, held-out eval).]

Agent work has phases

The harness can label calls as plan, implement, or verify. That phase label is the allocation surface.

  • Planning decides the route.
  • Implementation executes many narrow steps.
  • Verification decides whether to ship.
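
A minimal sketch of what phase labeling could look like inside a harness. The names `Phase`, `LabeledCall`, and `label_calls` are hypothetical and not taken from any particular framework; the only thing the text fixes is the three labels themselves.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    PLAN = "plan"
    IMPLEMENT = "implement"
    VERIFY = "verify"


@dataclass(frozen=True)
class LabeledCall:
    """One model call plus the phase label the harness attached to it."""
    phase: Phase
    prompt: str


def label_calls(plan_prompt, step_prompts, verify_prompt):
    """Wrap a task's prompts so every call carries a phase label.

    Assumes the harness already knows which prompt plays which role.
    """
    calls = [LabeledCall(Phase.PLAN, plan_prompt)]
    calls += [LabeledCall(Phase.IMPLEMENT, p) for p in step_prompts]
    calls.append(LabeledCall(Phase.VERIFY, verify_prompt))
    return calls
```

Because the label travels with the call, any later budget decision can key off `call.phase` without inspecting the prompt text.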

Max everywhere underperformed

The source benchmark reports that max-tier reasoning on every phase scored 53.9 percent on Terminal Bench 2.0.

  • Same model and task suite.
  • Only the allocation changed.
  • More thinking was not better.

The middle phase pays the tax

Implementation has many short, tactical calls. Max reasoning there can cause timeouts, over-thinking, and trace bloat.

  • A call that times out spent its budget thinking instead of testing.
  • Obvious edits get second-guessed.
  • Long traces crowd the context.

The sandwich spends reasoning at the edges

The winning shape keeps max reasoning for planning and verification but uses a mid tier while executing.

  • Plan is high-asymmetry.
  • Implementation is high-volume.
  • Verify is high-asymmetry again.
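
The sandwich shape can be written down as a phase-to-tier table. This is an illustrative sketch: the tier names `"max"` and `"mid"` are semantic labels from the text, while `tiers_for_run` and the example run length are assumptions made for the demonstration.

```python
# Phase-to-tier tables. Tier names are semantic labels, not provider values.
SANDWICH = {"plan": "max", "implement": "mid", "verify": "max"}
MAX_EVERYWHERE = {"plan": "max", "implement": "max", "verify": "max"}


def tiers_for_run(phases, allocation):
    """Return the reasoning tier for each labeled call, in order."""
    unknown = [p for p in phases if p not in allocation]
    if unknown:
        raise ValueError(f"unlabeled or unknown phases: {unknown}")
    return [allocation[p] for p in phases]


# A hypothetical run: one plan call, four implementation steps, one verify call.
run = ["plan"] + ["implement"] * 4 + ["verify"]
print(tiers_for_run(run, SANDWICH))
# ['max', 'mid', 'mid', 'mid', 'mid', 'max']
```

The printed list shows the shape directly: expensive reasoning at both ends, a cheaper tier across the high-volume middle.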

The measured score rose to 66.5 percent

On the cited run, the sandwich scored 66.5 percent, a 12.6 percentage-point lift over max-everywhere.

  • Same model.
  • Same task suite.
  • The harness changed the allocation.
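
The lift can be checked from the two reported scores. The absolute figure is the one quoted in the text; the relative figure is derived here from the same two numbers and is not separately reported by the source.

```python
max_everywhere = 53.9  # percent, max-tier reasoning on every phase
sandwich = 66.5        # percent, max plan / mid implement / max verify

# Absolute lift in percentage points.
absolute_lift_pp = round(sandwich - max_everywhere, 1)

# Relative lift over the max-everywhere baseline, in percent.
relative_lift_pct = round(100 * (sandwich - max_everywhere) / max_everywhere, 1)

print(absolute_lift_pp)   # 12.6
print(relative_lift_pct)  # 23.4
```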

The rule is asymmetry times volume

Spend thinking where wrong decisions compound; shrink it where many small actions are already externally verified.

  • Do not let the model pick the budget.
  • Map semantic tiers to provider literals.
  • Tune per task class with evals.
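
One way the three bullets could fit together, as a hedged sketch. Everything named here is hypothetical: `example_provider`, its literal values, the `"default"` task class, and the helper functions. Real provider literals should come from the provider's own documentation.

```python
from typing import Mapping, Sequence

# Semantic tiers per phase, keyed by task class. Only the "default" row
# reflects the text; further rows would be added after running evals.
ALLOCATIONS = {
    "default": {"plan": "max", "implement": "mid", "verify": "max"},
}

# Placeholder literals for an imaginary provider. Substitute real values.
PROVIDER_LITERALS = {
    "example_provider": {"mid": "medium", "max": "xhigh"},
}


def resolve_effort(task_class: str, phase: str, provider: str):
    """The harness, not the model, turns (task class, phase) into a literal."""
    allocation = ALLOCATIONS.get(task_class, ALLOCATIONS["default"])
    tier = allocation[phase]
    return PROVIDER_LITERALS[provider][tier]


def pass_rate(results: Sequence[bool]) -> float:
    """Percent of held-out tasks that passed."""
    if not results:
        raise ValueError("no held-out results to score")
    return 100.0 * sum(results) / len(results)


def pick_allocation(held_out: Mapping[str, Sequence[bool]]) -> str:
    """Return the name of the allocation with the best held-out pass rate."""
    return max(held_out, key=lambda name: pass_rate(held_out[name]))
```

Keeping semantic tiers separate from provider literals means a tuning change touches only `ALLOCATIONS`, and a provider change touches only `PROVIDER_LITERALS`.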