Read this as Which phase deserves the expensive thinking budget?
- Failure Trap
- Setting max reasoning globally and paying for long thoughts during tactical file edits.
- Decision Rule
- Allocate max to plan and verify, mid to implementation, then validate on a held-out task suite.
Agent work has phases
The harness can label calls as plan, implement, or verify. That phase label is the allocation surface.
- Planning decides the route.
- Implementation executes many narrow steps.
- Verification decides whether to ship.
Max everywhere underperformed
The source benchmark reports max-tier reasoning on every phase scoring 53.9 percent on Terminal Bench 2.0.
- Same model and task suite.
- Only the allocation changed.
- More thinking was not better.
The middle phase pays the tax
Implementation has many short, tactical calls. Max reasoning there can cause timeouts, over-thinking, and trace bloat.
- Timeouts spend time thinking instead of testing.
- Obvious edits get second-guessed.
- Long traces crowd the context.
The sandwich spends reasoning at the edges
The winning shape keeps max reasoning for planning and verification but uses a mid tier while executing.
- Plan is high-asymmetry.
- Implementation is high-volume.
- Verify is high-asymmetry again.
The measured score rose to 66.5 percent
On the cited run, the sandwich scored 66.5 percent, a 12.6 percentage-point lift over max-everywhere.
- Same model.
- Same task suite.
- The harness changed the allocation.
The rule is asymmetry times volume
Spend thinking where wrong decisions compound; shrink it where many small actions are already externally verified.
- Do not let the model pick the budget.
- Map semantic tiers to provider literals.
- Tune per task class with evals.