This is Part 3 of the Harness Engineering deep dive. Read Part 02: The Four Primitives first if you have not seen the planning-and-verification framing.
Same model, same task. Reducing reasoning in the middle is what wins.
Why This Matters
Every agent tutorial that mentions thinking budgets tells you to crank reasoning to max and walk away. The model gets smarter the more it thinks, the post says. Give it the budget. Watch the score climb. The reasoning is so obvious it does not need defending — until somebody actually runs the experiment on a same-model, same-task benchmark and the score goes the other way.
LangChain ran that experiment on Terminal Bench 2.0 in February 2026 [lch-harness2026]. Single model — gpt-5.2-codex. Single task suite — 89 long-horizon (multi-step, multi-minute) coding and debugging tasks. The configuration with reasoning_effort: xhigh set on every step scored 53.9%. The configuration with xhigh on planning and verification but standard in the middle scored 66.5%. A 12.6-point swing from removing reasoning where the agent was doing the work.
Most public explainers — the LangChain harness-engineering write-up, the OpenAI cookbook examples for gpt-5/codex, the Anthropic adaptive-thinking doc — show the winning configuration but skip the worked example, the timeout failure mode, and the trace-bloat tax [lch-harness2026][gpt5-codex][anthropic-thinking]. That is what this chapter covers — the parts that decide whether your agent costs more and scores worse, not the parts that fit on a slide. Every claim cites a benchmark, a paper, or a provider doc.
```
Phase →      plan          implement     verify        Score
─────────────────────────────────────────────────────────────────────────
MAX          xhigh ████    xhigh ████    xhigh ████    53.9 %
MIN          low   █       low   █       low   █       (timeouts down, but no planning)
SANDWICH     xhigh ████    stnd  ██      xhigh ████    66.5 %  ◀ winner
```

Delta vs MAX: +12.6pp by reducing thinking in the middle. Tier labels are LangChain's benchmark notation: xhigh = top tier (OpenAI effort: "high" / Anthropic large budget_tokens).
Takeaway: More reasoning is not better. Strategically allocated reasoning is. The middle phase is where max-thinking costs you points.
The Single Number That Breaks Intuition
The number is the headline. Same model, same task suite, only one variable changed.
Terminal Bench 2.0 is an 89-task long-horizon agent benchmark covering ML engineering, debugging, and biology workflows [tbench2026]. Each task has a deterministic pass/fail signal, so the score is not a vibes-judge artifact. LangChain published the run under their harness-engineering write-up [lch-harness2026]:
| Configuration | Plan reasoning | Implementation reasoning | Verify reasoning | Score |
|---|---|---|---|---|
| Max-everywhere | xhigh | xhigh | xhigh | 53.9% |
| Sandwich | xhigh | standard | xhigh | 66.5% |
The two configurations differ in exactly one parameter: the implementation phase. The model is identical (gpt-5.2-codex). The agent harness is identical. The benchmark is identical. The judge is deterministic. The 12.6-point delta is the cost of leaving xhigh on during the part of the task where the agent is editing files and running commands.
A 12.6-point delta is not a tuning curiosity. LangChain reports that the same harness-only allocation choices lift the configuration from Top-30 to Top-5 on the Terminal Bench leaderboard [lch-harness2026]. The same write-up notes a 52.8% baseline with no harness changes at all — so harness-only allocation choices, not model upgrades, explain the bulk of the climb [lch-harness2026]. The model did not get smarter. The middle of the sandwich got thinner.
The result is also stable, not a single-run artifact. Terminal Bench scores each of the 89 tasks deterministically; LangChain’s published numbers are the aggregated rate, not a hand-picked best run [tbench2026]. The harness change is the independent variable; everything else — model weights, task suite, scoring rubric, retry budget — is held fixed. Most benchmark deltas this size come bundled with a model upgrade, a prompt rewrite, and a different judge. This one comes from one parameter, one phase, one direction.
The cost picture follows the same logic. The 12.6pp accuracy gain is paired with a strictly lower per-task token spend, since xhigh-tier billing dominates per-call cost and the sandwich removes it from the longest phase.
Takeaway: 53.9% → 66.5% on Terminal Bench 2.0, same gpt-5.2-codex, one knob changed. The middle of the agent loop is where over-reasoning hurts the most.
The Shape of the Sandwich
Three phases. Two thick layers of reasoning on the outside, one thin layer in the middle.
The plan phase is where the agent reads the task, surveys the codebase, and decides what to do. The output is a plan — a list of files to touch, tests to run, hypotheses to check. The plan phase is information-dense and asymmetric: a bad plan compounds across every subsequent step, while a good plan reduces the implementation phase to mechanical execution. Reasoning effort pays back here because the search space is large and the cost of getting it wrong is the whole task.
The implementation phase is where the agent runs the plan. Edit a file. Run a test. Read a stack trace. Apply a fix. Each step is short, tactical, and verifiable on its own. The decision surface at each step is narrow — usually a single tool call, a single edit, a single shell command. Asking a frontier model to think for 30 seconds before every line edit produces longer reasoning traces that fight for context budget with the actual file contents and do not change the next tool call.
The verify phase is where the agent decides whether the work is done. Read the test output. Compare against the spec. Look for regressions. Decide between “ship”, “fix”, and “ask the human”. This is the second high-stakes decision: a false-positive completion is a regression in production, a false-negative completion is wasted iteration. Reasoning effort pays back here for the same reason it pays back at plan time — the wrong decision compounds.
“xhigh” and “standard” are not opinion labels — they are semantic phase tiers that the harness maps onto concrete provider parameters. Anthropic exposes thinking with a budget_tokens field, a positive integer controlling the maximum tokens the model spends on internal reasoning before producing a visible response [anthropic-thinking]. OpenAI exposes reasoning.effort as a categorical control with four levels — minimal, low, medium, high — mapped to internal token budgets per provider model [gpt5-codex]. LangChain’s benchmark notation xhigh is shorthand for “the maximum effort tier available,” not a literal provider enum; on OpenAI it resolves to effort: "high", and on Anthropic it resolves to a large budget_tokens integer. The two knobs are cousins, not the same: Anthropic gives you a number, OpenAI gives you a four-level enum. Both control the same underlying behaviour — how many tokens the model spends thinking before answering — but the semantics of each level differ across providers and across model generations within a provider. Pin the version of both the model and the parameter in your eval harness.
A minimal sandwich allocation in code looks like this:
```python
# Pseudocode — adapt to your provider SDK.
# Semantic tiers ("plan-tier", "impl-tier") get mapped to provider literals.
# LangChain's "xhigh" notation = effort: "high" on OpenAI (their top tier)
#   = a large budget_tokens integer on Anthropic. Tune both per task class.

def allocate_reasoning(phase: str) -> dict:
    if phase == "plan":
        return {"reasoning": {"effort": "high"}}      # OpenAI
        # or {"thinking": {"budget_tokens": 16000}}   # Anthropic — illustrative
    if phase == "implement":
        return {"reasoning": {"effort": "medium"}}    # OpenAI
        # or {"thinking": {"budget_tokens": 4000}}    # Anthropic — illustrative
    if phase == "verify":
        return {"reasoning": {"effort": "high"}}      # OpenAI
        # or {"thinking": {"budget_tokens": 16000}}   # Anthropic — illustrative
    raise ValueError(f"unknown phase: {phase}")

# The harness picks the phase, not the model.
plan = call_model(messages, **allocate_reasoning("plan"))
for step in plan.steps:
    result = call_model(step.messages, **allocate_reasoning("implement"))
verdict = call_model(judge_messages, **allocate_reasoning("verify"))
```
The harness, not the model, decides which phase each call belongs to. That is the load-bearing line. A single-prompt agent that hands the model one giant task with one reasoning setting cannot get this benefit — there is no plan/implement/verify boundary to allocate across. The sandwich requires a phased loop.
One subtle point about the snippet above: each call is a separate model invocation. The plan call ends, its reasoning trace is summarized into a structured plan, and then implementation calls run against that plan with fresh reasoning budgets. The verify call reads the implementation outputs and runs again with its own budget. This matters because reasoning tokens are scoped to a call. A 16,000-token plan-phase budget does not carry over to implementation — it is spent or dropped at the call boundary. The harness gets the benefit precisely because the budget is per-call. If you collapse the loop into one streaming request with one budget, you have lost the allocation knob even though the parameter is still set.
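A minimal sketch of that call-boundary discipline, reusing `allocate_reasoning` from above. `Plan`, `parsed_steps`, and `build_judge_messages` are hypothetical harness helpers, not provider APIs; the point is that the structured plan crosses the call boundary while the reasoning trace does not.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    steps: list[str]  # the structured plan survives the call boundary;
                      # the plan call's reasoning trace does not

def run_task(task_messages: list[dict]):
    plan_resp = call_model(task_messages, **allocate_reasoning("plan"))
    plan = Plan(steps=plan_resp.parsed_steps)  # summarize trace into a plan
    for step in plan.steps:
        # Fresh request, fresh mid-tier budget, for every implementation step.
        call_model([{"role": "user", "content": step}],
                   **allocate_reasoning("implement"))
    # Verify runs as its own request with its own max-tier budget.
    return call_model(build_judge_messages(plan),
                      **allocate_reasoning("verify"))
```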
Takeaway: Three phases, three settings. Plan and verify get the max tier; implementation gets a mid tier. The harness picks the phase, not the model — single-prompt agents cannot benefit.
Why Max-Everywhere Underperforms
Three independent failure modes drag max-everywhere below the sandwich. They compound.
First, budget exhaustion and timeouts. Reasoning tokens cost wall-clock time and dollars on every step. Long-horizon agent benchmarks like Terminal Bench 2.0 cap total wall-clock per task, typically 10–30 minutes [tbench2026]. An agent that spends ~60 seconds of max-tier reasoning before each of ~40 tool calls burns the budget on thinking, not executing (illustrative figures, estimated from the Terminal Bench task profile). The harness then either kills the task as timed-out or returns a partial solution. LangChain’s write-up names this directly: max-reasoning runs hit the timeout cap on tasks the sandwich finished with budget to spare [lch-harness2026]. Time spent thinking is time not spent running the next test.
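The arithmetic is blunt. Using the illustrative figures above plus an assumed per-step execution time (not a benchmark number):

```python
# Back-of-envelope wall-clock check for max-everywhere (illustrative).
steps = 40      # tool calls per task — illustrative figure from the text
think_s = 60    # max-tier reasoning before each call — illustrative
exec_s = 10     # assumed execution time per step — not a benchmark figure

total_min = steps * (think_s + exec_s) / 60
print(f"{total_min:.0f} min")  # ~47 min — past any 10–30 min task cap
```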
Second, over-thinking degrades short tactical decisions. Frontier reasoning models trained to extend chains-of-thought (CoT) keep extending them when given the budget — even on decisions where the correct answer is obvious in two tokens. A tool call that should read “open foo.py, line 42, change + to -” turns into a 4,000-token reasoning trace re-deriving why subtraction is correct. The longer the trace, the more entry points for the model to second-guess itself, generate an alternative hypothesis, and pick the wrong branch. LangChain’s write-up that named the 53.9% baseline describes the pathology directly: the model treats every routine line edit as if it were a deep math problem, re-deriving the obvious instead of executing it [lch-harness2026]. The pathology is not that the model fails to find the right answer. It is that the model keeps looking after it has already found it.
Third, trace bloat consumes the context window. Reasoning traces are tokens. Even when the provider exposes the trace as a separate field (Anthropic’s thinking block) or hides it entirely (OpenAI’s internal CoT), the tokens count against the model’s effective working memory. Chroma’s context-rot work shows that retrieval and decision accuracy degrade as input length grows — not as a cliff but as a gradient that starts well below the nominal context-window limit [chroma-rot]. A 200K-token model with 150K of file contents and 30K of accumulated reasoning trace is doing decisions on a noisier substrate than one with 30K of trace removed. The implementation phase emits the most steps per task, so trace bloat from max-tier reasoning at implementation time compounds fastest.
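The matching mitigation, independent of the reasoning tier, is to drop or summarize prior thinking blocks between steps. A minimal sketch, assuming Anthropic-style content blocks of type "thinking"; note that some tool-use flows require the most recent thinking block to be passed back, so check the provider doc before stripping mid-turn.

```python
# Strip accumulated thinking blocks from prior turns before the next step.
# Assumes Anthropic-style content-block lists; adapt the check per provider.
def strip_thinking(messages: list[dict]) -> list[dict]:
    cleaned = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            content = [b for b in content if b.get("type") != "thinking"]
        cleaned.append({**msg, "content": content})
    return cleaned
```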
```
Task class       plan        implement   verify
──────────────────────────────────────────────────────────────────
code-write       max ████    mid ██      max ████   ◀ canonical sandwich
support-triage   mid ██      low █       mid ██     ◀ shallow sandwich
research         max ████    mid ██      max ████   ◀ thicker middle
retrieval        low █       low █       mid ██     ◀ no real plan
critique-only    —           —           max ████   ◀ verify-only
```

Read vertically: each column is a phase, bar height = reasoning tier. code-write is the benchmarked shape (53.9% → 66.5% on Terminal Bench). Other rows are extrapolations — replace each with your own eval.
The three failure modes are not three competing explanations of the same effect — they are three different mechanisms, each measurable on its own. Removing any one of them recovers some points; removing all three recovers the full 12.6pp.
A diagnostic for over-reasoning
If your max-everywhere agent fails almost entirely with timeout errors, the dominant failure mode is budget exhaustion. If it finishes within budget but produces verbose reasoning traces ending in confidently wrong tool calls, the dominant mode is over-thinking. If accuracy is high for the first 15 steps and degrades from step 20 onward, the dominant mode is trace bloat eating the effective context window. The three modes can be teased apart with a short timing-and-token-count instrumentation pass over the agent transcript, and the sandwich’s gain is the sum of the recovered points from each.
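That instrumentation pass can be a few lines. A sketch, assuming the harness logs per-step wall-clock seconds and a per-step correctness bit (both field names are hypothetical):

```python
# Classify the dominant over-reasoning failure mode from a transcript.
# Field names (wall_s, correct) are hypothetical harness log fields.
def diagnose(transcript: list[dict], wall_cap_s: float) -> str:
    if sum(step["wall_s"] for step in transcript) >= wall_cap_s:
        return "budget exhaustion: cut mid-loop reasoning first"
    early = [s["correct"] for s in transcript[:15]]
    late = [s["correct"] for s in transcript[20:]]
    if early and late and sum(early) / len(early) - sum(late) / len(late) > 0.2:
        return "trace bloat: drop or summarize thinking between steps"
    return "over-thinking: verbose traces ending in wrong tactical calls"
```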
Takeaway: Max-everywhere fails three independent ways: timeouts, over-thinking on tactical steps, and trace bloat. The sandwich attacks all three.
How to Allocate (the operator playbook)
The two-dimensional rule is: reasoning effort scales with decision asymmetry and shrinks with step volume. High-asymmetry decisions (plan, verify) get the max tier. High-volume narrow steps (implementation, retrieval, format-only transforms) get mid tier or below. The tier names below are semantic; map them to your provider literal — Anthropic budget_tokens integer or OpenAI reasoning.effort enum (minimal | low | medium | high) — in your provider adapter.
A first-pass allocation table by task class. Only the code-write row is benchmarked — the 53.9 → 66.5 number is from Terminal Bench 2.0 [lch-harness2026]. The other rows are operator extrapolation by analogy to that result; treat each as a hypothesis and replace it with your own offline eval before shipping.
| Task class | Plan | Implementation | Verify | Notes |
|---|---|---|---|---|
| Code-write (file edits + test) | max | mid | max | The canonical Terminal Bench shape [lch-harness2026]. |
| Support-triage (classify + draft + escalate) | mid | low | mid | Narrow plan space. Reserve mid-tier for the escalation decision. |
| Multi-step research (read → synthesize → cite) | max | mid | max | Implementation includes synthesis, which is not purely tactical. |
| Pure retrieval (RAG-style Q&A) | low | low | mid | No real plan phase. Verify catches grounding failures. |
| Reflection-only critique | n/a | n/a | max | One phase, single asymmetric decision. Spend the budget. |
Three worked examples:
Example 1 — Code-write. Task: “patch a flaky test, run the suite, open a PR.” Plan at max tier for roughly 15 seconds (illustrative — wall-clock depends on model and provider): identify the flake source, list the files to touch, decide whether to fix the test or the system. Implementation at mid tier: each edit, each shell command, each diff inspection runs at default reasoning. Verify at max tier: read the CI output, confirm no regressions, decide ship vs. needs-human. Typical step count: 1 plan, 12–25 implementation, 1 verify. Reasoning budget concentrates at the two ends.
Example 2 — Support-triage. Task: “classify the inbound ticket, draft a response, escalate if blocked.” This is the wrong shape for the canonical sandwich. The plan space is narrow (a finite set of categories), the implementation is a single short response draft, and the verify decision is whether to escalate. Use a mid tier at plan and verify, low at implementation. Reserving max-tier reasoning here wastes budget on a problem with no asymmetric high-stakes decision.
Example 3 — Multi-step research. Task: “given this question and 30 source documents, write a sourced answer.” Plan at max tier: decide which documents to read in what order, what claims to seek, what would falsify them. Implementation at a mid tier (not the lowest): each synthesis step is non-trivial reasoning over cited material, so the trade-off shifts. Verify at max tier: check every claim against its citation, flag unsourced statements. The middle layer is thicker than for code-write because synthesis is not as tactical as line-editing.
A policy primitive that captures this:
```python
# Starting point — replace per task class after offline eval.
# Tiers are semantic; the provider adapter maps them to literals
# (Anthropic budget_tokens int, OpenAI reasoning.effort enum).
PHASE_DEFAULTS = {
    "code-write":     {"plan": "max", "implement": "mid", "verify": "max"},
    "support-triage": {"plan": "mid", "implement": "low", "verify": "mid"},
    "research":       {"plan": "max", "implement": "mid", "verify": "max"},
    "retrieval":      {"plan": "low", "implement": "low", "verify": "mid"},
    "critique-only":  {"plan": None, "implement": None, "verify": "max"},
}

def reasoning_for(task_class: str, phase: str) -> str | None:
    return PHASE_DEFAULTS[task_class][phase]
```
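What `reasoning_for` returns is still semantic. The provider adapter is the piece that maps it to request kwargs. A sketch: the OpenAI strings come from the documented enum; the Anthropic integers are illustrative, so calibrate them per task class.

```python
# Provider adapter: semantic tier -> request kwargs.
# budget_tokens integers are illustrative; check the provider's
# documented minimum and calibrate per task class.
TIER_TO_OPENAI = {"max": "high", "mid": "medium", "low": "low"}
TIER_TO_ANTHROPIC = {"max": 16000, "mid": 4000, "low": 1024}

def provider_kwargs(provider: str, tier: str | None) -> dict:
    if tier is None:  # phase disabled for this task class
        return {}
    if provider == "openai":
        return {"reasoning": {"effort": TIER_TO_OPENAI[tier]}}
    if provider == "anthropic":
        return {"thinking": {"type": "enabled",
                             "budget_tokens": TIER_TO_ANTHROPIC[tier]}}
    raise ValueError(f"unknown provider: {provider}")

# Usage: provider_kwargs("anthropic", reasoning_for("code-write", "plan"))
```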
The table is a starting point, not a contract. Tune it per task class with an offline eval — Terminal Bench 2.0 for code-write tasks, a held-out triage set for support, a citation-grounding eval for research. The 12.6pp number is a benchmark on one task class; your task class may sandwich differently.
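In practice, "tune with an offline eval" can be a literal sweep over per-phase tiers on a held-out task set. A sketch, where `run_task` and `score` are your harness's own functions:

```python
import itertools

TIERS = ("low", "mid", "max")

def sweep_allocations(tasks, run_task, score):
    """Exhaustive sweep over per-phase tiers (27 configs); prune for cost."""
    results = {}
    for plan_t, impl_t, verify_t in itertools.product(TIERS, repeat=3):
        alloc = {"plan": plan_t, "implement": impl_t, "verify": verify_t}
        rate = sum(score(run_task(t, alloc)) for t in tasks) / len(tasks)
        results[(plan_t, impl_t, verify_t)] = rate
    best = max(results, key=results.get)
    return best, results
```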
Two anti-patterns to call out before they ship. First, do not let the model choose its own reasoning level. Reasoning models are trained to extend chains of thought; given a “decide your own budget” instruction, they will not voluntarily shrink. Allocation is a harness decision, made before the call and pinned in code. Second, do not pick reasoning levels by latency alone. Latency is a constraint, not a signal — max-tier reasoning is slow because the model is thinking, not because the parameter is misconfigured. Picking the lowest level that still meets the wall-clock budget skips the asymmetry question entirely, and the resulting agent under-thinks the plan and verify decisions where the cost of being wrong is the whole task.
Takeaway: Allocate reasoning by decision asymmetry × step volume. Code-write sandwiches deeply. Triage barely sandwiches at all. Tune the table with an eval, not a guess.
When the Sandwich Does Not Apply
The sandwich is a phased-loop pattern. Four conditions break it.
Single-phase tasks. If the agent does not have distinct plan, implement, and verify phases — for example, a one-shot text classification — there is no allocation surface. Pick the reasoning level appropriate to the decision and stop. Forcing a phased loop onto a single-decision task adds latency without recovering accuracy.
Very small models. On models without a real internal chain-of-thought channel — typically sub-frontier-tier open-weight checkpoints — the reasoning channel often collapses: the model produces a short reasoning preamble regardless of the budget setting, and the score is dominated by base capability rather than reasoning depth. Provider docs for adaptive thinking warn that the parameter is only meaningful on models trained to use it [anthropic-thinking]. Below that threshold, max-tier and mid-tier settings are nearly indistinguishable in output — and any difference is noise. Skip the sandwich; spend the engineering elsewhere.
Latency-bound real-time tasks. Voice agents, autocomplete, and any interactive surface with a sub-second budget cannot afford max-tier reasoning anywhere. Reasoning tokens are wall-clock time. A user on a phone call does not care that your verify step is asymmetric — they care that the response arrived in 300ms. Run these at the lowest tier (minimal or low on OpenAI, a small budget_tokens value on Anthropic) everywhere, accept the accuracy ceiling that imposes, and put the saved reasoning budget into a separate background quality-checker rather than the inline path.
Reasoning-trace-as-output tasks. Some workloads want the trace as the deliverable: math proofs, formal verification, step-by-step tutoring explanations. Here the implementation phase is the reasoning trace — the visible output is the trace plus a final answer. Shrinking the middle layer would shrink the deliverable. Run max-tier reasoning end-to-end and budget separately for it; the sandwich pattern is solving a problem these workloads do not have.
| Condition | Why the sandwich breaks | What to do instead |
|---|---|---|
| Single-phase task | No plan/implement/verify boundary to allocate across | Pick one reasoning level, stop |
| Model without internal-CoT capability | Reasoning parameter is a near no-op below the capability threshold | Skip the parameter; tune the prompt instead |
| Latency-bound real-time | Wall-clock budget < single max-tier step | Lowest tier everywhere; background quality-check separately |
| Reasoning-trace-as-output | The trace is the deliverable | Max tier end-to-end; do not shrink the middle |
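If you want the table as a guard in code, a minimal sketch follows; the task attributes are hypothetical names for whatever your harness tracks, and the latency threshold is illustrative.

```python
# Guard: does the sandwich pattern apply to this task at all?
# Attribute names are hypothetical; the threshold is illustrative.
def sandwich_applies(task) -> bool:
    return (task.has_phases                    # distinct plan/implement/verify
            and task.model_supports_reasoning  # real internal-CoT channel
            and task.latency_budget_s > 5      # not a sub-second surface
            and not task.trace_is_deliverable) # the trace is not the product
```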
Takeaway: The sandwich is a phased-loop optimization. If your loop is single-phase, your model is too small, your task is latency-bound, or your trace is the output — the pattern does not apply.
Do This, Not That
Read this table as semantic phase tiers (plan-tier, impl-tier, verify-tier) with the per-provider literals shown as illustrative defaults. Anthropic budget_tokens accepts any integer; OpenAI reasoning.effort is an enum of minimal | low | medium | high. Tune both per task class with an eval — the literals below are starting points, not contracts.
| Pattern | Anthropic literal (illustrative) | OpenAI literal | Verdict |
|---|---|---|---|
| Max-tier reasoning on every step of a long agent loop | budget_tokens: 16000 on every call | reasoning.effort: "high" on every call | Don’t. Burns wall-clock budget, bloats context, and re-derives obvious decisions. -12.6pp on Terminal Bench 2.0 [lch-harness2026]. |
| Sandwich: plan-tier max, impl-tier mid, verify-tier max | budget_tokens: 16000 / 4000 / 16000 | reasoning.effort: "high" / "medium" / "high" | Do. Canonical allocation for code-write [lch-harness2026]. |
| Allocation chosen by the model itself (“decide your own thinking budget”) | No supported mode | No supported mode | Don’t. Models trained to extend CoT will not choose to shrink it. Allocation is the harness’s job. |
| Same reasoning setting for one-shot classification | budget_tokens: 8000 (overkill) | reasoning.effort: "high" (overkill) | Don’t apply the sandwich. One-phase task; pick a single level matched to the decision asymmetry. |
| Max-tier reasoning trace as the deliverable (math proof, tutoring) | budget_tokens: 32000 end-to-end | reasoning.effort: "high" end-to-end | Do. The trace is the output; shrinking the middle shrinks the product. |
| Eval reasoning settings on the production benchmark before flipping | Run held-out eval per allocation | Run held-out eval per allocation | Do. The 12.6pp number is one task class on one benchmark. Yours will differ. |
| Pin provider, model version, and reasoning parameter together in the harness config | Pin claude-sonnet-X.Y + budget_tokens | Pin gpt-5.2-codex + reasoning.effort | Do. Reasoning-parameter semantics drift across model versions; an upgrade silently rescales budgets [anthropic-thinking][gpt5-codex]. |
| Conflate `thinking.budget_tokens` with `reasoning.effort` in code | n/a | n/a | Don’t. Cousins, not synonyms. Anthropic exposes a token count, OpenAI a four-level enum. Map them explicitly in your provider adapter. |
Takeaway: Allocate by phase, pin by version, eval before flipping. Treat the two providers’ knobs as cousins, not synonyms.
Gotchas
| Gotcha | Symptom | Fix |
|---|---|---|
| Reasoning parameter set globally on the client | Every call inherits the top tier; the sandwich has no effect. | Pass reasoning parameters per-call, not on the client constructor. The harness owns allocation. |
| Provider upgrade silently rescales reasoning budgets | Score drops 3–8pp after a model version bump with no code change. | Re-run the eval matrix on every provider version. Reasoning-parameter semantics are not stable across model generations [anthropic-thinking]. |
| Reasoning trace counts against effective context | Accuracy degrades after step ~20 in long agent loops, not before. | Drop or summarize prior thinking blocks between steps. Max-tier reasoning at every step compounds trace bloat fastest [chroma-rot]. |
| Sandwich requires phase labels | Policy never fires because every call is tagged the same phase. | Label calls explicitly — plan = first turn (or after explicit replanning), verify = after the final tool returns, implement = everything else. Tie the phase label to tool-call type or a state-machine transition, not to model heuristics. See the sketch after this table. |
| Streaming-API blur | Harness uses a single long-running streaming session instead of discrete calls; phase boundaries dissolve and the sandwich silently breaks. | Force a call boundary between plan, implement, and verify. Each phase is a separate request with its own reasoning parameter. Streaming inside a phase is fine; streaming across phases collapses allocation. |
| Prompt-cache interaction | Cache-hit ratio drops after enabling the sandwich; per-task cost climbs even though accuracy improved. | Reasoning tokens are not cache-eligible on most providers, and switching reasoning levels mutates the request envelope. Sequence the phase order so the cacheable prefix (system + tools + early context) stays stable across calls. See the prompt-cache chapter for the mechanics. |
| Mid-tier is not the same level across providers | Cross-provider eval shows the sandwich works for OpenAI and not Anthropic, or vice versa. | Map each provider’s mid tier explicitly. Concrete example: Anthropic thinking.budget_tokens: 4000 ≈ OpenAI reasoning.effort: "medium" (illustrative — calibrate per task and per model generation). |
| Plan phase asks model to plan and execute together | The model produces a one-shot solution, skipping the plan layer entirely. | Force a structured plan output (JSON or markdown checklist) before any tool call. The plan layer must emit a plan, not the answer. |
| Verify phase trusts the model’s “I’m done” claim | False-positive completions ship as regressions. | Verify must read tool output (test results, lint, type-check), not just the agent’s self-report. The asymmetric decision is “ship or not”, which needs independent evidence. |
| Reasoning budget consumed by the system prompt itself | Token budgets exhaust before the model gets to the user task. | Account for reasoning tokens separately from input tokens in the budget calculator. Anthropic and OpenAI bill them differently — read the provider doc before estimating cost. |
| Single-prompt agent with one reasoning setting | The team adopts the sandwich pattern and sees no improvement. | The sandwich requires a phased loop with distinct plan, implement, and verify calls. A one-prompt agent has no allocation surface to optimize. |
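The phase-labeling fix is the one most teams get wrong, so here is a minimal sketch of the state-machine approach from the table above. The event names are hypothetical; wire them to your harness's actual tool-call lifecycle.

```python
# Phase labeling as an explicit state machine. The harness, not the model,
# decides the phase; event names here are hypothetical.
def label_phase(turn: int, event: str) -> str:
    if turn == 0 or event == "replan_requested":
        return "plan"
    if event == "final_tool_returned":
        return "verify"
    return "implement"
```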
Takeaway: Most gotchas reduce to: allocate per-call (not globally), pin the model version, drop the trace between steps, and force a structured plan before any tool call.
What the Sandwich Teaches About Harness Engineering
The model’s parameters are part of the harness, not part of the model. A 12.6pp swing from one categorical knob — chosen per phase by the harness, not the model — is the simplest possible demonstration that “model capability” and “agent performance” are different quantities. The next chapter shows the second mechanism that depends on phased control: 04 — Coordinator Mode, where a single feature flag turns one model into a working three-layer multi-agent system without changing model weights at all.
Takeaway: Reasoning allocation is harness work. The model does not know which phase it is in — the harness does.
References
- [lch-harness2026] LangChain, “Improving Deep Agents with Harness Engineering,” February 2026. https://blog.langchain.com/improving-deep-agents-with-harness-engineering/ — Terminal Bench 2.0 results: 52.8% baseline → 66.5% sandwich; 53.9% max-tier-everywhere; Top-30 → Top-5 leaderboard movement; timeout failure-mode characterization; over-thinking pathology on tactical edits.
- [anthropic-thinking] Anthropic, “Adaptive Thinking,” platform documentation. https://platform.claude.com/docs/en/build-with-claude/adaptive-thinking — `thinking.budget_tokens` parameter semantics, model-version stability caveats, applicable model tiers.
- [gpt5-codex] OpenAI, “GPT-5 Codex Prompting Guide,” developer cookbook. https://developers.openai.com/cookbook/examples/gpt-5/codex_prompting_guide/ — `reasoning.effort` categorical control (`minimal`, `low`, `medium`, `high`); per-model behaviour. LangChain’s benchmark notation `xhigh` denotes the top tier and resolves to `effort: "high"` for OpenAI models.
- [chroma-rot] Chroma Research, “Context Rot.” https://research.trychroma.com/context-rot — accuracy as a degrading gradient with input-length growth; relevant to reasoning-trace bloat in long agent loops.
- [tbench2026] Terminal Bench 2.0 leaderboard and task spec. https://www.tbench.ai/leaderboard/terminal-bench/2.0 — 89 tasks across ML engineering, debugging, and biology; per-task wall-clock cap; deterministic pass/fail signal.
Next chapter: 04 — Coordinator Mode: A Working Multi-Agent System, From the Source
One question for the reader: In your current agent, can you point to the line of code that picks the reasoning budget? If it is set once on the client and never overridden per call, your sandwich has only one layer.