The Numbers That Killed the 'Wait for Better Models' Excuse

Summary

Five independent organizations, six receipts. Same models. Harness-only changes. LangChain 52.8 → 66.5. Anthropic SWE-bench 33 → 49. Nate Jones 42 → 78. Skills 29 → 95. GCC 11.7 → 40.7. ACE +10.6%. If your bench is flat, your harness is the lever — not the model. Here are the receipts.

Part 10 of the Harness Engineering deep dive. Pairs with Part 03: The Reasoning Sandwich and Part 04: Coordinator Mode; this chapter is the receipts behind their claims.

Same models. Harness-only changes. Six receipts from five independent organizations, double-digit deltas in every row.

Why This Matters

Every team frustrated by a flat bench thinks the next model will save them. The release cadence keeps the hope cheap — wait two months, swap the model string, watch the score climb three points, repeat. Then the next release ships, scores climb three more, the backlog of agent regressions does not move, and the team waits for the next release. “Wait for better models” is not a strategy. It is a posture fed by vendor marketing and “Model X vs Model Y” threads. None of them name a team that lifted a benchmark by changing only the harness.

This chapter does. Five independent organizations, six published receipts (LangChain produced two), six benchmarks — model held constant on every row, double-digit delta on every row. No single receipt is bulletproof. Practitioner benches trail papers; vendor blogs slant; big deltas carry caveats. The aggregation is the point. One +66 swing in isolation is cherry-picking. Five organizations across two years, different benches, different harness mechanisms, all double-digit — that is a pattern, not noise.

WHICH HARNESS LEVER MOVED THE NUMBER
Mechanism               Receipt(s) that exercised it

Reasoning allocation    [1] LangChain Deep Agents (sandwich shape)
Scaffolding / repo      [2] Anthropic SWE-bench Verified
                        [3] Nate B. Jones (practitioner)
Progressive disclosure  [4] LangChain Skills (instructions on-demand)
Memory architecture     [5] GCC (Git Context Controller, memory-as-fs)
                        [6] ACE (generator/reflector/curator)
Verification loop       [1] LangChain Deep Agents (test-before-done)
Context curation        [6] ACE (curator decides what persists)

Audit question: which of these levers is YOUR harness leaving on the
table? Most teams that say "we tried harness engineering" exercised
one or two and concluded the rest do not apply.

Takeaway: Six receipts span six leverage points. The audit question is which levers your harness leaves untouched, not whether harness engineering “works.”

The Receipts

One table, six rows. The load-bearing column is “Model” — every “After” number is the same model family as the “Before” number, with documented harness changes between them.

| # | Team | Benchmark | Before | After | Delta | Model (held constant) | Scoring | Source |
|---|------|-----------|--------|-------|-------|-----------------------|---------|--------|
| 1 | LangChain Deep Agents | Terminal Bench 2.0 | 52.8% | 66.5% | +13.7 pp | gpt-5.2-codex | deterministic (test-based) | [lch-harness2026] |
| 2 | Anthropic | SWE-bench Verified | 33% | 49% | +16 pp | Claude 3.5 Sonnet | deterministic (test-based) | [anthropic-swe] |
| 3 | Nate B. Jones | internal coding bench | 42% | 78% | +36 pp | single model, named in source | judge unspecified (caveat) | [nbj2026] |
| 4 | LangChain Skills | Claude Code task pass | 29% | 95% | +66 pp | Claude Code | scoring methodology unspecified; flag potential same-family bias if LLM-judged | [lch-skills2026] |
| 5 | GCC paper | SWE-Bench-Lite (via self-replication) | 11.7% | 40.7% | +29 pp | Claude (same model both runs) | deterministic (test-based) | [gcc-arxiv] |
| 6 | ACE framework | agent benchmarks (mean) | baseline | baseline + 10.6% | +10.6% | model held; harness varies | mixed (deterministic + judged across suite) | [ace-arxiv] |

What makes this table operator-grade is the column that is not there: model version. “Wait for better models” presumes the path from 33% to 49% on SWE-bench Verified passes through a model upgrade. Anthropic’s own blog shows it does not — same Sonnet, +16 points, harness rewrite. The other five rows tell the same story across Terminal Bench, an internal coding bench, Claude Code pass rate, SWE-Bench-Lite, and an agent-benchmark mean. Variable: harness. Constant: model [boh-p3].

Takeaway: Five independent organizations, six receipts, one constraint — model held constant. The delta is the harness.

Receipt 1 — LangChain Deep Agents on Terminal Bench

LangChain’s harness-engineering write-up reports a 52.8% baseline and a 66.5% post-harness score on Terminal Bench 2.0 with gpt-5.2-codex held constant across both runs [lch-harness2026]. The benchmark is 89 long-horizon coding tasks scored deterministically, so the headline is not a vibes-judge artifact. The change set was harness-only: reasoning allocation moved to the sandwich shape (max-tier at plan and verify, mid-tier in implementation), plus a verification-loop middleware so the agent could not declare a task done without running its own tests.
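
A minimal sketch of the two harness changes described above, under assumptions: the phase names, effort tiers, and `run_tests()` helper are illustrative, not LangChain's middleware API. It shows the shape (max-tier reasoning at plan and verify, mid-tier in the middle, plus a gate that refuses "done" until the task's own tests pass), not the implementation.

```python
import subprocess

# Sandwich-shaped reasoning allocation: max tier at plan and verify,
# mid tier during implementation. Tier names are illustrative.
REASONING_BY_PHASE = {"plan": "high", "implement": "medium", "verify": "high"}

def reasoning_effort(phase: str) -> str:
    """Effort tier the harness requests from the model for a given phase."""
    return REASONING_BY_PHASE.get(phase, "medium")

def run_tests(cmd: str = "pytest -q") -> bool:
    """Ground truth comes from the task's own test command, not the model."""
    return subprocess.run(cmd, shell=True).returncode == 0

def finish_task(agent_claims_done: bool) -> str:
    """Verification-loop gate: a completion claim is only accepted after tests pass."""
    if not agent_claims_done:
        return "continue"
    if not run_tests():
        return "tests_failed_keep_working"  # feed the failure back to the agent
    return "done"
```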

This receipt anchors the chapter because of independence. Terminal Bench 2.0 is third-party, scoring is deterministic, and gpt-5.2-codex is an OpenAI model — so the receipt is not Anthropic measuring Anthropic or LangChain measuring its own framework on its own bench. The reasoning-allocation mechanic lives in chapter 03 — The Reasoning Sandwich; the verification-loop failure modes this run avoids are catalogued in chapter 12 — Pitfalls. What chapter 10 contributes is the headline: same model, +13.7 points, on a benchmark you can replicate.

Takeaway: Independent benchmark, third-party model, deterministic scoring. +13.7 pp from harness only.

Receipt 2 — Anthropic on SWE-bench Verified

Anthropic reports Claude 3.5 Sonnet moving from 33% to 49% on SWE-bench Verified through harness improvements alone — no retrain, no version bump [anthropic-swe]. SWE-bench Verified is the human-validated subset focused on real GitHub issues evaluated against passing tests, so ground-truth is stronger than vanilla SWE-bench. A +16 point swing from a constant model is the delta most teams attribute to a release. Anthropic attributes it to scaffolding — repo navigation, tool-use prompting, post-edit verification.

The lesson reaches further than the score. SWE-bench Verified requires multi-file coordination (locate, edit, verify no regressions), which is the exact shape chapter 04 — Coordinator Mode covers in production form. Anthropic does not call it “coordinator mode,” but the plan/isolate/verify/summarize workflow is the same shape.
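
A hypothetical sketch of that plan / isolate / verify / summarize shape. The `call_model()` stub, step format, and test command are assumptions, not Anthropic's published scaffold; the point is that each phase is a separate, bounded call with its own context, and that verification comes from tests rather than the model's self-report.

```python
import subprocess

def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def plan_steps(issue: str) -> list[str]:
    # Plan: ask for the files to touch and the edit for each, one step per line.
    return call_model(f"List files and edits needed for: {issue}").splitlines()

def implement_step(step: str) -> str:
    # Isolate: each step is a fresh call that sees only its own slice of the repo.
    return call_model(f"Produce the edit for: {step}")

def verify(test_cmd: str = "pytest -q") -> bool:
    # Verify: ground truth is the test suite, not the model's claim of success.
    return subprocess.run(test_cmd, shell=True).returncode == 0

def solve(issue: str) -> str:
    edits = [implement_step(s) for s in plan_steps(issue)]
    if not verify():
        return "regressions found: loop back with the failing test output"
    # Summarize: a short artifact for the reviewer or the next session.
    return call_model("Summarize these edits for review:\n" + "\n".join(edits))
```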

Takeaway: Claude 3.5 Sonnet, harness changes only, +16 pp on SWE-bench Verified. Anthropic’s own data, on a human-validated bench.

Receipt 3 — Nate B. Jones on an Internal Coding Bench

Nate B. Jones, March 2026, reports a single-model swing from 42% to 78% on an internal coding benchmark — a +36 point delta from harness changes [nbj2026]. This is a practitioner-tier receipt with no public primary URL: bench, rubric, and judge are not published, and the report reaches us via secondary research notes. Treat it as a directional signal, not a reproducible result — the lowest-trust row of the six and the only one whose primary source we cannot link.

Included because the magnitude survives heavy discounting. Even halved, +18 points is still larger than most teams’ annual model-upgrade ROI. Flagged because the methodology is not reproducible and the judge is unspecified (a same-family LLM-judge would be a separate caveat on top). Treat it as a lower bound on practitioner-level magnitude, not as a precise number.

Takeaway: Practitioner-tier, internal bench, no public primary source. Included as a directional signal because +36 pp survives even aggressive discounting.

Receipt 4 — Skills on Claude Code (29 → 95)

The largest receipt is also the simplest. LangChain’s Skills write-up reports Claude Code’s task pass rate moving from 29% to 95% with the same model when “Skills” — progressively disclosed markdown files describing tools and workflows — are loaded into the instruction surface [lch-skills2026]. Model unchanged, prompt template unchanged; the harness gained one layer: descriptions in context, full bodies lazy-loaded on demand.
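
A minimal sketch of the mechanism, assuming skills live as markdown files in a skills/ directory with a one-line description at the top of each file. The layout and function names are illustrative, not the Skills implementation: only the index goes into context up front, and a full body is read only when the model asks for it.

```python
from pathlib import Path

SKILLS_DIR = Path("skills")  # one markdown file per skill, description on line 1

def skill_index() -> str:
    """What the model sees up front: one description line per skill."""
    lines = []
    for f in sorted(SKILLS_DIR.glob("*.md")):
        description = f.read_text().splitlines()[0]
        lines.append(f"- {f.stem}: {description}")
    return "Available skills (load one with load_skill(name)):\n" + "\n".join(lines)

def load_skill(name: str) -> str:
    """What the model gets only on demand: the full skill body."""
    return (SKILLS_DIR / f"{name}.md").read_text()
```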

The mechanic is “fewer-tools-beat-many-tools” applied to instructions. Decision accuracy climbs because the model is not paying a context tax on every irrelevant capability. Full argument in chapter 06 — Skills as Information Architecture. Vendor caveat applies (LangChain measuring a LangChain product on a LangChain-defined suite), but the magnitude is documented and the mechanism is reproducible in your own harness.

Takeaway: +66 pp from one harness layer (progressive disclosure). The biggest delta in the table.

Receipt 5 — GCC Paper (11.7% → 40.7% on SWE-Bench-Lite, via self-replication)

The GCC paper — Git Context Controller, arXiv 2508.00031v1 — reports +29 percentage points on SWE-Bench-Lite by reorganizing agent memory as a filesystem hierarchy [gcc-arxiv]. The agent sees a tree of files (prior decisions, intermediate artifacts, verification results), not a flat conversation. Model held constant; harness change is the memory representation; scoring is deterministic (test-based), so no LLM-judge bias on this row.
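
An illustrative sketch of memory-as-filesystem, under assumptions: the directory layout, categories, and function names are ours, not GCC's protocol. The durable state is a tree the agent can list and read selectively, instead of a flat transcript it must carry whole.

```python
from datetime import datetime, timezone
from pathlib import Path

MEMORY_ROOT = Path(".agent_memory")  # e.g. decisions/, artifacts/, verification/

def commit_memory(category: str, name: str, content: str) -> Path:
    """Persist one artifact (a decision, intermediate result, or verification log)."""
    path = MEMORY_ROOT / category / f"{name}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).isoformat()
    path.write_text(f"<!-- {stamp} -->\n{content}\n")
    return path

def memory_tree() -> str:
    """What goes into context: the tree listing; file bodies are read on demand."""
    return "\n".join(str(p.relative_to(MEMORY_ROOT))
                     for p in sorted(MEMORY_ROOT.rglob("*.md")))
```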

Methodology caveat: the +29 pp is from a self-replication case study — researchers used Claude to rebuild a Claude-Code-style CLI agent from scratch, then evaluated that rebuild on SWE-Bench-Lite. Without GCC: 11.7%. With GCC bolted on: 40.7%. The gain is the lift from adding GCC’s memory protocol to the rebuilt agent — not a generic “multi-step bench” claim. We name the distinction so readers apply the receipt at the right scope: evidence that memory-as-filesystem is a real harness lever on a deterministic public bench, measured on a controlled rebuild. Full session-memory mechanism (generator/reflector/curator) in chapter 08 — Session-Memory Feedback Loop.

Takeaway: Published paper, SWE-Bench-Lite, 11.7% → 40.7% on a self-replicated CLI. Deterministic scoring, named methodology.

Receipt 6 — ACE Framework (+10.6% on Agent Benchmarks)

ACE — Agentic Context Engineering, arXiv 2510.04618 — is a generator/reflector/curator loop reporting a +10.6% mean improvement across a battery of agent benchmarks [ace-arxiv]. Generator does work, reflector proposes context updates, curator decides what enters durable memory. The loop runs across sessions, so accumulated lessons compound. Lowest-headline-magnitude receipt in the table, widest evaluation surface.
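
A hypothetical sketch of the loop's shape. The three roles are the paper's concepts; the prompts, the single playbook.md file, and the `call_model()` stub are our assumptions, not the ACE implementation.

```python
from pathlib import Path

PLAYBOOK = Path("playbook.md")  # durable memory shared across sessions

def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def run_session(task: str) -> str:
    context = PLAYBOOK.read_text() if PLAYBOOK.exists() else ""
    # Generator: does the work, with accumulated lessons in context.
    result = call_model(f"{context}\n\nTask: {task}")
    # Reflector: proposes context updates from this session's trace.
    candidates = call_model(
        f"Task: {task}\nResult: {result}\nPropose lessons worth remembering.")
    # Curator: decides what actually enters durable memory; most proposals are dropped.
    kept = call_model(
        f"Current playbook:\n{context}\nCandidates:\n{candidates}\n"
        "Return only the updates worth keeping permanently.")
    if kept.strip():
        PLAYBOOK.write_text(context + "\n" + kept)
    return result
```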

Two reasons it earns its row. First, published paper — trust tier matches GCC. Second, +10.6% is a mean; individual benchmarks in the suite show wider deltas, dragged toward the middle by tasks where the mechanism does not apply cleanly. A double-digit mean across a suite is a stronger generalization signal than a single big delta on one bench. ACE’s curator is the formal version of “session memory” as a harness primitive.

Takeaway: Published paper, +10.6% mean across a suite. Conservative number; mechanism generalizes.

The Pattern

Same across all six: model held constant, harness is the variable, delta rivals or exceeds a typical model-version upgrade. Model-version upgrades within a generation deliver single-digit gains; the harness deltas here (+10.6% mean for ACE, and +13.7, +16, +29, +36, and +66 percentage points for the rest) match or exceed that on every row. The harness is doing the work engineers attribute to the model.

Different across the receipts: the specific mechanism. LangChain Deep Agents moved reasoning allocation. Anthropic improved scaffolding around SWE-bench tasks. Nate B. Jones tuned a personal harness. LangChain Skills added progressive disclosure of instructions. GCC reorganized memory as a filesystem. ACE added a curator loop. Six different changes on six different leverage points. If your bench is flat, the first move is not waiting for the next model — it is auditing which of these six levers your harness is leaving on the table. Most teams are leaving multiple.

Takeaway: Six different harness mechanisms produced the six deltas. The lesson is the lever, not any single mechanism.

The Objection You Will Hear

Every receipts table attracts the same three objections. Each has a legitimate steelman and a rebuttal this aggregation specifically anticipates.

“You cherry-picked these.” Steelman: deltas this size attract publication, so a public list of harness wins is biased toward winners by selection — and teams that got no lift do not blog about it. Rebuttal: every source in this table publishes its losses too — LangChain’s harness blog reports max-everywhere underperforming; Anthropic’s SWE-bench post includes scaffolding that did not help. Even adjusting for publication bias, the direction is consistent across six receipts from five organizations over two years, mixed scoring, vendor and academic sources. The mechanism arguments in chapters 03–08 stand on their own merits; the receipts here are calibration, not proof.

“Benchmarks are not real workloads.” Steelman: Terminal Bench, SWE-bench, and internal coding benches are not your production task; your workload may not respond to the same levers. Rebuttal: granted on specific numbers, not on direction. If the harness lever did nothing on real workloads, you would expect at least one of six to be flat or negative. None are. Operator note: when teams replicate a receipt on a public bench and do not see the same lift in production, the cause is usually workload mismatch (your tasks are simpler/harder/shorter/less code-heavy than the benchmark). Mechanism still applies; magnitude rescales.

“The next model release will close the gap anyway.” Steelman: a generation jump could swamp these deltas, wasting the harness investment. Rebuttal: harness deltas and model deltas are additive in published cases, not substitutive. Anthropic’s 33% → 49% rebuild was itself a harness lift on a model already several generations past the previous SWE-bench baseline — harness on top of model, not instead of. The structural claim the six receipts support: harness remains a lever even after you take the new model.

Takeaway: Cherry-pick objection — sources publish losses too, and the direction survives a publication-bias adjustment. Benchmark objection — direction holds across six receipts on different benches. Model-release objection — harness lift stacks on top of model lift, it does not get cancelled.

Do This, Not That

| When you see | Default behavior | Suggested behavior | Why |
|---|---|---|---|
| Your bench is flat after a sprint | "Wait for the next model release." | Audit harness allocations first — reasoning, memory, skills, verification. | Six receipts show harness deltas dwarf model-release deltas in the same period. |
| A vendor blog posts a benchmark jump | Take the headline, screenshot it. | Read for what was held constant. If the model changed, it's a model post. If the model was held, it's a harness receipt. | The column you care about is "what was constant." Most posts bury it. |
| A +30 pp harness claim from a single practitioner | Discount it to zero. | Discount it by half and still treat it as a directional signal. | +30 pp halved is still bigger than most model-version upgrades. The magnitude survives heavy discounting. |
| A teammate proposes "let's wait for Claude 5 / GPT-6" | Accept the wait. | Ask: which of the six leverage points are we not using right now? | Waiting is a posture, not a strategy. The leverage points exist on today's model. |
| A leaderboard climbs by 3 points on a model release | Read this as proof the model is the lever. | Read this as proof that 3 points is what a release buys; harness work buys 10–60. | 3 pp is small for a release; 30 pp is small for a harness rewrite. |
| Your harness has no published numbers behind it | Trust intuition. | Run one of these six receipts on your own task suite before believing the magnitude transfers. | Direction transfers; specific magnitude does not. Replicate before quoting. |

Takeaway: Audit before you wait. Replicate before you quote. Stack before you celebrate.

Gotchas

| Gotcha | Symptom | Fix |
|---|---|---|
| Same-family judge inflates the score | Eval uses the same model as both the agent and the judge; the judge marks its own family's outputs higher. | Use a cross-family judge or deterministic test-based scoring (Terminal Bench style). The receipts above use deterministic scoring or human validation specifically to avoid this. |
| Benchmark contamination | The bench's tasks leaked into training data; the model "passes" tasks it memorized. | Prefer benchmarks released after the model's training cutoff (the human-validated SWE-bench Verified subset is the cleaner reference; vanilla SWE-bench is more vulnerable). |
| Internal-bench overfit | A practitioner tunes a harness on their personal bench; the harness solves the bench, not the workload class. | Treat single-practitioner numbers as directional, not magnitudinal. Replicate on a public bench before quoting magnitude. |
| "Held constant" without a version pin | A "same model" comparison is run across a silent provider update; reasoning-parameter semantics drift. | Pin the model version and reasoning-parameter values in the eval config (see the sketch after this table). Re-run on every silent provider version bump. |
| Vendor measures vendor's own product | LangChain measures Skills on a LangChain-defined task suite; selection bias on what counts as a task. | Include at least one third-party benchmark. Terminal Bench 2.0 plays that role in the table above. |
| Confusing "delta of means" with "delta on every task" | +16 pp on SWE-bench is the aggregate lift; individual tasks may regress. | Inspect per-task pass/fail diffs before declaring no regressions. Aggregate gains can hide localized losses. |
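
For the version-pin gotcha, a minimal sketch of what "pinned" means in practice. The field names and values are illustrative; the point is a dated model snapshot and explicit reasoning parameters written into the eval config, so a silent provider update cannot masquerade as a harness delta.

```python
# Illustrative eval config: every knob that could silently drift is written down.
EVAL_CONFIG = {
    "model": "claude-3-5-sonnet-20241022",  # dated snapshot, never a bare alias
    "reasoning": {"effort": "medium"},      # pin reasoning-parameter values too
    "temperature": 0.0,
    "benchmark": "swe-bench-verified",
    "scoring": "deterministic-tests",       # or a cross-family judge, never same-family
}
```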

Takeaway: Most gotchas reduce to: cross-family judge, pin the model version, demand the per-task distribution, and prefer the third-party bench when stakes are high.

References

  1. [lch-harness2026] LangChain, “Improving Deep Agents with Harness Engineering,” February 2026. https://blog.langchain.com/improving-deep-agents-with-harness-engineering/ — Terminal Bench 2.0 results: 52.8% baseline → 66.5% sandwich on gpt-5.2-codex; harness changes include reasoning-allocation (the sandwich) and verification-loop middleware.
  2. [anthropic-swe] Anthropic, “Raising the bar on SWE-bench Verified,” engineering blog. https://www.anthropic.com/engineering/swe-bench-sonnet — Claude 3.5 Sonnet 33% → 49% on SWE-bench Verified via harness/scaffolding improvements; model held constant across the comparison.
  3. [nbj2026] Nate B. Jones, March 2026 — same-model swing from 42% to 78% on an internal coding benchmark via harness changes. Practitioner-tier; no public primary URL. Originally surfaced in our internal research notes (tacit-web/research/building-org-harness/phase3-compounding-moat.md §1), which references the report without a citable link. Included as a directional signal only; magnitude should be discounted heavily.
  4. [lch-skills2026] LangChain, “Skills,” March 2026. https://blog.langchain.com/langchain-skills/ — Claude Code task pass rate 29% → 95% with progressive-disclosure Skills loaded; same model across both runs.
  5. [gcc-arxiv] “Git Context Controller” (GCC), arXiv 2508.00031v1. https://arxiv.org/html/2508.00031v1 — Self-replication case study: Claude was used to rebuild a Claude-Code-style CLI agent from scratch, then that rebuilt agent was evaluated on SWE-Bench-Lite. Without GCC: 11.7%. With GCC memory protocol: 40.7%. +29 pp delta from harness (memory-as-filesystem) only; deterministic test-based scoring.
  6. [ace-arxiv] ACE — Agentic Context Engineering — generator/reflector/curator framework, arXiv 2510.04618. https://arxiv.org/abs/2510.04618 — +10.6% mean lift across agent benchmarks (and +8.6% on finance tasks); cited via the series outline §08 [Session-Memory Feedback Loop] and tacit-web/research/building-org-harness/phase4-session-memory.md.
  7. [boh-p3] tacit-web/research/building-org-harness/phase3-compounding-moat.md — primary source for the Nate Jones and Anthropic SWE-bench citations (§1), and for the broader argument that models depreciate while harnesses appreciate (§4).

Next chapter: 11 — Build Your Own Harness

One question for the reader: Which of the six leverage points — reasoning allocation, scaffolding, memory, progressive-disclosure skills, verification loops, coordinator structure — is your harness leaving on the table this quarter? If you cannot point to the audit, you are waiting for a model release instead of running the audit.
