Harness Engineering — What This Series Is, and Why You Should Read It in Order

Summary

The model is the CPU. The context is the RAM. The harness is the OS. When models commoditize, the harness becomes the moat. A 13-part series — this hub plus 12 chapters — mapping what to build, in what order, and why your competitor cannot copy it.

Series hub. Read this first to choose where to land. Each chapter stands alone; the order below is the recommended walk-through.

The Receipts, Up Front

Five independent teams. Same models. Harness-only changes. The deltas are not subtle.

Team                    Benchmark                Before     After     Delta
LangChain Deep Agents   Terminal Bench 2.0       52.8%      66.5%     +13.7pp [lch-harness2026]
Anthropic               SWE-bench Verified       33%        49%       +16pp [anthropic-swe]
Nate B. Jones           Internal bench           42%        78%       +36pp [nbj2026]
LangChain Skills        Claude Code task pass    29%        95%       +66pp [lch-skills2026]
ACE framework           Agent benchmarks         baseline   +10.6%    published [ace-arxiv]
GCC                     SWE-Bench multi-step     baseline   +29pp     published [gcc-arxiv]

Every one of those receipts holds the model constant. The lever was the layer around the model — the harness.
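The deltas in the table are percentage points (absolute differences), not relative gains, and the two read very differently. A minimal sketch of the arithmetic, using only the four rows that report both a before and an after score (the ACE and GCC rows publish a delta only, so they are left out). The function names `delta_pp` and `relative_gain` are illustrative, not from any cited source.

```python
# Before/after scores copied from the receipts table (percent of tasks passed).
receipts = {
    "LangChain Deep Agents": (52.8, 66.5),
    "Anthropic": (33.0, 49.0),
    "Nate B. Jones": (42.0, 78.0),
    "LangChain Skills": (29.0, 95.0),
}


def delta_pp(before: float, after: float) -> float:
    """Absolute change in percentage points, rounded to one decimal."""
    return round(after - before, 1)


def relative_gain(before: float, after: float) -> float:
    """Change expressed as a percentage of the starting score."""
    return round((after - before) / before * 100, 1)


for team, (before, after) in receipts.items():
    print(
        f"{team}: +{delta_pp(before, after)}pp absolute, "
        f"+{relative_gain(before, after)}% relative"
    )
```

The table quotes the absolute figure throughout; the relative figure is larger in every row because each starting score is below 100%.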

This series explains what that layer is, what it contains, and how to build one.

The Thesis in One Paragraph

The model is the CPU. The context is the RAM. The harness is the OS, and the organization wrapping the harness is the platform [phil2026]. As models commoditize (and they are commoditizing), the harness becomes the durable competitive layer. Harnesses appreciate: every encoded fix prevents a class of future failures, every session adds material to the org’s memory, and the org-specific context that powers production agents does not transfer between companies [boh-p3]. The twelve chapters that follow document what a harness actually is, the four primitives every working system has converged on, the six mechanics that make agents reliable in production, the org layer that turns those mechanics into a compounding asset, the operator playbook that ships in six weeks, and the ten pitfalls that catch teams late if the plumbing is missing.

What This Series Is Not

  • Not a LangGraph tutorial. Frameworks are mentioned where they appear in production stacks; they are not the subject.
  • Not a vendor comparison. CrewAI vs LangGraph vs OpenAI SDK is the wrong axis. The right axis is interfaces vs backends.
  • Not “what is an agent” 101. The series assumes you have shipped one agent and watched it embarrass you in production.
  • Not a Claude Code feature tour. Claude Code’s source is the most readable production multi-agent system available, so the series cites it heavily — but the goal is portable patterns, not vendor advocacy.
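The "interfaces vs backends" axis from the list above can be made concrete with a small sketch. Everything here is hypothetical and for illustration only: `ModelBackend`, `EchoBackend`, `UpperBackend`, and `Harness` are invented names, not the API of any framework mentioned in this series. The point is only that the harness owns context assembly while the model sits behind a swappable interface.

```python
from typing import Protocol


class ModelBackend(Protocol):
    """Hypothetical interface: anything that maps a prompt to text."""

    def complete(self, prompt: str) -> str: ...


class EchoBackend:
    """Toy backend that returns the prompt with a prefix."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class UpperBackend:
    """A second toy backend, standing in for a different model vendor."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


class Harness:
    """Owns context assembly; the backend behind the interface is swappable."""

    def __init__(self, backend: ModelBackend, org_context: list[str]) -> None:
        self.backend = backend
        self.org_context = org_context

    def run(self, task: str) -> str:
        # The org-specific context is prepended to every task by the harness.
        prompt = "\n".join([*self.org_context, task])
        return self.backend.complete(prompt)


if __name__ == "__main__":
    context = ["rule: be brief"]
    for backend in (EchoBackend(), UpperBackend()):
        print(Harness(backend, context).run("fix bug"))
```

Swapping `EchoBackend` for `UpperBackend` changes nothing in `Harness`, which is the sense in which the harness, not the backend, is the part worth investing in.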

The Map — Hub Plus Twelve Chapters

HARNESS ENGINEERING — THE COMPOUNDING STACK
FOUNDATION (read first if new to the stack)

00  Series overview (you are here)
01  What a harness actually is
02  The four primitives every working system has

MECHANICS (the six things that make agents reliable)

03  Reasoning sandwich           — xhigh at edges, standard in middle
04  Coordinator mode             — three layers, file-IPC, fork prefixes  ★ hero
05  Replay safety                — idempotency cache, replay-class taxonomy
06  Skills as information arch.  — progressive disclosure, 29% → 95%
07  Prompt cache as architecture — the 50–70K-token hidden bill
08  Session-memory feedback loop — ACE + Codified Context + LangChain

ORG LAYER (why the harness compounds)

09  The org-context moat         — HBR, Greylock, NFX, Stripe MCP

OPERATOR (how to build one)

10  The numbers that prove it    — compact receipt sheet
11  Build your own harness       — 6-week plan for 3 engineers
12  The ten pitfalls             — symptom · how-teams-hit-it · cheap fix

How to Read

There is no single correct path, but four reading orders work better than skimming:

Linear (recommended for first-time readers): Ch01 → Ch12 in order. About 3 hours. You will rebuild the mental model from first principles and end with an operator checklist.

Mechanics-only: Ch04 → Ch05 → Ch06 → Ch07 → Ch08. Two hours. If you have already accepted the thesis and want the implementation patterns, this is the spine.

Strategy-first: this hub → Ch09 → Ch10 → Ch11. Ninety minutes. For an engineering leader who is choosing build-vs-buy on the harness layer and needs the economic argument before the mechanics.

Operator: Ch11 → Ch12 → Ch04 → Ch05. Two hours. For a platform engineer with six weeks and a tight scope: read the playbook, then the gotchas, then the two hardest mechanics.

Each chapter ends with: References · Next chapter · One question for the reader.

Prerequisites

The series assumes you can read code in a typed language, have shipped at least one agent that called an LLM API and a tool in a loop, and have watched that agent fail in a way that surprised you. If any of those are missing, the Production Agents deep dive is the right warm-up: it covers the operational surface (idempotency, checkpointing, HITL, cost control) that this series treats as already internalized.

Helpful but not required: familiarity with LangGraph or a comparable graph runtime, exposure to the Claude Code or Cursor agent loops, and a prior read of either Anthropic’s Effective Context Engineering [anthropic-context2025] or Phil Schmid’s Agent Harness 2026 [phil2026] essay.
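For calibration, the "LLM API and a tool in a loop" prerequisite amounts to something like the sketch below. It is a toy under stated assumptions: `scripted_model` stands in for a real LLM call, the `call:`/`final:`/`tool_result:` message format is invented for this example, and `TOOLS` is a hypothetical registry. None of it reflects a specific vendor API.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> function from string to string.
TOOLS: dict[str, Callable[[str], str]] = {
    "word_count": lambda text: str(len(text.split())),
}


def scripted_model(history: list[str]) -> str:
    """Stand-in for an LLM call: request one tool, then give a final answer."""
    if not any(line.startswith("tool_result:") for line in history):
        return "call:word_count:the harness is the OS"
    return "final:" + history[-1].removeprefix("tool_result:")


def run_agent(task: str, max_steps: int = 5) -> str:
    """Loop: ask the model, run any requested tool, feed the result back."""
    history = [f"task:{task}"]
    for _ in range(max_steps):
        reply = scripted_model(history)
        if reply.startswith("final:"):
            return reply.removeprefix("final:")
        # Invented wire format: "call:<tool name>:<argument>".
        _, name, arg = reply.split(":", 2)
        history.append("tool_result:" + TOOLS[name](arg))
    raise RuntimeError("step budget exhausted")


if __name__ == "__main__":
    print(run_agent("count the words"))
```

The loop has a step budget but no idempotency, checkpointing, or cost control, which is roughly the gap the warm-up material is meant to close.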

Companion Pieces (Already Published)

This series builds on, and cross-links to, work already on the site:

The Question Every Chapter Answers

If you have only six weeks and one platform engineer, what do you build first, and how does that investment compound? The chapters answer it from different angles — mechanics, economics, operator playbook — but they all answer the same question.

Start with Chapter 01 — What a Harness Actually Is, or jump straight to the hero chapter, Chapter 04 — Coordinator Mode, if you want the densest single payload.

References
