Prerequisite: Part 1 of the Harness Engineering deep dive. Foundation chapter — every later chapter uses the four-layer stack defined here.
The four-layer stack. Each layer has its own failure modes — and its own leverage.
Why This Matters
Most teams that say they are “doing harness work” are tuning system prompts. The labels are the same; the layers are different. The team that rewrites the system prompt every morning and gets back two points of accuracy thinks they are engineering a harness. They are not. They are operating on the context layer with a prompt-engineering rubric, then naming the activity after the layer above it. The cost is structural: the levers that move agents from demo to production sit above the prompt layer, and you cannot reach them by editing prompts harder.
Three specific arguments this chapter counters. First, the “harness is just a better system prompt” claim — popular on LinkedIn agent posts, wrong about where reliability comes from. Second, the “LangGraph IS the harness” claim — common in framework comparisons, conflating substrate with the discipline that runs on it. Third, the “context engineering is the same thing as harness engineering” claim — Harrison Chase’s framing has overlap, but the two are not identical and the difference matters when you decide what to staff. Each of those claims puts a team’s investment one layer below where the gain is, and each one is fixable once the stack is drawn properly [phil2026, “Agent Harness 2026”][bock2026][hwc2026, “Context Engineering for Agents”].
This chapter is the load-bearing definitions used across the rest of the series. Every later chapter assumes you can place a question on the four-layer stack — model · context · harness · organization — and answer “which layer does this lever pull?” Get that wrong and the receipts in chapter 10 read like luck. Get it right and the receipts read like a stack diagram with numbers attached. Every framing is cited; every distinction is grounded in a named source.
Takeaway: Before you can engineer a harness, you have to be able to point at it. Most teams cannot, and that is why their harness work is prompt work in disguise.
The Stack: Model · Context · Harness · Organization
Four layers, stacked. Each layer is responsible for a different class of decision, fails in a different way, and gets investment from a different role on the team. Phil Schmid’s “Agent Harness 2026” essay is the cleanest public statement of the CPU/RAM/OS analogy [phil2026, “Agent Harness 2026”]; what makes it more than a metaphor is that the substitution rules from operating systems carry over. A CPU runs whatever the OS schedules. RAM holds whatever the OS loads. The OS is where the policy lives. Same here.
Model is the CPU. It is the weights, the attention mechanism, the reasoning channel, and the per-call budget. The model decides one thing per turn: what the next token should be, given the bytes in front of it. A model is commodity in the same sense a CPU is commodity — you pick the tier, you pay the price, and the part you actually buy is fungible across vendors at a similar tier. The failure mode of pushing the model past its scope is the “wait for better models” posture: you keep swapping CPU generations expecting agent behaviour to fix itself, and you get the same single-digit lift each release because the lever was never the CPU [boh-p3, §4].
Context is the RAM. It is the system prompt, the running transcript, tool results, retrieved documents, and short-lived memory that the model can see right now. The context layer is what context engineering operates on — Anthropic’s framing is the cleanest: write (decide what enters), select (decide what survives), compress (decide what summarizes), isolate (decide what stays out) [anthropic-context2025, §“Four Operations”]. Like RAM, the context is finite, volatile, and shared across instructions and data. The failure mode of pushing the context past its scope is “context stuffing” — load everything upfront, watch decision accuracy degrade as input length grows, blame the model for losing the plot.
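To make the four operations concrete, here is a minimal sketch of a context assembler that applies them in order. Everything in it (the character budget, the RAW_BLOB: tag, recency-based selection) is an illustrative assumption, not Anthropic's algorithm; only the four operation names come from the essay [anthropic-context2025, §“Four Operations”].

```python
def assemble_context(system_prompt: str, history: list[str],
                     tool_results: list[str], budget_chars: int) -> str:
    # write: decide what is a candidate for this turn at all
    candidates = history + tool_results

    # isolate: keep bulky raw payloads out of the main thread; a subagent
    # handled them, only the digest belongs here (the tag is our convention)
    candidates = [c for c in candidates if not c.startswith("RAW_BLOB:")]

    # select: most recent items survive under a fixed budget
    kept: list[str] = []
    used = 0
    for chunk in reversed(candidates):
        if used + len(chunk) <= budget_chars:
            kept.append(chunk)
            used += len(chunk)
    kept.reverse()

    # compress: everything dropped collapses to a one-line note; a real
    # harness would call a cheap model to write an actual summary here
    dropped = len(candidates) - len(kept)
    if dropped:
        kept.insert(0, f"[{dropped} earlier items compressed into this note]")

    return "\n\n".join([system_prompt, *kept])
```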
Harness is the OS. It is the tools the agent can call, the skills it can load on demand, the subagents it can spawn, the inter-process protocol between them, the prompt cache discipline, the replay-safety guarantees, the verification loops that fire before “done,” and the telemetry hooks that catch regressions. The harness decides which model gets which context at which moment, what tools are visible, what happens when a worker dies mid-call, and what survives a restart. The failure mode of pushing the harness past its scope is the inverse of the prompt-stuffing one: a team builds a beautiful coordinator and forgets that the organization has no way to keep its lessons.
Organization is the platform. It is the codified workflows, the institutional memory, the post-incident decisions, the AGENTS.md files, the runbooks, and the slow accumulation of “what this company knows that nobody else does.” Harvard Business Review calls it “demonstrated execution”: the workflows teams actually follow across systems, the signals they respond to, the order in which roles get involved [boh-p3, §2]. The failure mode at this layer is the most expensive: a team builds a strong harness, ships an agent, and on the day the lead engineer leaves the harness’s institutional knowledge walks out with them because nothing was encoded in files the agent can read.
```
┌──────────────────────────────────────────────────────────────┐
│ ORGANIZATION (PLATFORM)                                      │
│ AGENTS.md · workflows · post-incident decisions · runbooks   │
│ "what your team alone knows how to do"                       │
├──────────────────────────────────────────────────────────────┤
│ HARNESS (OS)                                                 │
│ tools · skills · subagents · IPC · prompt cache · replay     │
│ verification loops · telemetry · sandbox lifecycle           │
├──────────────────────────────────────────────────────────────┤
│ CONTEXT (RAM)                                                │
│ system prompt · transcript · tool results · retrieval        │
│ short-lived memory (write / select / compress / isolate)     │
├──────────────────────────────────────────────────────────────┤
│ MODEL (CPU)                                                  │
│ weights · attention · reasoning budget · per-call tier       │
└──────────────────────────────────────────────────────────────┘
                 ▲ value moves UP as lower layers commoditize
```
Takeaway: Four layers, four decisions, four failure modes. If you cannot place a lever on this stack, you cannot debate whether it belongs in your harness.
What Schmid Underspecifies
Schmid’s essay gives us the stack. What it does not give us — and what the rest of this series fills in — is the operator playbook against the stack [phil2026, “Agent Harness 2026”]. The framing is correct, the analogy holds, the layer count is right, but the essay does not say what an engineer should do on each layer this quarter, in what order, with what budget.
Where Schmid draws the stack, chapter 11 draws a 6-week schedule against it; where Schmid names the OS layer, chapter 04 reads the OS-equivalent code shipping inside Claude Code today. That substitution (a schedule for the diagram, a code reading for the name) is the series’ contribution; the operator playbook is the gap Schmid leaves open.
Takeaway: Schmid gives us the stack. The series gives us the operator playbook against it.
The Bockeler Framing
Birgitta Bockeler’s “Harness Engineering” article on Martin Fowler’s site frames harness engineering as a discipline that sits above prompt engineering and context engineering, not as a renaming of either [bock2026]. Bockeler clusters the harness’s job into three concerns [boh-p3, §10][bock2026]:
- Context — what gets loaded into the agent’s working memory on this turn: files, retrieval results, prior turns.
- Constraints — what the agent is and is not allowed to do: tool permissions, sandbox boundaries, write paths.
- Garbage collection — how the harness reaps stale state, cleans up resources, and manages session lifecycle so old work does not poison new work.
The triad we will track across the rest of this series is the chapter’s own synthesis, extending Bockeler’s framing: constraints, verification, lifecycle. Verification — what counts as “done” — does not appear by that name in Bockeler’s three, but it is implicit in every harness she discusses and is load-bearing enough in production that we name it explicitly. Lifecycle is the broader cousin of Bockeler’s “garbage collection”: we want the same cleanup discipline plus the start/persist/resume side of the same problem. Constraints we keep verbatim.
Whichever triad you prefer, the same observation holds: these are the places a context-engineering effort cannot reach. You can write the most elegant system prompt in the world and it cannot stop a worker from writing through a symlink, cannot decide whether a test pass is enough to ship, cannot reap a stale scratchpad, and cannot resume a workflow when the process dies between two tool calls. Those jobs require code: O_NOFOLLOW on the output file, a verification middleware that fires before completion, an idempotency cache keyed on tool call IDs, an eviction policy on the memory directory. That is harness work. The fact that the prompt layer cannot do any of it is the strongest case for treating the harness as a separate discipline.
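Two of those jobs, sketched. A minimal illustration assuming a POSIX filesystem; the names (write_output, IdempotentToolRunner) are hypothetical, not from any cited harness:

```python
import os

def write_output(path: str, data: bytes) -> None:
    # Constraint: O_NOFOLLOW makes open() fail with OSError (ELOOP) if the
    # final path component is a symlink, so a worker cannot be steered
    # into writing through one. No prompt can enforce this.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_NOFOLLOW, 0o600)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)

class IdempotentToolRunner:
    # Lifecycle: side-effecting tool calls are keyed on their call ID, so
    # a replayed transcript re-uses the recorded result instead of
    # re-executing the effect (sending the email twice, and so on).
    def __init__(self) -> None:
        self._seen: dict[str, object] = {}

    def run(self, call_id: str, effect):
        if call_id not in self._seen:
            self._seen[call_id] = effect()
        return self._seen[call_id]
```

Both guarantees are enforced by code the model never sees, which is exactly why no prompt can substitute for them.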
Takeaway: Bockeler names context, constraints, and garbage collection. This series tracks constraints, verification, and lifecycle as the operator’s working set. Either way, the prompt layer cannot do any of them.
What a Harness Is Not
The clearest way to define the harness is to draw a circle around it and put the things people confuse it with on the outside. Four neighbors are routinely mistaken for the harness itself.
A framework is not the harness. LangGraph, CrewAI, LangChain Deep Agents, the OpenAI Agents SDK — these are substrates the harness is built on. A framework gives you a graph runtime, a tool-call abstraction, a streaming API, and a serialization format. None of that is your harness. Your harness is the policy that runs on the framework: which subagents you spawn for which tasks, which tools you expose to which workers, what your verification loop checks before declaring done, what your idempotency cache keys on. Two teams using the same framework can have radically different harnesses, and one will ship while the other will not. The “LangGraph IS the harness” claim collapses substrate and policy. Substrate is necessary; substrate is not sufficient.
Prompt engineering is not the harness. Prompt engineering works on the model + context boundary. It is the discipline of choosing words, ordering instructions, and shaping the immediate input the model sees on this turn. Excellent prompt engineering is a prerequisite for a working agent, and a substitute for nothing higher up the stack. A team that spends six weeks rewriting prompts to fix flaky multi-step behaviour is operating on the context layer, treating the symptoms of harness failures (missing verification, missing replay safety, missing tool-result cache invalidation). The prompt is downstream of the harness’s decisions, not the other way around.
Context engineering is not the harness either — though this is the closest neighbor and the one most likely to provoke an argument. Harrison Chase has called context engineering the agent moat (the layer where defensibility accrues) [hwc2026, “Context Engineering for Agents”]; Anthropic’s “Effective Context Engineering” essay names it as a distinct discipline with four operations — write, select, compress, isolate [anthropic-context2025, §“Four Operations”]. The two framings agree on the activity but place the boundary differently. Context engineering operates at the context + harness boundary: it decides what enters the model’s RAM. Harness engineering owns the harness layer in full: it decides what tools exist, what runs in parallel, what survives a restart, and what the verification loop counts as done. Context engineering is a subset of harness engineering in Bockeler’s framing, and a peer in Chase’s. The dominant industry term today is “context engineering” — Chase, Willison, Karpathy, Lutke, and Schmid all use it as the umbrella label, and Bockeler’s narrower scope is the minority position [boh-p3, §10]. We adopt the narrower framing in this series because it puts production-reliability levers — verification, lifecycle, replay safety — at the layer where they belong, not as features of the prompt. If you prefer Chase’s peer framing the rest of the chapters still work; you will just have to translate “harness owns verification” into “context engineering’s harness peer owns verification” every few pages.
The agent is not the harness. The agent is the application running on top — the user-facing thing that takes a goal and returns a result. Two engineers reaching for the word “agent” can mean two different things: one means the user-facing product surface, the other means the harness underneath. Schmid’s framing keeps them separate. So does this series.
| Confusion | What it actually is | Where it operates | When the boundary blurs |
|---|---|---|---|
| Framework (LangGraph, CrewAI, Agents SDK) | A substrate the harness is built on | Below the harness | Frameworks ship opinionated middleware that does some harness work |
| Prompt engineering | Word-choice and instruction-shaping for the current turn | Context layer (model + context boundary) | A system prompt that encodes verification policy is harness work expressed as prose |
| Context engineering | Deciding what enters the context window: write / select / compress / isolate | Context + harness boundary | Subagent isolation is harness-owned, but its purpose is context curation |
| The agent | The user-facing application running on top | Above the harness | “Agent” colloquially gets used for the whole stack — disambiguate before debating |
Takeaway: Framework is the substrate. Prompt engineering operates a layer below. Context engineering operates at the boundary. The agent is the application on top. The harness is the OS where production reliability is engineered.
The Boundary Problem: When Layers Bleed
The four layers are clean enough to teach with and messy enough in practice that operators need to know where the bleed happens. Three boundaries blur reliably, and pretending they do not is how teams end up debating “is this a context fix or a harness fix” instead of fixing the thing.
Memory implementations span context + harness. The system prompt that says “you may consult /notes/session.md” lives in the context layer; the file lifecycle, the per-worker key prefix, the lock policy, the disk eviction rule all live in the harness layer. ACE (Agentic Context Engineering, arxiv 2510.04618) names a generator/reflector/curator loop in the context-engineering literature, but its curator runs in the harness — the curator is a separate call with its own model, its own budget, and its own write surface, scheduled by the harness against an organizational policy [ace-arxiv][boh-p3, §8]. If you call this “context engineering” you under-resource the lifecycle work; if you call it “harness engineering” you under-resource the prompt design. The honest answer is that memory is a cross-cut and needs ownership in both halves.
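The harness half of that cross-cut fits in a few lines. A minimal sketch assuming file-backed notes; the directory layout, the 24-hour threshold, and the worker_id prefix convention are illustrative assumptions, not ACE’s design:

```python
import time
from pathlib import Path

def worker_note_path(memory_dir: Path, worker_id: str) -> Path:
    # Per-worker key prefix: disjoint write surfaces mean two parallel
    # workers cannot clobber each other's notes.
    return memory_dir / f"{worker_id}--session.md"

def evict_stale_notes(memory_dir: Path, max_age_s: float = 24 * 3600) -> int:
    # Eviction: reap note files untouched for a day so old work cannot
    # poison new sessions. Returns the number of files removed.
    now = time.time()
    evicted = 0
    for note in memory_dir.glob("*.md"):
        if now - note.stat().st_mtime > max_age_s:
            note.unlink()
            evicted += 1
    return evicted
```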
Skills span harness + context. A skill is a markdown file the model reads on demand [hwc2026, “Skills”]. The decision to load it — by name, by directory listing, by description match — is a harness decision; the prose inside the skill is a context-layer artifact authored with prompt-engineering rules. The 29% → 95% Claude Code result reported on LangChain’s Skills writeup comes from a harness change (progressive disclosure of skill bodies) producing a context-layer effect (decision accuracy on a fixed test set) [lch-skills2026]. Chapter 06 walks through this in detail; what matters here is that “skills” is not in one layer.
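A minimal sketch of the harness side of progressive disclosure, assuming one markdown file per skill with a one-line description up top; the layout and the registry API are illustrative assumptions, not Claude Code’s actual mechanism:

```python
from pathlib import Path

class SkillRegistry:
    def __init__(self, skills_dir: Path):
        self.skills_dir = skills_dir

    def index(self) -> str:
        # Cheap, always in context: one line per skill. We treat the
        # first line of each file as its description by convention.
        entries = []
        for f in sorted(self.skills_dir.glob("*.md")):
            lines = f.read_text().splitlines()
            entries.append(f"- {f.stem}: {lines[0] if lines else ''}")
        return "\n".join(entries)

    def load(self, name: str) -> str:
        # Expensive, on demand: the full body enters context only when
        # the agent asks for this skill by name.
        return (self.skills_dir / f"{name}.md").read_text()
```

The index is a context-layer artifact; the decision to call load() is the harness change that produced the context-layer effect.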
Coordinator mode spans harness + organization. The coordinator/worker/scratchpad pattern is a harness construct [cci2026, §1]; the decision about which workflows get a coordinator, which roles approve plans, and which kinds of work justify spawning workers is an organizational one. A harness that lets anyone spawn a coordinator without policy is not a feature, it is a bill. The same primitive that makes the harness powerful makes the organization’s policy load-bearing.
The bleed is not a defect. It is a feature of any expressive stack — OSes have the same problem at the kernel/userspace boundary, and the answer there is the same as the answer here: name the boundary, name the owner, and document the cross-cut. The series will name these explicitly as they come up. The point of this chapter is that an operator who knows the bleed exists can plan around it; one who pretends the layers are airtight will spend the year arguing about whether a thing is “really” context engineering.
Takeaway: Three boundaries blur in practice — memory, skills, and coordinator policy. Name the bleed; do not pretend the layers are airtight.
Why the Distinction Is Load-Bearing
The four-layer stack is not a taxonomic preference. Mis-placing a lever has three concrete consequences, each of which can be measured on a team’s spend or shipped on the calendar.
Consequence one: teams reach for model upgrades when they should rebuild the harness. Anthropic’s own data shows Claude 3.5 Sonnet moving from 33% to 49% on SWE-bench Verified through harness changes alone — no retrain, no version bump [anthropic-context2025, §“Tool Design”][boh-p3, §6]. The six receipts catalogued in chapter 10, “The Numbers That Prove It” (LangChain DA, Anthropic SWE-bench, Nate B. Jones 42%→78%, LangChain Skills 29%→95%, GCC, ACE), extend the same finding across five independent organizations, all double-digit, all model-held-constant. A team that interprets a flat benchmark as “the model isn’t smart enough yet” will write the next quarter off waiting for a release that delivers a fraction of what a harness rebuild would. The mis-attribution is what costs the quarter, not the model.
Consequence two: teams over-invest in prompt tweaks when the gain is at the IPC layer. The coordinator pattern in chapter 04 — fork primitive, XML task notifications, file-based mailboxes — produces multi-agent reliability gains that no system-prompt rewrite can reach. A team that diagnoses “the agents are talking past each other” as a prompt problem will spend the cycle rewording prompts; the real fix is a fixed XML envelope on the IPC layer with status, summary, result, and usage fields. The prompt layer and the IPC layer are both real, but they do different work. Prompts shape the model’s behaviour on this turn; IPC determines what survives the turn boundary.
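A minimal sketch of such an envelope. The four field names come from the paragraph above; the element names and the helper itself are illustrative assumptions, not the [cci2026] wire format:

```python
import xml.etree.ElementTree as ET

def worker_envelope(status: str, summary: str, result: str,
                    input_tokens: int, output_tokens: int) -> str:
    # Fixed envelope: the coordinator parses structure, never prose.
    root = ET.Element("task_result")
    ET.SubElement(root, "status").text = status    # e.g. ok | error | partial
    ET.SubElement(root, "summary").text = summary  # one line for triage
    ET.SubElement(root, "result").text = result    # full payload
    usage = ET.SubElement(root, "usage")
    usage.set("input_tokens", str(input_tokens))
    usage.set("output_tokens", str(output_tokens))
    return ET.tostring(root, encoding="unicode")
```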
Consequence three: orgs hire prompt engineers when they need platform engineers. This is the most expensive version of the mistake and the one that takes longest to notice. The job title “prompt engineer” maps to the context layer. The job title “platform engineer” or “AI infra engineer” maps to the harness and organization layers. A team that staffs prompt engineers exclusively will have excellent system prompts and no replay safety. A team that staffs platform engineers exclusively will have flawless idempotency and an agent that says “I am ready to help” three different ways. You need both; the staffing decision is layer-aware. The first sentence of any agent-team JD should say which layer the role owns.
The cumulative effect: harness work gets confused with prompt work, prompt work with context work, context work with model selection. Money flows toward the visible artifact and the quotable lever, and the harness — doing the load-bearing structural work in the middle — gets a junior assignment because nobody can point at it on a diagram. The four-layer stack is the diagram.
The strongest counter-view. Not everyone agrees the harness layer is where the leverage is. Boris Cherny, an engineer on Anthropic’s Claude Code team, has argued that “all the secret sauce is in the model” and that the harness should be “the thinnest possible wrapper” around it [boh-p3, §9]. Noam Brown at OpenAI has made the related argument that reasoning models will eventually subsume the scaffolding [boh-p3, §9]. Both positions disagree with the thesis of this series. The receipts in chapter 10 are why we disagree back — six double-digit swings with the model held constant are hard to explain if the harness is just packaging. But a reader who finds those receipts unconvincing has Cherny and Brown on their side, and the rest of this series should be read with that disagreement out in the open rather than papered over.
Takeaway: Mis-placing a lever costs a quarter, a budget, or a hire. The four-layer stack is the diagram that prevents the mis-placement — even if you end up siding with Cherny on where the sauce sits.
Do This, Not That
| Pattern | Wrong layer | Right layer | Why |
|---|---|---|---|
| Flaky multi-step agent across tool calls | Rewrite system prompt | Harness — add replay-safety idempotency cache | Lifecycle is a harness concern; prompts cannot make a tool call replayable [bock2026] |
| Two workers clobber each other’s writes | Add a “be careful” line to the prompt | Harness — route write-heavy spawns through worktree isolation | Concurrency is a constraint; constraints are harness-owned [cci2026, §1] |
| Benchmark flat after sprint | Wait for the next model | Harness — audit reasoning allocation, skills, verification, memory | Harness deltas exceed model-release deltas across six receipts in chapter 10 [boh-p3, §6] |
| Agent declares done before tests pass | Tell the model to test more carefully | Harness — add a verification middleware that gates “done” on real test output (see the sketch after this table) | Verification is a harness concern; self-reports are not evidence [bock2026] |
| Each session relearns the same lesson | Bigger context window | Organization — encode lesson in AGENTS.md or skills registry | Institutional memory lives at the platform layer, not in RAM [boh-p3, §2] |
| Tools are slow because the prompt has 200 of them | Ask the model to “only use the relevant tools” | Harness — progressive disclosure: load descriptions, lazy-load bodies | “Fewer tools beat many tools” applied to instructions [hwc2026, “Skills”][lch-skills2026] |
| Token spend explodes on every parallel spawn | Reduce the number of subagents | Harness — byte-identical fork prefix so workers share the prompt cache | Cache pricing is architecture, not a tuning concern [cci2026, §4] |
| The same regression keeps slipping back | Add a stricter system prompt about regressions | Organization — record the regression class in a runbook the agent can read | A prompt is per-turn; a runbook persists across sessions [boh-p3, §2] |
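The verification-middleware row deserves a sketch, because it is the fix teams resist longest. A minimal illustration; `pytest -q` is an assumed project check, standing in for whatever command produces ground-truth evidence in your repo:

```python
import subprocess

def verified_done(agent_claims_done: bool) -> tuple[bool, str]:
    # Gate "done" on evidence, not on the agent's self-report.
    if not agent_claims_done:
        return False, "agent has not claimed completion"
    proc = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    if proc.returncode != 0:
        # Feed the real failure output back into the agent's next turn
        # instead of shipping on the strength of a claim.
        return False, proc.stdout + proc.stderr
    return True, "tests pass"
```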
Takeaway: When a fix lands on a higher layer than the prompt, it usually sticks. When it lands on the prompt, it usually regresses.
Gotchas
| Gotcha | Symptom | Fix |
|---|---|---|
| Conflating framework with harness | Team picks LangGraph and declares the harness done; ships nothing | Frameworks are substrate. The harness is the policy on top — tools, verification, lifecycle, IPC. Pick the framework, then design the harness. |
| Calling memory work “context engineering” only | Memory ships without lifecycle policy; sessions corrupt each other | Memory is a cross-cut. Stage the context layer (what enters) and the harness layer (eviction, lock policy, write surface) as separate workstreams with separate owners. |
| Hiring only prompt engineers for an agent team | Excellent prompts, no replay safety, no verification, no idempotency | Layer-aware staffing: at least one engineer owning the harness layer end-to-end. The first sentence of the JD names the layer. |
| Treating context engineering and harness engineering as synonyms | Operator scope confusion; constraints and lifecycle work fall through the cracks | Adopt Bockeler’s stronger framing for this series: context engineering is a sub-discipline of harness engineering. Document the choice so the team uses one vocabulary. |
| Attributing harness failures to the model | “Wait for the next release” becomes the default response to flat metrics | Run a single harness audit against the four-layer stack before any model swap. The six receipts in chapter 10 show harness deltas dominate model-release deltas in the same window [boh-p3, §6]. |
| Letting the agent layer leak into the harness conversation | “The agent should know to do X” ends every architecture meeting | The agent is the application. The harness decides what the agent can do. When “the agent should…” appears in a design doc, push the decision down to the harness layer where it can be enforced. |
| Confusing prompt-cache work with prompt engineering | Cache-write costs spike after every prompt edit; nobody knows why | The prompt cache is an architecture concern at the harness layer [cci2026, §4]. Cache-stable prefix design, dynamic-boundary markers, and break telemetry belong with the harness, not the copywriter (see the sketch after this table). |
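A minimal sketch of cache-stable prefix discipline plus break telemetry. The boundary-marker string is an illustrative convention, not a provider API; some providers mark cache boundaries through explicit request parameters, others cache prefixes automatically:

```python
CACHE_MARKER = "<!-- cache boundary: dynamic content below -->"

def build_prompt(stable_prefix: str, dynamic_parts: list[str]) -> str:
    # Everything before the marker is byte-identical across calls (and
    # across forked workers), so the provider's prefix cache can reuse it.
    return stable_prefix + "\n" + CACHE_MARKER + "\n" + "\n".join(dynamic_parts)

def assert_prefix_stable(prev_prompt: str, next_prompt: str) -> None:
    # Break telemetry: fail loudly the moment an edit lands in the
    # cached prefix, instead of discovering it on the next invoice.
    if prev_prompt.split(CACHE_MARKER)[0] != next_prompt.split(CACHE_MARKER)[0]:
        raise ValueError("prompt-cache prefix changed: expect cache-write respike")
```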
Takeaway: Most gotchas reduce to: name the layer before you name the fix. If a meeting is debating “the agent should…”, it is one layer too high.
What the Stack Teaches About the Rest of the Series
The rest of the series is the stack expanded. Chapter 02 names the four primitives every working harness has converged on. Every later chapter places one mechanic on the stack — reasoning sandwich at the context boundary, coordinator mode at the harness IPC, prompt cache at the harness architecture, session memory at the harness/organization seam.
Takeaway: From here on, every chapter answers one question: which layer does this lever pull? Hold the four-layer stack as you read.
References
- [phil2026] Philipp Schmid, “Agent Harness 2026,” February 2026. https://www.philschmid.de/agent-harness-2026 — Introduces the model-as-CPU / context-as-RAM / harness-as-OS / agent-as-application framing this chapter builds on. Public-web essay; the cleanest single statement of the four-layer stack.
- [bock2026] Birgitta Bockeler, “Harness Engineering,” Martin Fowler’s site, February 2026. https://martinfowler.com/articles/exploring-gen-ai/harness-engineering.html — Frames harness engineering as a discipline above prompt and context engineering. Bockeler’s three concerns are: context, constraints, garbage collection. This chapter extends that framing to a working triad of constraints, verification, lifecycle — see §“The Bockeler Framing” for the rationale.
- [hwc2026] Harrison Chase / LangChain, context engineering essays and the Sequoia “Context Engineering Our Way to Long-Horizon Agents” podcast. https://blog.langchain.com/context-engineering-for-agents/ and https://www.sequoiacap.com/podcast/training-data-harrison-chase/ — Source for context engineering as the named moat layer and progressive-disclosure Skills as a harness mechanism. We adopt Bockeler’s stronger framing (context-eng ⊂ harness-eng) while acknowledging Chase’s peer framing.
- [anthropic-context2025] Anthropic Applied AI Team, “Effective Context Engineering for AI Agents,” September 2025. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents — Four-operation framing of context engineering: write, select, compress, isolate. Used here to scope what the context layer is responsible for.
- [cci2026] `tacit-web/research/cc-internals/src-analysis-05-agents-coordination.md` — Direct source analysis of Claude Code, dated 2026-04-01. Cited here for the harness-layer concerns (tools, subagents, IPC, prompt cache, sandbox lifecycle) covered in depth in chapter 04.
- [boh-p3] `tacit-web/research/building-org-harness/phase3-compounding-moat.md` — Internal research, March 2026. §1 (Mitchell Hashimoto compounding-engineering quote; Nate B. Jones 42% → 78% same-model swing), §2 (HBR organization-as-platform argument), §4 (models depreciate / harnesses appreciate), §6 (practitioner ROI receipts), §8 (ACE generator/reflector/curator in session memory), §9 (Cherny / Brown / METR counter-positions), §10 (Bockeler triad: context, constraints, garbage collection). Source for the organization-layer framing, the harness-vs-model-release comparison, and the counter-positions surfaced in this chapter.
- [ace-arxiv] Agentic Context Engineering (ACE), arXiv:2510.04618, 2025. https://arxiv.org/abs/2510.04618 — Source for the generator/reflector/curator loop referenced in the boundary-problem discussion of memory.
- [lch-skills2026] LangChain, “Skills” blog post / writeup, 2026. Source for the Claude Code 29% → 95% result on skill-bearing tasks driven by progressive disclosure of skill bodies. Cited alongside [hwc2026] because the result is in the Skills writeup specifically, not in the broader context-engineering essays.
Next chapter: 02 — The Four Primitives Every Working Agent System Has
One question for the reader: Pick the last “agent improvement” your team shipped. On which of the four layers — model, context, harness, organization — did the fix actually land? If you cannot answer, the fix did not stick, and the next quarter’s regression will arrive at the same spot.