Part 12 of the Harness Engineering deep dive — the closing chapter: the seven pitfalls from the gold-seams source-map plus three more from the chapters in this series.
The bill is the lagging signal. The plumbing is the leading one. Retire the class, not the instance.
Why This Matters
Most “agent pitfalls” content on the open web is generic. The list reads “agents hallucinate,” “watch your costs,” “test your prompts” — categories rather than mechanisms, vibes rather than fixes. The framing collapses into bullet points a junior could write from a blog skim, and ships the same five lessons six different ways. That style produces conference-talk content; it does not produce a checklist you can take into a sprint.
The ten pitfalls in this chapter are the opposite shape — structural pitfalls with named mechanisms, measurable symptoms, and a single cheap fix per class [ai-gs2026]. Seven come directly from the gold-seams source-map that anchored this series; three come from the chapters in this same series that named the operational shape of the architecture. Each pitfall is the moment a default that looks fine on day one breaks under load or at scale. The seductive defaults are not careless picks. They are the choices a reasonable team makes shipping the POC: eager-load every skill body for full context, match judge model to agent model for one config, share a Postgres because swapping is “easy later.” Each of those defaults is the right answer for one week and the wrong answer for the next twelve months.
The receipts that catch each pitfall are operational, not philosophical. LangGraph issue #536 is a real GitHub bug with a real migration story [ai-gs2026]. arXiv 2410.21819 is the same-family-judge inflation paper [ai-gs2026]. The 50–70K-token cache-break bill is named in the Claude Code source comments [cci2026-gems, §2]. The 29-to-95 receipt for lazy-loaded skills is the LangChain Skills write-up [lch-skills2026]. The pitfalls are not opinions; they are events with sources, and the fixes are plumbing decisions a three-person team can ship in an afternoon.
Takeaway: Generic “agent pitfalls” lists are blog filler. The ten pitfalls here are structural — named mechanism, measurable symptom, cheap fix per class. The plumbing retires the class; the bill is what you avoid by not having to debug the instance.
The Shape Every Pitfall Shares
Every pitfall in this chapter has the same three-part shape — symptom, how-teams-hit-it, cheap fix [ai-gs2026]. The triad is not stylistic; it is the only shape that makes a pitfall operationally useful.
The symptom is what surfaces to the operator. It is the thing the team sees on a Monday morning — an in-flight workflow died, the bill doubled, a “neutral” PR was blocked by a gate that should have been green. Symptoms are diagnostic-friendly when they are concrete; “agents hallucinate” is not a symptom because it does not name an event to inspect. “send_email fired twice on resume” is a symptom because it is one log line away.
How teams hit it names the default-that-fails-late — the choice that looked fine, shipped in a hundred tutorials, and broke after the team absorbed it as an assumption. Naming the default explicitly is what lets a reviewer catch the pitfall in code review instead of in production. Without the default named, the pitfall is invisible until the symptom surfaces; with it named, every PR that touches the default surface is a gate.
The cheap fix is the single piece of plumbing that retires the class — not the instance. The instance is the next send_email-fires-twice. The class is every tool with a side effect, replayed through the LangGraph default. Retiring the class is one replay_class declaration plus a tool_call_results cache; retiring the instance is a one-off patch that does not generalize, and the next tool with side effects re-introduces the bug. The cheap-fix discipline is to write the plumbing once and let it cover every future instance.
```
┌──────────────────────────────────────────────────────────────┐
│  SYMPTOM         The Monday-morning event the operator sees  │
│  ─────────       (lagging signal; the bill is the receipt)   │
│                                                              │
│  HOW TEAMS HIT   The seductive default that looked fine on   │
│  ─────────────   day one and breaks late                     │
│                                                              │
│  CHEAP FIX       The single piece of plumbing that retires   │
│  ─────────       the CLASS (not the instance)                │
└──────────────────────────────────────────────────────────────┘

Class-not-instance rule:

  INSTANCE FIX → "patch the one workflow that broke"
               → next workflow with same shape: bug returns

  CLASS FIX    → "every tool declares replay_class at registration"
               → next workflow with same shape: bug cannot land

The receipt of a class fix is that the next instance does not exist.
```
Takeaway: Symptom is what surfaces. How-teams-hit-it is the default-that-fails-late. Cheap fix is the one piece of plumbing that retires the class, not the instance. Class fixes are the only ones that compound.
Pitfall 1 — Checkpointer Schema Migration Kills In-Flight Runs
Symptom. After a minor LangGraph version bump, in-flight workflow runs die silently on resume. The trace shows the run starting, then a deserialization error on the first checkpoint read, then nothing. Workflows that paused for HITL approval over a weekend are dead by Monday; the user-facing receipt is a stuck task with no recovery path [ai-gs2026].
How teams hit it. LangGraph has no built-in checkpointer migration tool — issue #536 names it directly — and serialization between minor versions can shift in ways that silently break replay [ai-gs2026]. The day-one default is to share one graph_id across schema-changing deploys because the graph is “logically the same” graph; the team upgrades, the serialization format shifts underneath the stored checkpoints, and the in-flight runs become unresumable. The break is invisible at upgrade time because new runs work; the corruption only surfaces on the next resume of an old run.
Cheap fix. Three pieces of plumbing land together [ai-gs2026]. Versioned graph IDs (name@semver) make every schema-changing deploy a fresh graph rather than a corrupted continuation. Blue/green deploy lets old runs drain on the old schema while new runs land on the new one. A CI checkpoint-replay test against the last 50 production checkpoints catches a schema break before the deploy reaches production. Together they retire the class — never share a graph_id across breaking schema changes, never deploy a schema change without a replay test, never assume “minor version” means “safe to skip the regression check.”
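A minimal sketch of the replay gate; the state fields (messages, pending_tool_calls, node_cursor) are hypothetical stand-ins for whatever your graph resumes from. Nothing here is a LangGraph API: issue #536 is precisely that LangGraph ships no migration tooling, so the gate lives in your harness's CI.

```python
import json

GRAPH_SEMVER = "2.3.0"  # bump on every state-schema change
GRAPH_ID = f"order-fulfillment@{GRAPH_SEMVER}"  # versioned: new schema, new graph

def deserialize_checkpoint(raw: bytes) -> dict:
    """Stand-in for the checkpointer's read path. Touch every field the
    graph resumes from, so a shifted schema fails here, not on resume."""
    state = json.loads(raw)
    _ = state["messages"], state["pending_tool_calls"], state["node_cursor"]
    return state

def test_replay_last_50_production_checkpoints(checkpoints: list[bytes]) -> None:
    """CI gate: green means this build is resume-safe for runs already in
    flight under GRAPH_ID; red means bump GRAPH_SEMVER and deploy
    blue/green so old runs drain on the old schema."""
    for raw in checkpoints:
        assert deserialize_checkpoint(raw) is not None
```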
Takeaway: Versioned graph IDs plus blue/green plus a CI replay test on recent checkpoints. The default of one graph_id forever is the trap; semver on the graph plus replay-as-gate is the plumbing.
Pitfall 2 — Tool Replay Duplicates Side Effects
Symptom. A workflow paused for a HITL approval, resumed on Monday, and send_email fired twice — once on the original execution, once on the replay. The user gets a duplicate notification; the audit log shows two send events on the same logical request; the team writes a one-off retry-guard for the offending tool and ships the patch with a sigh [ai-gs2026].
How teams hit it. LangGraph’s default behavior on resume is to replay nodes, including the tool calls that ran before the pause. The path of least resistance is agent code that never classifies tool side effects — every tool is treated as a pure function until proven otherwise — and the replay treats send_email, charge_card, and regex_search identically [ai-gs2026]. The instance fix is “wrap send_email in an idempotency key.” The class fix is the one missing from most harnesses.
Cheap fix. A replay_class taxonomy declared at tool registration — the source names unsafe_on_replay as the hard-throw tier; a typical implementation pairs it with a pure tier for freely-replayable tools and an idempotent_with_key tier that dedupes through a tool_call_results cache the harness consults before every replayed call [ai-gs2026]. The taxonomy is enforced at registration time; unsafe_on_replay tools throw hard if the cache does not contain a stored result for the prior call, forcing the resume path to short-circuit to the cached value rather than re-fire the side effect. The chapter on replay safety (Ch05) makes this the load-bearing mechanism behind “stop redundant work.” Without it, every HITL pause is an audit-log incident waiting to happen.
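A sketch of the taxonomy-plus-cache mechanic. The names replay_class, tool_call_results, and unsafe_on_replay come from the source [ai-gs2026]; the registry and dispatcher shapes are illustrative, not any framework's API.

```python
from enum import Enum
from typing import Any, Callable

class ReplayClass(Enum):
    PURE = "pure"                                # free to re-run on resume
    IDEMPOTENT_WITH_KEY = "idempotent_with_key"  # re-run dedupes through the cache
    UNSAFE_ON_REPLAY = "unsafe_on_replay"        # must never re-fire

tool_registry: dict[str, ReplayClass] = {}
tool_call_results: dict[str, Any] = {}           # call_id -> stored result

def register_tool(name: str, replay_class: ReplayClass) -> None:
    tool_registry[name] = replay_class           # classification is mandatory

def execute_on_replay(name: str, call_id: str, run: Callable[[], Any]) -> Any:
    if call_id in tool_call_results:             # cache consulted before every call
        return tool_call_results[call_id]
    if tool_registry[name] is ReplayClass.UNSAFE_ON_REPLAY:
        # Throw hard: no cached result means the original outcome was lost,
        # and re-firing the call is exactly the duplicate-send_email bug.
        raise RuntimeError(f"{name} is unsafe_on_replay with no cached result")
    result = run()
    tool_call_results[call_id] = result
    return result

register_tool("send_email", ReplayClass.UNSAFE_ON_REPLAY)
register_tool("regex_search", ReplayClass.PURE)
```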
Takeaway: Declare replay_class at registration; cache tool_call_results; let unsafe_on_replay throw hard. The default of “every tool is pure” is the trap; classification-plus-cache is the plumbing.
Pitfall 3 — Same-Family LLM Judge Inflates Pass Rate 5–10pp
Symptom. The eval gate reports a 95% pass rate; the change ships; the agent regresses on real traffic. Investigation surfaces that the judge was Claude-Opus and the agent was Claude-Sonnet — same family, same training distribution, same blind spots [ai-gs2026]. The judge’s perplexity on the agent’s outputs is systematically lower than its perplexity on outputs from other families, and that perplexity gap shows up as a 5–10-percentage-point pass-rate inflation [ai-gs2026].
How teams hit it. Default eval setups use whichever closed model the team has a contract for, and same-vendor judging is the path of least resistance — one API key, one rate-limit pool, one provider on-call. The seductive default is “use a strong model from the vendor we trust” without specifying the family discipline. Source for the inflation magnitude: arXiv 2410.21819 [ai-gs2026]. The mechanism is silent self-preference — the judge family rewards the surface patterns its own training distribution favors, and the agent family produces exactly those patterns.
Cheap fix. Enforce cross-family judging at the routing layer, not at the gate layer [ai-gs2026]. The task_class=judge routing rule on I12 ModelRouter MUST select a family different from the agent’s current binding; I5 EvalGate verifies the family at every run as INV-3 [ai-gs2026]. Routing-layer enforcement is what makes the rule survive model swaps — when the team changes the agent’s model next quarter, the judge selection re-routes automatically rather than silently collapsing into the new family.
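At the routing layer the rule is a few lines plus an assertion at the gate. The model and family names below are placeholders; the invariant is the point.

```python
FAMILY = {  # placeholder bindings; populate from your model registry
    "agent-model": "family-a",
    "judge-candidate-1": "family-b",
    "judge-candidate-2": "family-c",
}

def route_judge(agent_model: str) -> str:
    """I12 rule: task_class=judge MUST resolve off the agent's family.
    Runs on every routing decision, so a model swap re-routes the judge."""
    agent_family = FAMILY[agent_model]
    for model, family in FAMILY.items():
        if family != agent_family:
            return model
    raise RuntimeError("no off-family judge configured")

def verify_inv3(agent_model: str, judge_model: str) -> None:
    """I5 EvalGate check, enforced on every eval run, not once at setup."""
    if FAMILY[judge_model] == FAMILY[agent_model]:
        raise AssertionError("INV-3 violated: judge shares the agent's family")
```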
Takeaway: Same-family judging inflates 5–10pp via silent self-preference (arXiv 2410.21819). The fix is a routing rule, not a guideline — task_class=judge routes off-family at I12, verified by I5 INV-3.
Pitfall 4 — Stat-Power on the Regression Gate
Symptom. The promotion gate blocks roughly 30% of neutral PRs — PRs that did not change the agent’s behavior in any meaningful way [ai-gs2026]. Engineers learn to ignore the gate, push through with overrides, and the gate’s signal value collapses. The pattern repeats on every regression gate stood up with the wrong statistical-power configuration.
How teams hit it. The seductive default is N=50 evaluation runs with a 3-percentage-point pass-rate threshold. On paper it looks like a reasonable balance — fifty runs is not cheap, three points is not tight. In practice, with normal eval-run variance, N=50 + 3pp flakes about 30% of the time on PRs that are statistically neutral [ai-gs2026]. The gate is mathematically under-powered, not philosophically wrong.
Cheap fix. Rolling N=150 unless a release tag forces a smaller sample, plus an override path with an audit row for the cases where N=150 is not affordable [ai-gs2026]. The larger rolling sample stabilizes the signal-to-noise ratio without demanding 150 fresh runs on every PR; the audit row keeps the override path observable rather than silent. The chapter on numbers (Ch10) makes the same statistical point about all bench claims — the right sample size depends on the variance you are measuring, and N=50 is too small for the variance most eval suites produce.
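The under-powering is checkable on a napkin. Assuming eval runs behave like independent pass/fail trials at a true pass rate near 90% (an idealization; real suites are noisier, which makes the gate flakier, not less), the arithmetic lands on the cited flake rate.

```python
from math import sqrt
from statistics import NormalDist

def flake_rate(n: int, p: float = 0.90, threshold: float = 0.03) -> float:
    """P(a neutral PR's measured pass rate lands more than `threshold`
    below baseline), comparing one n-run sample against an equally
    noisy n-run baseline."""
    se_diff = sqrt(2 * p * (1 - p) / n)   # standard error of the difference
    return NormalDist().cdf(-threshold / se_diff)

print(f"N=50:  {flake_rate(50):.0%}")   # ~31% -- the cited ~30% flake rate
print(f"N=150: {flake_rate(150):.0%}")  # ~19% on the same naive comparison;
# a rolling baseline built from many runs shrinks the baseline-noise
# term and tightens this further.
```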
Takeaway: N=50 + 3pp flakes 30% of neutral PRs. Rolling N=150 + release-tag override + audit row stabilizes the signal. The default is statistically under-powered, not philosophically wrong.
Pitfall 5 — Postgres-as-Everything Makes “Swap” Aspirational
Symptom. The team needs to swap one piece of state — checkpointer storage to DynamoDB for cross-region failover, HITL state to a managed queue, tool_call_results to Redis — and the “easy swap” of a single interface becomes a multi-week refactor that touches half the codebase [ai-gs2026]. The swap claim from the architecture doc turns out to be aspirational.
How teams hit it. The POC default is one Postgres for everything — I2 Checkpointer, I6 HITLBroker, tool_call_results, I8 identity tokens [ai-gs2026]. Sharing a database is cheap to operate, cheap to back up, cheap to migrate as a unit. The trap is the foreign keys that grow between tables that belong to logically distinct interfaces. By the time the swap is needed, the Checkpointer table has FKs into the HITL table and the tool_call_results table has FKs back into Checkpointer rows. The swap-by-changing-connection-string assumption that justified the single-DB default is no longer true.
Cheap fix. Single Postgres by default, but with a no-cross-FK rule across interfaces and a CI test that enforces it [ai-gs2026]. The CI test confirms that changing the connection string and pointing one interface’s tables at a different database is green — the swap is executable, not just hypothetical. The discipline tax is paid up front, in W1, when nobody yet feels the pain. The receipt is the day the swap is actually needed and lands in an afternoon instead of a sprint.
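A sketch of the CI enforcement, assuming each interface owns its own Postgres schema (checkpointer, hitl, tool_cache, identity are illustrative names); the query walks information_schema for any foreign key that crosses a schema boundary.

```python
import psycopg  # psycopg 3

CROSS_FK_QUERY = """
SELECT tc.table_schema, tc.table_name, ccu.table_schema AS referenced_schema
FROM information_schema.table_constraints AS tc
JOIN information_schema.constraint_column_usage AS ccu
  ON ccu.constraint_name = tc.constraint_name
 AND ccu.constraint_schema = tc.constraint_schema
WHERE tc.constraint_type = 'FOREIGN KEY'
  AND tc.table_schema <> ccu.table_schema;
"""

def test_no_cross_interface_fks(dsn: str) -> None:
    """CI gate: an FK reaching across interface schemas fails the build
    before it quietly turns the swap claim aspirational."""
    with psycopg.connect(dsn) as conn:
        offenders = conn.execute(CROSS_FK_QUERY).fetchall()
    assert not offenders, f"cross-interface foreign keys: {offenders}"
```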
Takeaway: One Postgres is fine; cross-FKs across interfaces are not. The default of “we’ll worry about the swap later” is the trap; no-cross-FK plus a CI test confirming split-by-connection-string is the plumbing.
Pitfall 6 — Self-Host Firecracker as DIY
Symptom. A team scopes “we’ll just self-host Firecracker for the sandbox layer” into a sprint, the sprint slips, the security-review concerns surface, and twelve to eighteen months later the team has a Firecracker deployment that is not yet HIPAA-compliant, not yet SOC2-attested, and not yet cleared for the EU residency requirement that triggered the original work [ai-gs2026].
How teams hit it. The default-that-fails-late is “Firecracker is open source, the AWS team published the design, it looks tractable.” The Northflank, Manus, and microvm-2026 survey synthesized in the gold-seams source-map clocks the actual scope at 12–18 months plus a dedicated security hire [ai-gs2026]. The visible part of Firecracker is the microVM runtime; the invisible part is the orchestrator, the network isolation, the snapshot management, the security-hardening cycles, and the compliance posture that the managed services have already paid for.
Cheap fix. Reject self-host Firecracker for the POC; pre-stage Tensorlake as the swap candidate for the moment the first HIPAA workflow appears [ai-gs2026]. Tensorlake ships managed Firecracker with HIPAA, SOC2, and EU residency in the box. The D3 switch trigger in the build-your-own playbook (Ch11) names exactly this transition. The class fix is to document the rejection in the decisions doc — explicit “we rejected this and here is why” — so the next engineer who finds it “tractable” reads the rejection before opening the proposal.
Takeaway: Self-host Firecracker is 12–18 months plus a security hire. Reject it for the POC; pre-stage Tensorlake for the HIPAA trigger. The default of “it looks tractable” is the trap; documented rejection plus a switch trigger is the plumbing.
Pitfall 7 — Vercel Sandbox iad1 Residency Landmine
Symptom. A team picks Vercel Sandbox for the ephemeral lane, ships the POC, signs an EU customer six months later, and discovers in compliance review that Vercel Sandbox is iad1-only — US-East — with no EU residency option [ai-gs2026]. The “easy” sandbox pick becomes a regional rebuild on a tight customer timeline.
How teams hit it. The constraint is documented but easy to miss. The default is to pick the sandbox vendor one click away from the existing Vercel deployment — same dashboard, same billing, same on-call. The iad1-only note sits in the vendor docs, far from where a compliance-stage reviewer would look for it. The pitfall lands silently; the cost lands at the customer conversation.
Cheap fix. Reject Vercel Sandbox at the sandbox-pick stage with an explicit call-out in the decisions doc; pre-stage Tensorlake (EU residency in box) as the swap candidate [ai-gs2026]. The build-your-own decisions table (Ch11) lists this rejection as part of D3 sandbox call-outs. The class fix is to treat residency as a first-class column in the sandbox-vendor matrix from W0 onward, rather than a footnote discovered at the first EU customer.
Takeaway: Vercel Sandbox is iad1-only. The default of “we’ll handle residency later” is the trap; residency-as-first-class-column plus pre-staged Tensorlake is the plumbing.
Pitfall 8 — Cache-Break Drift Bleeds 50–70K Tokens Silently
Symptom. The input-token bill creeps up over a few sprints with no obvious cause. No new feature shipped that should have moved the bill; no model changed. A careful read of the per-conversation traces shows the system prompt being re-written rather than served from cache on a growing percentage of long-running conversations. Each accidental cache break costs roughly 50–70K tokens of prefix recomputation [cci2026-gems, §2].
How teams hit it. The day-one default is to treat the prompt cache as a billing optimization rather than as architecture. A “minor” system-prompt edit lands before the boundary marker. A timestamp is computed per turn instead of memoized at process start. A tool prompt is tweaked when an engineer notices an improvement, rather than batched on a known cadence. Each edit is invisible at the patch level; the bill is the lagging indicator. The cache-write tier costs more than the cache-read tier on the same prefix, so every accidental break pays the write premium and loses the read discount on the same 50–70K tokens (see Ch07 for the pricing-tier breakdown).
Cheap fix. Four pieces of plumbing land together — same shape as Ch07 [cci2026-gems, §2]. A SYSTEM_PROMPT_DYNAMIC_BOUNDARY marker in the system prompt source splits the cacheable prefix from the volatile tail. A DANGEROUS_uncachedSystemPromptSection(_reason) helper forces a written justification on any per-turn-recomputed section. getSessionStartDate = memoize(getLocalISODate) captures the date once at process start so it cannot drift. A tengu_prompt_cache_break telemetry channel tracks per-tool schema hashes, beta headers, and fast-mode state so the team sees the regression on the day it lands. See Ch07 — Prompt Cache Is Architecture for the full walk-through.
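The source is TypeScript; here is the same four-guardrail shape sketched in Python, keeping the source's names with illustrative bodies.

```python
import hashlib
from datetime import date
from functools import cache

SYSTEM_PROMPT_DYNAMIC_BOUNDARY = "<<<dynamic-section-below>>>"  # cacheable prefix ends here

def DANGEROUS_uncachedSystemPromptSection(text: str, *, _reason: str) -> str:
    """Hostile by design: anything routed through here recomputes per turn,
    and the mandatory _reason forces a written justification at the call site."""
    return text

# Memoized at first call: a stale date after midnight beats a
# whole-conversation cache bust ("stale wins").
@cache
def get_session_start_date() -> str:
    return date.today().isoformat()

_last_prefix_hash: str | None = None

def detect_cache_break(stable_prefix: str) -> None:
    """Stand-in for the tengu_prompt_cache_break channel: fire telemetry the
    turn the cacheable prefix changes, not the month the bill arrives."""
    global _last_prefix_hash
    h = hashlib.sha256(stable_prefix.encode()).hexdigest()
    if _last_prefix_hash is not None and h != _last_prefix_hash:
        print(f"telemetry: prompt_cache_break prefix_hash={h[:12]}")
    _last_prefix_hash = h

def build_system_prompt(stable_prefix: str, volatile_tail: str) -> str:
    detect_cache_break(stable_prefix)
    return "\n".join([
        stable_prefix,                      # served from the prompt cache
        SYSTEM_PROMPT_DYNAMIC_BOUNDARY,
        DANGEROUS_uncachedSystemPromptSection(
            f"session started {get_session_start_date()}\n{volatile_tail}",
            _reason="per-turn context the model must see fresh"),
    ])
```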
Takeaway: Cache stability is architecture, not a runtime knob. Boundary marker, hostile helper, memoized date, break telemetry. The default of “we’ll watch the bill” is the trap; four pieces of plumbing in the source is the fix.
Pitfall 9 — Context Stuffing Re-Invents Eager Loading
Symptom. The team adds skills to the harness because the architecture promised lazy loading; the receipt does not move; the agent’s pass rate stays flat or regresses. A look at the actual system prompt in production shows every skill body eagerly loaded into context — not just the descriptions [lch-skills2026].
How teams hit it. The seductive default is to load every skill body up front because the model “might need it.” The intuition is correct for a single skill — having the body in context means no second turn — and wrong for a catalog. The 29-to-95 receipt on the LangChain Skills write-up is exactly the swap from eager-load to lazy-load: same Claude Code, same task suite, same instructions, only the retrieval architecture changed [lch-skills2026]. Re-introducing eager loading at the skills layer re-invents the tax at the next layer up.
Cheap fix. Descriptions in context, bodies on disk, the harness gates retrieval [lch-skills2026]. The harness reads the description shelf, the model picks a skill by name, the harness reads the body and places it in the next context window only when invoked. Working-set size — not catalog size — is the metric. Cap the description shelf below the knee where natural-language retrieval starts surface-matching rather than action-matching. See Ch06 — Skills as Information Architecture for the full retrieval mechanic.
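A sketch of the shelf-and-gate mechanic, assuming a skills/<name>/SKILL.md layout whose first line is the one-line description. The layout is illustrative, not the LangChain Skills API.

```python
from pathlib import Path

SKILLS_DIR = Path("skills")  # skills/<name>/SKILL.md, first line = description

def description_shelf() -> str:
    """The only skill text that enters the system prompt: name + one line."""
    lines = []
    for skill in sorted(SKILLS_DIR.iterdir()):
        if skill.is_dir():
            first_line = (skill / "SKILL.md").read_text().splitlines()[0]
            lines.append(f"- {skill.name}: {first_line}")
    return "\n".join(lines)

def load_skill_body(name: str) -> str:
    """Harness-gated: the body lands in the next context window only after
    the model picks the skill by name -- working set, not catalog."""
    return (SKILLS_DIR / name / "SKILL.md").read_text()
```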
Takeaway: Eager-loading every skill body re-invents the tax at the next layer. The fix is description-in-context, body-on-disk, harness-as-gate — the same shape as the cache prefix, one layer up.
Pitfall 10 — Max Reasoning Everywhere
Symptom. The bill scales with the number of agent calls; the eval scores do not move. The team is paying the highest reasoning tier — extended-thinking, max tokens, full reasoning budget — on every task class, including the cheap classifications and the simple lookups that do not benefit from the extra compute [ai-gs2026].
How teams hit it. The seductive default is “always use the strongest model and the longest reasoning budget so quality cannot be the problem.” The framing assumes reasoning compute is a free quality lever. In practice, reasoning is a class-of-task decision — escalation pays off on the hard tasks and is a tax on the easy ones [ai-gs2026]. A classification task that a smaller model gets right at one-tenth the cost is wasted spend at the max-reasoning tier; a code-generation task that requires multi-step planning genuinely benefits from the extra compute.
Cheap fix. Route by task_class at the I12 ModelRouter; cheap_classify tasks get the cheap model with a small budget; hard task classes escalate [ai-gs2026]. The build-your-own decisions D6 (per-workflow LLM-call cap) and D10 (open-weight share threshold for cheap_classify) name the operational surface for this routing [ai-gs2026]. The class fix is that escalation is a routing rule, not a default — the routing decision is observable in trace data, the cost-per-task-class is attributable through I11, and the override path is auditable.
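A sketch of the routing table. Task classes, model names, and budgets are placeholders; the design point is that the max-reasoning tier is reachable only through an explicit route, never as a silent fallback.

```python
ROUTES = {  # placeholder models and budgets; wire to your model registry
    "cheap_classify": {"model": "small-open-weight", "reasoning_tokens": 0},
    "lookup":         {"model": "mid-tier",          "reasoning_tokens": 0},
    "codegen":        {"model": "frontier",          "reasoning_tokens": 8192},
}

def route(task_class: str) -> dict:
    """I12 ModelRouter surface: escalation is a routing rule, not a default.
    An unrecognized class is a routing bug to fix or an override to audit --
    never a silent landing on the max tier."""
    if task_class not in ROUTES:
        raise KeyError(f"no route for task_class={task_class!r}")
    return ROUTES[task_class]
```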
Takeaway: Max reasoning everywhere is wasted spend on easy task classes. Route by task_class at I12; let cheap classify get the cheap model; escalate only the hard classes. Reasoning is a class-of-task decision, not a default.
Do This, Not That
| Pitfall | Naive | Correct | Why |
|---|---|---|---|
| LangGraph checkpointer migration | One graph_id forever; upgrade in place | Versioned graph IDs (name@semver) + blue/green deploy + CI replay test on the last 50 production checkpoints | Issue #536: no built-in migration tool; serialization shifts between minor versions silently break replay [ai-gs2026] |
| Tool replay on resume | Every tool treated as pure; one-off retry-guards when a side effect duplicates | replay_class declared at registration (pure / idempotent_with_key / unsafe_on_replay); tool_call_results cache consulted before every replayed call | Class-not-instance: classification-plus-cache retires every future duplicate, not just the one that fired today [ai-gs2026] |
| LLM judge for the eval gate | Same family as the agent (Sonnet judging Sonnet, GPT-4 judging GPT-4o) | Cross-family judge at the I12 routing rule: task_class=judge selects a different family from the agent’s binding; I5 INV-3 verifies | Same-family judging inflates pass rate 5–10pp via silent self-preference (arXiv 2410.21819); routing-layer enforcement survives model swaps [ai-gs2026] |
| Regression-gate sample size | N=50 + 3pp threshold | Rolling N=150 unless release tag; override path with audit row | N=50 + 3pp flakes ~30% on neutral PRs; under-powered statistically, not philosophically [ai-gs2026] |
| Postgres-as-everything | One DB; foreign keys across I2 / I6 / tool_call_results / I8 | One DB by default (POC); no cross-interface FKs; CI test confirms split-by-changing-connection-string is 🟢 | Cross-FKs turn the swap claim aspirational; the no-cross-FK rule is what keeps D8 actually executable [ai-gs2026] |
| Sandbox for HIPAA workflows | Self-host Firecracker because it “looks tractable” | Tensorlake (managed Firecracker with HIPAA + SOC2 + EU residency in box) | Self-host Firecracker is 12–18 months + a security hire per the vendor survey [ai-gs2026] |
| EU residency for sandboxes | Vercel Sandbox because it is one-click adjacent to the existing Vercel deploy | Reject Vercel Sandbox at the sandbox-pick stage; pre-stage Tensorlake (EU residency in box) | Vercel Sandbox is iad1-only and the constraint is documented but easy to miss [ai-gs2026] |
| System-prompt cache stability | Treat the cache as a billing optimization | SYSTEM_PROMPT_DYNAMIC_BOUNDARY marker + DANGEROUS_uncachedSystemPromptSection(_reason) helper + memoize(getLocalISODate) + tengu_prompt_cache_break telemetry | Each accidental break costs ~50–70K tokens at the write-premium tier; the bill is the lagging signal, the code is the leading one [cci2026-gems, §2] |
| Skill loading | Eager-load every skill body into the system prompt | Descriptions in context, bodies on disk, harness gates retrieval; cap shelf size below the natural-language-retrieval knee | The 29-to-95 receipt is exactly this swap; eager loading re-invents the tax one layer up [lch-skills2026] |
| Reasoning budget | Max reasoning on every task | Route by task_class at I12; cheap_classify gets the cheap model; escalate only the hard classes | Reasoning is a class-of-task decision; the bill scales with calls but the signal does not [ai-gs2026] |
Takeaway: One row per pitfall; one piece of plumbing per row. The default in column two is the trap; the plumbing in column three retires the class. Read the “Why” column when reviewing a PR that touches any of these surfaces.
What the Pitfalls Teach About the Series
The ten pitfalls map back to the architectural chapters with no slack. Checkpointer migration and tool replay are the operational shapes of replay safety (Ch05) — the chapter names replay_class and tool_call_results as the load-bearing primitives, and the pitfalls show what happens when those primitives are missing. Cache-break drift is the operational shape of prompt-cache-as-architecture (Ch07) — the four guardrails in that chapter are the plumbing that retires the class. Context stuffing is the operational shape of skills-as-information-architecture (Ch06) — descriptions in context, bodies on disk, harness as gate. Same-family judging and max-reasoning-everywhere are routing decisions at I12 ModelRouter, named in the build-your-own playbook (Ch11). Stat-power on the gate is a numerical-rigor decision named in the numbers chapter (Ch10). Postgres-as-everything, Firecracker DIY, and Vercel iad1 are the operational shapes of the build-vs-buy framing (Ch11) — own the interface, rent the backend, document the rejection.
The series finale shape is deliberate. The pitfalls are the negative space of the architecture — the moments where a default that looks fine on day one breaks late enough to feel like an unforced error. Each pitfall has a chapter that named the invariant; each fix is the plumbing the chapter prescribed; each receipt is the consequence of skipping the plumbing. Read the ten pitfalls as a final pre-flight on the architecture — if any one of them is on the open list, the corresponding chapter has a plumbing decision you have not yet shipped.
Takeaway: Ten pitfalls; ten plumbing decisions; ten chapter cross-references. The series ships with the architecture in the chapters and the negative space here. Read this chapter as a pre-flight on every chapter that preceded it.
References
- [ai-gs2026] `tacit-web/research/agent-infra/03-gold-seams.md` — Gold-seams source map for production-agent harness patterns, 2026-04-27, §“Must Avoid” (the 7 pitfalls). Primary source for: LangGraph checkpointer schema migration kills in-flight runs (issue #536, no built-in migration tool, versioned graph IDs + blue/green + CI checkpoint-replay test mitigation); tool replay duplicates side effects (`replay_class` taxonomy + `tool_call_results` cache + `unsafe_on_replay` throws hard); self-host Firecracker as 12–18 months plus a security hire per the Northflank/Manus/microvm-2026 survey, mitigated by Tensorlake managed Firecracker with HIPAA + SOC2 + EU residency; Vercel Sandbox iad1-only EU residency landmine; same-family LLM judge inflates pass rate 5–10pp (arXiv 2410.21819) mitigated by cross-family mandate, I5 INV-3, and I12 routing rule; regression-gate stat-power on N=50 + 3pp flakes ~30% on neutral PRs, mitigated by rolling N=150 unless release tag plus audit row; Postgres-as-everything cross-FK trap, mitigated by no-cross-FK rule and CI test confirming split-by-changing-connection-string is 🟢. Also primary source for: the I12 ModelRouter `task_class=judge` routing-rule enforcement of the cross-family judge invariant; the D6 per-workflow LLM-call cap and D10 open-weight share threshold that frame reasoning as a class-of-task decision; the build-vs-buy “own interfaces, rent backends” framing that anchors the Tensorlake / managed-Firecracker conclusion.
- [cci2026-gems] `tacit-web/research/cc-internals/src-analysis-07-hidden-gems.md`, §2 “PROMPT CACHE STABILITY — Obsessive Engineering,” 2026-04-01. Direct source analysis of `/Users/ketankhairnar/Downloads/claude-code-src/`. Files: `constants/prompts.ts`, `constants/systemPromptSections.ts`, `services/api/promptCacheBreakDetection.ts`. Primary source for: ~50–70K wasted tokens per cache break; `SYSTEM_PROMPT_DYNAMIC_BOUNDARY` marker with “do not remove or reorder without updating cache logic” warning; `DANGEROUS_uncachedSystemPromptSection(_reason)` helper requiring a written justification; `getSessionStartDate = memoize(getLocalISODate)` with the “stale date after midnight vs ~entire-conversation cache bust — stale wins” rationale; monthly granularity in tool prompts; `tengu_prompt_cache_break` telemetry tracking per-tool schema hashes, beta headers, and fast-mode state. Cited inline in Pitfall 8 (cache-break drift) and the corresponding Do-this-not-that row; forwards readers to Ch07 for the full architecture.
- [lch-skills2026] LangChain, “Skills,” March 2026. https://blog.langchain.com/langchain-skills/ — Claude Code task pass rate 29% → 95% with progressive-disclosure Skills loaded; same model, same task suite, same instructions across both runs; the only variable is whether skill bodies were eagerly loaded into the system prompt or kept on disk with descriptions surfaced in context. Cited inline in Pitfall 9 (context stuffing) and the corresponding Do-this-not-that row; forwards readers to Ch06 for the full retrieval mechanic.
One question for the reader: for each pitfall, can you point to three things in your harness? (a) The line where the default is set; (b) the plumbing that retires the class, not the instance; (c) the telemetry or CI test that catches the regression the day it lands, not the day the bill arrives. If any of the three is missing for any pitfall, the bill is the lagging signal you are still relying on. The plumbing is the leading one, and the cost of installing it is an afternoon per row.
Series finale: This is the closing chapter of the Harness Engineering deep dive. The architecture lives in chapters 01–11; this chapter is the negative-space pre-flight.