Prerequisite: Part 11 of the Harness Engineering deep dive. Operator companion to the architecture chapters (Ch05, Ch07, Ch08, Ch09).
Interfaces are commitments. Backends are choices. Backends change; interfaces don't.
Why This Matters
Most public writing on “build your own agent stack” treats the problem as a stack-pick — which sandbox, which runtime, which observability tool — and produces a comparison matrix that goes stale six weeks after publication. That framing is wrong in a specific, operationally costly way. The agent-infra problem is an interface-design problem [ai-gs2026]. The right foundation owns twelve stable interfaces (I1–I12), rents commodity backends, and lets each backend be swapped within its known difficulty rating without rewriting any agent code. The picks are proven defaults, not commitments.
The chapters before this one established the architectural invariants. Replay safety (Ch05) is the mechanism behind “stop redundant work” — without replay_class enforcement and a tool_call_results cache, every HITL pause re-burns LLM tokens and duplicates side effects [ai-gs2026]. Cache stability (Ch07) is the prefix-side invariant that lets context engineering work at all. Session memory (Ch08) is the loop that turns each session into a deposit. The org-harness thesis (Ch09) names execution-as-moat. This chapter is the operator instantiation — the six-week plan that puts those invariants in place for a three-person team and proves the foundation works by swapping a backend with zero agent-code changes on day six [ai-gs2026].
The source synthesis is direct: “Daytona+E2B for sandboxes, LangGraph+Postgres for runtime, Phoenix for obs, GitHub Actions+Phoenix Evals for promotion, Postgres-LISTEN for HITL, Anthropic-only ModelRouter for v1 with empty-diff swap to multi-provider in W6” [ai-gs2026]. Five concepts, seven pitfalls, ten decisions with triggers, six weeks plus W0 prep. The differentiator over every other “build your own” piece on the open web is that this one names the interface contracts the foundation owns, not the backend vendors of the month.
Takeaway: Treat agent infra as interface design, not stack-picking. Own twelve interfaces; rent commodity backends; defaults are not commitments. The receipt that the foundation works is an empty-diff backend swap in W6 [ai-gs2026].
Foundation = Interfaces, Not a Stack
The north-star phrase from the source is “proven, predictable tech for solid foundation we can experiment a lot on” [ai-gs2026]. Solid foundation means the interfaces do not change when the team experiments. Experiment a lot on means the backends are expected to move — for cost, for compliance, or because a better option shipped two months after the POC. If picks are commitments, experimentation breaks the foundation. If interfaces are stable, experiments stay where they belong: in the backend behind each interface.
Build-vs-buy reduces to one rule: own the interfaces, rent the backends [ai-gs2026]. A three-person platform team has no business writing a Firecracker manager, a checkpointer migration framework, or an OTel collector. They have every business writing the ToolRunner interface that routes by persistence_class, the EvalGate interface that enforces a cross-family judge, and the ModelRouter interface that decides per call which provider gets the request. The vendor under each interface is interchangeable. The interface is not.
This is the same architectural shape Ch07 named for the prompt prefix — a boundary marker in code that splits the stable cacheable part from the volatile tail, enforced by naming rather than by convention. Here the boundary is at the package level: packages/foundation/<interface>/ owns the contract and tests; packages/foundation/<interface>/backends/<vendor>/ owns the swap-target. A W1 CI test grep-asserts that no agent code imports a provider SDK outside the backends directory [ai-gs2026]. Convention rots; package boundaries plus a grep test do not.
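A minimal sketch of that W1 CI check follows. The SDK package names and the in-memory file map are illustrative assumptions; a real version would glob the repo from disk, and the exact banned-import list comes from the decisions doc, not from this sketch.

```typescript
// Sketch of the INV-1 boundary check: fail CI if any file outside a
// backends/ directory imports a provider SDK directly.
// BANNED_SDKS and the file layout are illustrative assumptions.
const BANNED_SDKS = ["@anthropic-ai/sdk", "@e2b/sdk", "@daytonaio/sdk"];

function findViolations(files: Record<string, string>): string[] {
  const violations: string[] = [];
  for (const [path, source] of Object.entries(files)) {
    // Vendor SDK surface area is allowed in exactly one place per interface.
    if (path.includes("/backends/")) continue;
    for (const sdk of BANNED_SDKS) {
      if (source.includes(`from "${sdk}"`) || source.includes(`require("${sdk}")`)) {
        violations.push(`${path} imports ${sdk}`);
      }
    }
  }
  return violations;
}

// Example: one compliant backend file, one leak in agent code.
const result = findViolations({
  "packages/foundation/tool_runner/backends/e2b/client.ts":
    'import { Sandbox } from "@e2b/sdk";', // allowed: lives under backends/
  "packages/workflows/support_ticket/graph.ts":
    'import { Sandbox } from "@e2b/sdk";', // leak: agent code touching a vendor SDK
});
```

In CI this runs on every PR, so the boundary is enforced mechanically rather than by review discipline.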
Takeaway: Foundation = interfaces + invariants + tests; backends = swappable picks. The boundary is enforced at the package level by a grep test, not by a style guide. Defaults are picks for now with a documented switch trigger, not commitments.
The 12 Interfaces (I1–I12)
The foundation owns twelve interfaces. Five are the load-bearing primitives the agent loop calls directly; five are cross-cutting concerns that wrap or guard the primitives; two are runtime substrate the agent code consumes indirectly [ai-gs2026]. The first table below lists the five primitives together with the two substrate interfaces (I2, I3); the cross-cutting five follow.
| # | Interface | One-line role | Default backend |
|---|---|---|---|
| I1 | ToolRunner | Routes tool calls by persistence_class to sandbox lane | Daytona (persistent) + E2B (ephemeral) |
| I2 | Checkpointer | LangGraph state persistence; versioned graph IDs | LangGraph + Postgres |
| I3 | WorkflowEngine | Durable execution substrate; engaged via D4 trigger | None (POC); Temporal when triggered |
| I4 | TraceSink | OTel GenAI span emission; replay portability via S3/R2 dump | Phoenix (Arize) |
| I5 | EvalGate | Promotion gate; cross-family judge invariant (INV-3) | Phoenix Evals |
| I6 | HITLBroker | Pause / resume / expire / escalate with audit | Postgres-LISTEN |
| I12 | ModelRouter | Per-call provider routing by task_class + sensitivity | Anthropic-only B1 (POC) |
The remaining five are cross-cutting — they guard or annotate every call through the primitives above [ai-gs2026]:
| # | Interface | Cross-cuts | Default backend |
|---|---|---|---|
| I7 | SecretsProvider | Consumed by I1 (tool keys) + I12 (provider keys) | CF env / .env (POC) |
| I8 | IdentityProvider | Caller / agent identity; consumed by I9 | Custom JWT (POC) |
| I9 | PolicyGate | Guards every I1 ToolRunner + I12 ModelRouter call | Custom YAML rules |
| I10 | RateLimiter + CircuitBreaker | Wraps every external call; recursive-cost-runaway guard | Redis |
| I11 | CostAttributor | Tags every I4 TraceSink span; consumes I12 estimated cost | OpenLLMetry cost calc |
Two structural facts about the list. First, the interfaces are not peers — they form a DAG [ai-gs2026]. I8 IdentityProvider feeds I9 PolicyGate, which guards I1 and I12. I7 SecretsProvider is consumed by both I1 and I12. I11 CostAttributor depends on I4 TraceSink and I12 ModelRouter. The DAG matters operationally because the dependency edges tell you the order packages have to be scaffolded in W1 — secrets and identity before the gates that consume them, gates before the runners they guard. Second, I12 ModelRouter was added in v3.2 of the source plan in response to a direct signal: “open-weight models progressing, need router + common access; route by cost/complexity” [ai-gs2026]. The cross-family judge invariant for I5 EvalGate is enforced at the routing layer, not at the gate layer — task_class=judge is a routing rule on I12 [ai-gs2026].
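The scaffolding-order consequence of the DAG can be made mechanical. The sketch below encodes the dependency edges named above and derives a valid W1 ordering with a plain Kahn topological sort; the edge list follows the text, while the node naming is an illustrative assumption.

```typescript
// Edge [a, b] means "a must be scaffolded before b".
const edges: [string, string][] = [
  ["I8_IdentityProvider", "I9_PolicyGate"],   // identity feeds the policy gate
  ["I9_PolicyGate", "I1_ToolRunner"],         // gate guards the tool runner
  ["I9_PolicyGate", "I12_ModelRouter"],       // gate guards the model router
  ["I7_SecretsProvider", "I1_ToolRunner"],    // tool keys
  ["I7_SecretsProvider", "I12_ModelRouter"],  // provider keys
  ["I4_TraceSink", "I11_CostAttributor"],     // cost tags ride on spans
  ["I12_ModelRouter", "I11_CostAttributor"],  // estimated cost comes from routing
];

function scaffoldOrder(edges: [string, string][]): string[] {
  const nodes = new Set(edges.flat());
  const indegree = new Map<string, number>([...nodes].map((n) => [n, 0]));
  for (const [, to] of edges) indegree.set(to, indegree.get(to)! + 1);
  const queue = [...nodes].filter((n) => indegree.get(n) === 0).sort();
  const order: string[] = [];
  while (queue.length) {
    const n = queue.shift()!;
    order.push(n);
    for (const [from, to] of edges) {
      if (from !== n) continue;
      indegree.set(to, indegree.get(to)! - 1);
      if (indegree.get(to) === 0) queue.push(to);
    }
  }
  return order; // secrets and identity first, gates next, runners and attribution last
}
```

Any topological order is acceptable; the point is that the W1 sequencing falls out of the edges rather than out of taste.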
```
cross-cutting (guard / annotate)
────────────────────────────────
I8 IdentityProvider ──▶ I9 PolicyGate ──┐
I7 SecretsProvider ─────────────────────┤
I10 RateLimiter + CircuitBreaker ───────┤
                                        ▼
primitives (called by agent code)     ┌──────────────────────┐
─────────────────────────────────     │ every external call  │
I1 ToolRunner   ◀── routes by         │ guarded + budgeted   │
                    persistence_class └──────────────────────┘
I2 Checkpointer ◀── LangGraph + Postgres
I6 HITLBroker   ◀── pause / resume / expire / escalate
I12 ModelRouter ◀── task_class + sensitivity routing

observability + promotion
─────────────────────────
I4 TraceSink ◀── tagged by ── I11 CostAttributor
I5 EvalGate  ◀── consumes ─── I4 spans; INV-3 cross-family judge
I3 WorkflowEngine (OPTIONAL) ◀── engaged when D4 trigger fires

Rule: no cross-interface foreign keys. CI test enforces the split by
changing the connection string and confirming 🟢.
```
Takeaway: Twelve interfaces in a DAG, not a flat list. Five primitives the agent calls; five cross-cutting guards; two runtime substrates. Dependency edges dictate W1 scaffolding order. The cross-family judge invariant lives at I12, not I5, because routing is where families are decided.
Persistent + Ephemeral Sandbox Lanes Are Co-Equal
A real production agent has mixed workloads. Long-running coding agents need filesystem state, package caches, and process continuity across multi-step tasks — a persistent lane. Sub-second tool calls — a regex search, a unit conversion, a one-shot SQL query — want a cheap, fast, throwaway sandbox per call — an ephemeral lane. The mistake naive harnesses make is choosing one architecture and forcing both workloads through it. Persistent VMs for sub-second calls burn money on warm-up; ephemeral sandboxes for stateful coding agents discard the cache that made the next step cheap [ai-gs2026].
The foundation routes by persistence_class per tool, not by architecture commitment [ai-gs2026]. Every tool registered in the runtime declares one of three values: ephemeral, persistent_session (state across calls within a workflow), or persistent_workspace (state across workflows). The I1 ToolRunner reads this declaration and dispatches to the matching backend lane. Same interface, different backend per call. The decision is per-tool, not per-stack [ai-gs2026].
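A minimal sketch of the lane-routing rule, assuming the D3 defaults. The ToolSpec shape and lane names are illustrative; the source defines the I1 contract, not this exact signature.

```typescript
// Each registered tool declares a persistence_class; the I1 ToolRunner
// picks a sandbox lane per call from that declaration alone.
type PersistenceClass = "ephemeral" | "persistent_session" | "persistent_workspace";

interface ToolSpec {
  name: string;
  persistence_class: PersistenceClass;
}

// Lane selection is a pure function of the tool declaration,
// not a stack-wide architecture commitment.
function selectLane(tool: ToolSpec): "e2b_ephemeral" | "daytona_persistent" {
  return tool.persistence_class === "ephemeral" ? "e2b_ephemeral" : "daytona_persistent";
}

// A sub-second call and a stateful coding agent route differently
// through the same interface.
const regexSearch: ToolSpec = { name: "regex_search", persistence_class: "ephemeral" };
const codingAgent: ToolSpec = { name: "repo_edit", persistence_class: "persistent_workspace" };
```

Swapping a lane backend (say Daytona for OpenComputer) changes only what sits behind the lane name, never this routing code.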
The defaults follow from D3: Daytona for the persistent lane, E2B for the ephemeral lane [ai-gs2026]. Two named switch triggers move the defaults: OpenComputer for teams with an Apache-2.0 licensing constraint, and Tensorlake when the first HIPAA workflow appears — because Tensorlake ships managed Firecracker with HIPAA, SOC2, and EU residency in the box [ai-gs2026]. Confirm vendor license terms against the providers’ own documentation before adoption; the gold-seams source-map names the trigger conditions, not the current license strings.
D7 is the corollary — what persistence_class does a new tool get by default? The source picks ephemeral as the cheaper, safer default and switches to persistent_session only when ≥50% of new tools end up needing cross-call state in practice [ai-gs2026]. Same discipline the prior chapters named at different layers: prove you need the more expensive primitive before you pay for it.
Takeaway: Two co-equal lanes, routed by per-tool declaration, not by architecture commitment. Daytona+E2B by default; OpenComputer for Apache-2.0 purity; Tensorlake when HIPAA shows up. Default new tools to ephemeral; switch threshold is 50% of new tools needing state (D3 + D7).
LangGraph + Temporal: The Production Sandwich
The most consequential structural fact in the source is one sentence: “LLM-native runtime ≠ general durable execution; you need both, stacked” [ai-gs2026]. LangGraph gives loop semantics that match how agents actually think — node-by-node state with checkpointable transitions, replay on resume, branching for multi-agent coordination. Temporal gives the crash-replay-HITL substrate that decades of distributed systems have hardened — durable timers, deterministic activity replay, cross-region failover, signal-driven HITL pauses. Neither is the other. The 2026 production pattern stacks both [ai-gs2026].
The cross-references in the primary source are concrete. OpenAI Codex ships on Temporal [ai-gs2026]. Pydantic AI v1 has first-class Temporal adapters [ai-gs2026]. Klarna, Uber, LinkedIn, and Replit run LangGraph on Postgres in production [ai-gs2026]. These are not exotic combinations — they are the proven middle of the road.
D4 governs when Temporal joins the stack. The default for the POC is LangGraph alone [ai-gs2026]. Three switch triggers escalate to LangGraph + Temporal: a workflow that may run longer than 24 hours, an SLA penalty on missed execution, or a cross-region failover requirement [ai-gs2026]. None of those is hypothetical for a production agent platform once the first few workflows ship — but none of them is required on day one either. Day-one Temporal is an over-investment that delays the POC by weeks. Deferring it to a trigger is the move that lets W1 ship.
I3 WorkflowEngine is therefore an optional interface — defined day one, implemented when the trigger fires [ai-gs2026]. The contract is in the source, the package shell is scaffolded in W1, the backend lands the week after the trigger event. The architectural discipline this enforces: when the team eventually needs Temporal, the migration is one new backend in one package, not a runtime rewrite.
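A sketch of what "defined day one, implemented when the trigger fires" looks like in code. The method name and shape are illustrative assumptions, not the source's I3 signatures; the point is that the contract and its tests exist from W1 while the only backend refuses loudly.

```typescript
// I3 WorkflowEngine: the contract is scaffolded in W1.
interface WorkflowEngine {
  startDurable(workflowId: string, input: unknown): Promise<string>;
}

// The only W1 "backend" throws, so any accidental pre-trigger use is loud.
// When D4 fires, a Temporal backend lands in this package and nothing else moves.
class NotYetEngaged implements WorkflowEngine {
  startDurable(workflowId: string, _input: unknown): Promise<string> {
    // D4 triggers: workflow >24h, SLA penalty, or cross-region failover.
    throw new Error(`WorkflowEngine not engaged (D4 has not fired); refusing ${workflowId}`);
  }
}
```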
Takeaway: LangGraph for loop semantics; Temporal for durable execution; you eventually need both, stacked. D4 trigger says when. Day-one Temporal is over-investment; day-one interface for Temporal is the cheap insurance that makes day-N migration mechanical (D4).
The 10 Decision Triggers (D1–D10)
Numericized decisions are how the source replaces “it depends” with operator-grade specificity. Each decision has a default, a switch trigger that is measurable, and a known difficulty rating for the change [ai-gs2026].
| # | Decision | Default | Switch trigger |
|---|---|---|---|
| D1 | HITL transport | Slack | >3 personas OR audit-mandate OR Slack rate-limit |
| D2 | Phoenix self-host vs cloud | Cloud | Trace volume >10M spans/mo OR compliance OR cloud bill >$2k/mo |
| D3 | Sandbox lane defaults | Daytona (persistent) + E2B (ephemeral) | OpenComputer if Apache-2.0 purist; Tensorlake when first HIPAA workflow appears |
| D4 | WorkflowEngine timing | LangGraph alone (POC) | Workflow >24h OR SLA penalty OR cross-region failover |
| D5 | HITL UI | Slack bot + minimal Astro | ≥3 reviewer-distinct workflows OR external reviewer |
| D6 | Per-workflow LLM-call cap | 50 calls / workflow | Raise to 200 for long-running analysis; lower to 25 if workflow uses fewer than 10 in practice |
| D7 | Default persistence_class for new tools | ephemeral unless cross-call state needed | Switch to persistent_session when >50% of new tools need state |
| D8 | Postgres single vs split | Single (POC) | Single-interface QPS >1k/s OR ops on one interface blocks others |
| D9 | Default ModelRouter backend | Anthropic-only B1 (POC) | Add OpenRouter/Portkey when >1 closed provider OR open-weight enabled OR sensitivity-routing required |
| D10 | Open-weight share threshold | 0% during POC; enable for cheap_classify first in v2 | (a) eval-vs-Sonnet ≤2pp delta on the golden set, OR (b) cost ceiling D6 forces it, OR (c) compliance forces self-host (PHI on vLLM) |
Every default is justified by the POC’s cost-of-being-wrong, not by a “best” judgment — Slack is the HITL default because it costs nothing for ≤3 personas; Phoenix Cloud is default because self-hosting a trace backend pre-PMF is a tax on the wrong layer; Anthropic-only is default because one provider is one less thing to debug while the foundation lands [ai-gs2026]. Trigger conditions are measurable, not vibes. “Workflow >24h” is observable in a single span. “QPS >1k/s on a single interface” is one Postgres metric. “Eval-vs-Sonnet ≤2pp delta” is the EvalGate output. The triggers are queries, not “we’ll know it when we see it.”
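To make "triggers are queries" concrete, here is a sketch of three triggers as plain predicates over observable metrics. The metric field names are assumptions; the thresholds are the ones the decision table states.

```typescript
// Observable platform metrics; each field maps to one span, one Postgres
// metric, or one dashboard query.
interface PlatformMetrics {
  maxWorkflowHours: number;     // longest workflow span observed (D4)
  traceSpansPerMonth: number;   // trace volume (D2)
  singleInterfaceQps: number;   // hottest interface on shared Postgres (D8)
}

// Each trigger is a measurable predicate, not a judgment call.
function firedTriggers(m: PlatformMetrics): string[] {
  const fired: string[] = [];
  if (m.maxWorkflowHours > 24) fired.push("D4: engage Temporal");
  if (m.traceSpansPerMonth > 10_000_000) fired.push("D2: consider Phoenix self-host");
  if (m.singleInterfaceQps > 1_000) fired.push("D8: split Postgres");
  return fired;
}
```

Running this weekly against real metrics turns the decision table into an operator checklist instead of a debate.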
Takeaway: Ten decisions with measurable triggers replace “it depends” with operator queries. Defaults optimize for POC speed and known-difficulty migration paths; triggers are observable, not vibes. The table is the operator’s cheat sheet for which knob to turn when.
Build vs Buy: Own Interfaces, Rent Backends
Build-vs-buy collapses to one heuristic under interface-design framing: write the interface; rent the backend [ai-gs2026]. Pricing is not a moat — every vendor in this space is on a cost curve that compresses each quarter. Interface stability is the moat for the platform team, in the same shape that org-encoded execution is the moat for the business (see Ch09). Both compound; both are non-transferable.
The packaging discipline is the architecture. The foundation lives in packages/foundation/<interface>/ and exposes a contract plus an in-repo mock backend [ai-gs2026]. Real backends live under packages/foundation/<interface>/backends/<vendor>/, wired by environment variable. Agent code imports the interface, never the vendor SDK. A CI grep test enforces that no provider SDK is imported outside the backends directory — the I12 INV-1 rule in the source [ai-gs2026]. The same discipline applies to I1 sandbox SDKs, I2 LangGraph adapters, I4 OTel exporters. Vendor surface area is allowed exactly one place; everywhere else it is grep-banned.
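A sketch of the env-wired backend selection, using I4 TraceSink as the example. The variable name, backend ids, and the one-method interface are illustrative assumptions; the pattern is what matters: agent code calls the factory, never a vendor SDK.

```typescript
// The interface agent code imports.
interface TraceSink {
  emitSpan(name: string): void;
}

// One factory per backend, each living under backends/<vendor>/ in practice.
const backends: Record<string, () => TraceSink> = {
  phoenix: () => ({ emitSpan: (name) => console.log(`[phoenix] ${name}`) }),
  langfuse_stub: () => ({ emitSpan: (name) => console.log(`[langfuse] ${name}`) }),
};

// The vendor is an env flip, not a code change.
function makeTraceSink(env: Record<string, string | undefined>): TraceSink {
  const pick = env["TRACE_SINK_BACKEND"] ?? "phoenix";
  const factory = backends[pick];
  if (!factory) throw new Error(`unknown TraceSink backend: ${pick}`);
  return factory();
}
```

The W3 swap test #2 ("Langfuse stub flip works") is exactly this: change the env value, rerun the contract tests, confirm green.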
Stripe’s pattern from Ch09 is the analogy worth holding [boh-p3]. The MCP catalog encodes 400+ internal tools — none of those tools is the Stripe moat; the surfacing layer, the encoded ordering, and the years of accumulated invocation patterns are. The foundation here is the same shape one layer down. Daytona is not your moat; the I1 ToolRunner lane router that knows which persistence_class your tools declare is. The platform IP is the contract surface; everything below it is interchangeable, and that interchangeability is the value [ai-gs2026].
The honest part of this framing is the cost. Every interface package is more code than calling a vendor SDK directly, and every CI test is one more thing to maintain. The payoff lands in W6, not W1 — the day a different engineer ships W2 by writing new tools and graph nodes only, with zero changes to the foundation packages and a backend swap proven by an empty diff [ai-gs2026]. If that day does not arrive, the foundation has leaked, and the fix is package-boundary repair before declaring done.
Takeaway: Own interfaces; rent backends. Vendor SDK surface area lives in exactly one package per interface, enforced by a CI grep test. The cost is highest in W1; the receipt is the empty-diff swap in W6. Pricing is not a moat; interface stability is.
The Six-Week Plan
The plan ships in six weeks plus two days of W0 prep, with a three-person team [ai-gs2026]. Each week names which interfaces ship, which workflow drives the work, and the receipt that proves the week is done.
```
Wk │ Scope                               │ Interfaces shipped │ Receipt
───┼─────────────────────────────────────┼────────────────────┼──────────────────────────────
W0 │ 2-day prep: decisions + SDR slot    │ —                  │ Decisions doc committed
W1 │ Foundation skeleton, no agent yet;  │ I1, I2, I7–I12     │ 8 contract tests green;
   │ Postgres + LangGraph dev setup      │ (8 of 12)          │ INV-1 grep test green
W2 │ Sandbox + tools — I1 real backends; │ I1 backends live   │ Tool replay test passes;
   │ Daytona + E2B; replay_class live    │                    │ Swap test #1: Daytona →
   │                                     │                    │ OpenComputer flip works
W3 │ Observability + eval;               │ I4, I5             │ Trace UI shows agent loop;
   │ Phoenix + OTel + S3/R2 dump;        │                    │ Swap test #2: Langfuse
   │ Phoenix Evals + golden dataset      │                    │ stub flip works
W4 │ HITL + promotion gate;              │ I6 (no I3 yet)     │ Pause → Slack → resume
   │ Postgres-LISTEN; Slack approval;    │                    │ hits cache, no re-LLM;
   │ agent-eval.yml on PR                │                    │ degraded PR fails CI
W5 │ Second workflow (W2 support-ticket  │ — (consume only)   │ Dogfooded; bugs filed
   │ triage); same foundation            │                    │ against foundation
W6 │ Empty-diff swap proof;              │ — (consume only)   │ git diff packages/foundation/
   │ different engineer ships W2;        │                    │ = EMPTY;
   │ two backend swaps                   │                    │ W2 happy-path ≤3 days
```
W0 prep resolves D2 (Phoenix mode), picks the I8 backend, commits the promotion contract YAML, and books an SDR for the wk-3 golden-dataset labeling slot. The W3–W4 critical paths run in parallel; the third engineer floats between the gate and HITL.
W0 is non-negotiable — the two days resolve four decisions that block W1 (Phoenix mode, I8 backend pick, promotion contract YAML, golden-dataset labeling slot); shipping them in W1 instead means W1 stops twice waiting for a human decision [ai-gs2026]. W1 ships eight of twelve interfaces — I4 TraceSink, I5 EvalGate, and I6 HITLBroker defer to W3–W4 because their upstream interfaces have to be in place first, and I3 WorkflowEngine stays a scaffolded shell until the D4 trigger fires [ai-gs2026]. W5 dogfoods — running a second real workflow (support-ticket triage, mostly stateless) on the W1 foundation surfaces interface bugs the W1 lead would miss because they wrote the interface. Bugs filed in W5 are fixed in W5; W6 is reserved for the swap proof, not for foundation patches [ai-gs2026].
The week-by-week receipts are deliberately uncomfortable to fake. “Tool replay test passes” means killing the workflow mid-flight, resuming, and confirming send_email does not fire twice. “Trace UI shows agent loop” means a screenshot of the agent’s node-by-node execution in Phoenix, replayable from the S3/R2 dump. “Degraded PR fails CI” means a deliberately-degraded prompt PR was created and the regression gate caught it. The receipts are the falsifiable bar per week; the plan does not declare done on prose [ai-gs2026].
Takeaway: Six weeks plus two days of W0 prep; three engineers; the receipts are falsifiable per-week, not narrative. W1 ships eight of twelve interfaces — the rest follow the dependency order. W5 dogfoods; W6 proves the swap.
Empty-Diff Swap as the Handoff Test
The W6 receipt is the falsifiable proof that the foundation works. A different engineer — not the W1 lead — ships W2 (support-ticket triage) using only: new tool definitions, new graph nodes, a different sandbox backend env (E2B-ephemeral instead of Daytona-persistent for at least one tool), and a second ModelRouter backend added alongside Anthropic-only B1 so a task_class=cheap_classify call routes to a non-Anthropic model [ai-gs2026]. No changes to any of the twelve interface packages. The proof is mechanical: git diff packages/foundation/ between the W1-merge tag and the W2-ship tag must be empty [ai-gs2026].
The bar matters because it is the only test that detects a leaky foundation. A foundation with leaks looks identical to one without until the second workflow lands — and by then, the leak has become the assumption, and ripping it out is a refactor instead of a fix. Empty-diff catches it the week it could be fixed. If the W2 implementation requires any change to a foundation package — a new interface method, a different contract shape, a backend-specific assumption that bled into agent code — the foundation has leaked. The fix is before declaring W6 done, not after [ai-gs2026].
```
W1 merge ──────────────────────────────────────────▶ W2 ship

Engineer (different from W1 lead) writes ONLY:
─────────────────────────────────────────────
• packages/workflows/support_ticket/tools/*.ts
• packages/workflows/support_ticket/graph.ts
• .env (sandbox lane: DAYTONA → E2B for one tool)
• .env (model-router: + Vercel AI SDK as B2;
        task_class=cheap_classify routes off-Anthropic)

Then runs:
$ git diff packages/foundation/ W1-merge..W2-ship

Acceptance:
┌─────────────────────────────────────────────┐
│ empty diff  → foundation HOLDS  🟢          │
│ non-empty   → foundation LEAKED 🔴          │
└─────────────────────────────────────────────┘

Plus three runtime receipts:
• W2 happy-path complete in dev in ≤3 working days
• HITL pause + resume works on W2
• ModelRouter second-backend swap works for ≥1 task_class

If diff is non-empty: name the leaked package, file the fix, re-run.
Do NOT ship W6 with a leaked interface — it normalizes the leak.
```
The runtime receipts that ride alongside the empty diff are equally specific [ai-gs2026]. W2 happy-path complete in ≤3 working days proves the agent-development velocity the foundation enables. HITL pause+resume on W2 proves the I6 contract holds across workflows. ModelRouter second-backend swap on at least one task_class proves I12 holds with more than one provider. Four receipts; one week; if any fails, the foundation needs the fix before W6 closes, not after.
Takeaway: Empty-diff swap is the only test that catches a leaky foundation in time to fix it. Different engineer; new tools and graph; backend env flips; zero foundation-package changes. The diff is the proof — and a non-empty diff is the cue to fix the leak before declaring done [ai-gs2026].
Do This, Not That
| Pattern | Naive | Correct | Why |
|---|---|---|---|
| What you own | A stack pick — sandbox + runtime + obs vendors | The 12 interfaces (I1–I12) and their contracts; the vendors are env-flips | Picks today are proven defaults, not commitments; the moat is interface stability [ai-gs2026] |
| Where vendor SDKs live | Imported anywhere in agent code | One package per interface, under backends/<vendor>/; agent code imports the interface | CI grep test enforces; without it, “swap claim” is aspirational [ai-gs2026] |
| Sandbox architecture | Pick one — persistent OR ephemeral | Two co-equal lanes; route per-tool via persistence_class; new tools default to ephemeral (D7) | Mixed workloads are reality; routing per call beats one-size-fits-all [ai-gs2026] |
| Workflow engine timing | Stand up Temporal day one for “production readiness” | LangGraph alone for POC; engage Temporal when D4 trigger fires (>24h workflow OR SLA penalty OR cross-region failover). When this breaks: if the first workflow has a regulatory SLA, engage Temporal day one; eating two weeks now beats a runtime rewrite later [ai-gs2026] | Day-one Temporal is over-investment; the I3 interface is the cheap insurance |
| LLM judge for the eval gate | Same family as the agent (Sonnet-judges-Sonnet, GPT-4-judges-GPT-4o) | Cross-family judge enforced at I12 routing rule; task_class=judge MUST select different family from agent’s I12 binding | Same-family judging inflates pass rate 5–10pp via self-preference (arxiv 2410.21819); INV-3 in I5 + routing rule in I12 [ai-gs2026] |
| Sandbox for HIPAA workflows | Self-host Firecracker as DIY (“looks tractable”) | Tensorlake (managed Firecracker with HIPAA + SOC2 Type II + EU residency in the box) | Self-host Firecracker is 12–18 months plus a security hire per the source’s vendor survey [ai-gs2026] |
| Postgres-as-everything | One DB, foreign keys across I2 / I6 / tool_call_results / I8 | One DB by default (POC); zero cross-interface FKs; CI test confirms split-by-changing-connection-string is 🟢 | Cross-FKs make the “swap” claim aspirational; the no-cross-FK rule is what keeps D8 actually executable [ai-gs2026] |
| Promotion-gate regression threshold | N=50 + 3pp threshold | Rolling N=150 unless release tag; override path with audit row | N=50 + 3pp flakes ~30% on neutral PRs; rolling N stabilizes signal-to-noise [ai-gs2026] |
| Cost / runaway protection | Per-call cost cap | Per-workflow LLM-call cap (D6 default 50); plus I10 CircuitBreaker per-workflow cumulative-cost cap; alert at 50%, hard kill at cap | Sub-agents spawn sub-agents; per-call caps miss recursive runaway [ai-gs2026] |
| Empty-diff swap declaration | W1 lead implements W2 themselves; declares “swap works” | Different engineer implements W2; git diff packages/foundation/ between W1-merge and W2-ship must be empty | Only a different engineer surfaces the assumptions the W1 lead absorbed silently [ai-gs2026] |
Takeaway: Own interfaces; route by declaration; cross-family judge at I12; defer Temporal until D4 fires; no cross-FKs in Postgres; stabilize the gate with rolling N; runaway protection at the workflow level; prove the swap with a different engineer.
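The cross-family judge rule from the table can be shown as a routing predicate. This is a sketch of the INV-3 enforcement point at I12; the family names and candidate-pool shape are illustrative assumptions, and the fail-closed behavior is the design choice the invariant implies.

```typescript
// task_class=judge must route to a different model family than the
// agent's own I12 binding.
type Family = "anthropic" | "openai" | "open_weight";

function routeJudge(agentFamily: Family, candidates: Family[]): Family {
  const crossFamily = candidates.find((f) => f !== agentFamily);
  if (!crossFamily) {
    // Fail closed: a same-family judge inflates pass rates 5–10pp,
    // so refusing the eval run beats reporting an inflated number.
    throw new Error("INV-3: no cross-family judge available for eval run");
  }
  return crossFamily;
}
```

Putting the rule here, rather than inside I5 EvalGate, means every eval run inherits it for free, including runs that predate any multi-provider rollout.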
Gotchas
| Symptom | Cause | Fix |
|---|---|---|
| In-flight LangGraph runs die silently after a minor version bump | LangGraph checkpointer schema migration with no built-in migration tool — issue #536 in the source [ai-gs2026] | Versioned graph IDs (name@semver); blue/green deploy; CI replay test on the last 50 production checkpoints; never share a graph_id across breaking schema changes [ai-gs2026] |
| send_email fires twice when a workflow resumes after a HITL pause | Default LangGraph behavior replays nodes — including tool calls — on resume; the tool’s side effect was not classified | Declare every tool’s replay_class at registration (pure / idempotent_with_key / unsafe_on_replay); wire the tool_call_results cache; unsafe_on_replay throws hard on the second call [ai-gs2026] |
| Plan to “self-host Firecracker because it looks tractable” enters W1 | Underestimating sandbox-vendor scope; the vendor survey in the source clocks it at 12–18 months plus a security hire | Reject self-hosted Firecracker for the POC; use Tensorlake when the first HIPAA workflow appears (D3 switch trigger); document the rejection in the §“Rejected” section of the decisions doc [ai-gs2026] |
| EU customer fails compliance review months after sandbox pick | Vercel Sandbox is iad1-only; documented but easy to miss until the review | Reject Vercel Sandbox at the sandbox-pick stage; pre-stage Tensorlake (EU residency in the box) as the swap candidate; document in §5f sandbox call-outs [ai-gs2026] |
| Eval gate reports 95% pass; the agent ships and regresses on real traffic | Same-family LLM judge inflated the pass rate by 5–10pp via lower-perplexity self-preference (arxiv 2410.21819) | Enforce cross-family judge at the I12 routing rule (task_class=judge → different family); I5 INV-3 verifies at every eval run [ai-gs2026] |
| Regression gate blocks ~30% of neutral PRs | N=50 + 3pp threshold lacks statistical power on the typical-case PR | Switch to rolling N=150 unless a release tag forces a smaller sample; document the override path with an audit row [ai-gs2026] |
| “Just swap Postgres for DynamoDB” turns into a multi-week refactor | Cross-interface foreign keys between I2, I6, tool_call_results, and I8 tokens turned the “swap” claim aspirational | No-cross-FK rule across interfaces; CI test confirms split-by-changing-connection-string is 🟢; pay the up-front discipline tax to keep the swap option real [ai-gs2026] |
| Agent recursively retries the same path until the per-call cap trips, burning thousands of tokens | LLM agents lack built-in cycle detection; LoopDetection middleware was not in the agent loop | I10 RateLimiter per-workflow LLM-call cap (D6 default 50); LoopDetection middleware in the agent loop; audit row whenever it triggers [ai-gs2026] |
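The replay-safety fix in the second row can be sketched as a small executor. The cache shape and the isReplay flag are illustrative assumptions; the replay_class names follow the text.

```typescript
// Replay-safe tool execution: completed calls are served from the
// tool_call_results cache on resume, and an uncached unsafe_on_replay
// call during a replay fails hard instead of re-firing its side effect.
type ReplayClass = "pure" | "idempotent_with_key" | "unsafe_on_replay";

const toolCallResults = new Map<string, unknown>(); // keyed by tool-call id

function runTool(
  callId: string,
  replayClass: ReplayClass,
  effect: () => unknown,
  isReplay = false,
): unknown {
  // Resume path: a completed call is served from cache, never re-executed.
  if (toolCallResults.has(callId)) return toolCallResults.get(callId);
  // A replay that reaches an unsafe tool with no cached result must not re-fire.
  if (isReplay && replayClass === "unsafe_on_replay") {
    throw new Error(`replay would re-fire unsafe_on_replay call ${callId}`);
  }
  const result = effect();
  toolCallResults.set(callId, result);
  return result;
}
```

The W4 receipt ("resume hits cache, no re-LLM") and the send_email gotcha are both exercises of these two branches.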
Takeaway: Most gotchas reduce to: an architectural invariant that is true in theory (replay safety, swap-ability, statistical power, cross-family judging) but only enforced in practice if the foundation has a specific piece of plumbing (replay-class declaration, no-cross-FK rule, rolling-N gate, I12 routing rule). The plumbing is the chapter; the invariant is the receipt.
What This Chapter Teaches About the Rest of the Series
The build-your-own playbook is the operator instantiation of the invariants the rest of the series named — replay safety (Ch05) enforced by replay_class + tool_call_results behind I1/I2, prompt-cache stability (Ch07) preserved across I12 swaps [cci2026-gems, §2], the session-memory loop (Ch08) depositing into the substrate the foundation packages expose, and the org-harness moat (Ch09) compounding on top of the six-week foundation [boh-p3] [ai-gs2026]. Ch12 — Pitfalls and Anti-Patterns takes the seven pitfalls this chapter listed and inspects the operational shapes they take when teams hit them.
Takeaway: One architecture, five chapters of substrate, one operator playbook. The 6-week plan instantiates the invariants the prior chapters named; the moat compounds on top of the foundation. Ch12 takes the seven pitfalls and walks the operational shapes.
One question for the reader: Could you point to (a) the package directory that owns each of the 12 interfaces in your harness, with mock backends + a contract test; (b) the CI grep test that blocks vendor SDK imports outside backends/<vendor>/; (c) the tool_call_results cache table and the replay_class annotation on every registered tool; and (d) the W6-style swap proof — a git diff of your foundation packages between the day W1 merged and the day W2 shipped, showing the diff is empty? If any of the four is missing, the foundation is a stack-pick wearing interface-design vocabulary, not the substrate the rest of the series builds on.
References
- [ai-gs2026] tacit-web/research/agent-infra/03-gold-seams.md — Gold-seams source map for production-agent harness patterns, 2026-04-27, with §-pointers into tacit-web/research/agent-infra/PLAN.md v3.3. Primary source for: the 5 core concepts (foundation-as-interfaces, LangGraph+Temporal stacking, persistent+ephemeral lanes co-equal, replay safety as “stop redundant work” mechanism, cross-family judge mandate); the 7 pitfalls (LangGraph checkpointer schema migration / issue #536, tool replay duplicates side effects, self-host Firecracker as DIY, Vercel Sandbox iad1-only EU landmine, same-family judge inflates 5–10pp / arxiv 2410.21819, regression-gate stat-power N=50+3pp flakes, Postgres-as-everything cross-FK trap); the 10 decision triggers D1–D10; the 12 interfaces I1–I12 with method signatures and invariants from PLAN §5a + §5b; the W0+W1–W6 plan from PLAN §8; the empty-diff swap acceptance criterion; the named vendor defaults (Daytona, E2B, OpenComputer, Tensorlake, LangGraph, Postgres, Temporal, Phoenix, OpenLLMetry, Postgres-LISTEN, Slack, Anthropic, OpenRouter, Portkey, LiteLLM, Vercel AI SDK, LangChain ChatModel, GitHub Actions, Pydantic AI, RouteLLM, NotDiamond, Martian, Llama-3.3-70B, Claude Sonnet, Claude Opus, GPT-4o, OpenAI Codex on Temporal GA 2026-03-23, Klarna / Uber / LinkedIn / Replit on LangGraph+Postgres); the build-vs-buy matrix from PLAN §7; the package-boundary CI grep test (I12 INV-1); the no-cross-FK Postgres discipline from PLAN §5d.
- [boh-p3] tacit-web/research/building-org-harness/phase3-compounding-moat.md — Phase 3 source map dated 2026-05. Cited inline in §“Build vs Buy” and §“What This Chapter Teaches About the Rest of the Series” for the Stripe MCP catalog as the org-context-moat analog one layer up, and for the compounding-moat framing the W6 foundation enables.
- [cci2026-gems] tacit-web/research/cc-internals/src-analysis-07-hidden-gems.md, §2 “PROMPT CACHE STABILITY — Obsessive Engineering,” 2026-04-01. Cited inline in §“What This Chapter Teaches About the Rest of the Series” for the prefix-side cache-stability invariant that I12 ModelRouter must preserve across provider swaps.
Next chapter: 12 — Pitfalls and Anti-Patterns