
Replay Safety: The Bug That Breaks Every HITL Workflow

Summary

Your agent crashed mid-workflow. On resume, it sent the same email twice. LangGraph replays tool calls by default — and that default is wrong for half the tools in your harness. The fix is a three-class taxonomy and a tool-result cache, and it has to be in your harness on day one, not day fifty.

This is Part 05 of the Harness Engineering deep dive. Prerequisite: Part 04: Coordinator Mode — the worker-crash failure mode that chapter defers is the failure mode this chapter solves.

Workflow resume: three tool classes, three replay actions

Workflow crashes mid-execution. On resume, the three tool classes route differently. The default is wrong for half of them.

Why This Matters

A five-step human-in-the-loop workflow runs in a graph runtime. Read context, plan, send an email, wait for human approval, commit the change. The runtime saves a checkpoint after each step. Between step three and step four — after the email has gone out, before approval is recorded — the worker process crashes. The on-call resumes the workflow from the last checkpoint. The runtime, doing exactly what it was designed to do, replays the pending tool call at step three. The customer receives the same email twice. This is the bug. It is not a tail event. It is the default behaviour of LangGraph and similar graph runtimes with checkpoint-based resume [ai-gs2026, §6.2].

Most public HITL content gets it wrong in three ways. First, “just retry on failure” (the reliability-engineering framing) assumes tool calls are pure functions. That assumption fails for send_email, charge_card, and post_tweet. Second, “the checkpointer handles it” (the LangGraph-documentation framing) treats persistence as solved and ignores that the checkpointer records what should run next, not what has already happened. Third, “make all tools idempotent” (the Stripe-pattern generalization) is right for network APIs but does not fit the half of an agent’s toolset that is local filesystem, internal RPCs, or third-party endpoints with no idempotency-key support. None of those framings tells you what your harness should do when the workflow resumes and the next tool is the one that already fired.

The correct fix is structural, not heroic. The harness classifies every tool into one of three replay classes, records every tool result in a cache keyed by (workflow_id, node_id, checkpoint_id, tool_name, input_hash), and on resume consults the cache before re-executing anything with side effects. For the unsafe class, the runtime is allowed to do exactly two things — return the cached result, or throw — and never to silently re-execute. This pattern is enforced by a CI test that replays the last 50 production checkpoints on every pull request, so a regression in replay correctness gets caught at the gate rather than in front of a paying customer [ai-gs2026, §6.5].

Takeaway: Replay is not retry, the checkpointer does not solve it, and idempotency keys do not generalize. The fix is a tool-side-effect taxonomy plus a result cache, owned by the harness.

The Bug in One Paragraph

The workflow has five nodes: read_context, plan, send_email, wait_for_approval, commit. The graph runtime saves a checkpoint at each node boundary. At node 3, the runtime asks the agent for its next tool call, gets back send_email(to="...", body="..."), marks the call as pending_tools on the checkpoint, dispatches the call, and waits for the result. The email API returns 200, the email is sent — and then, before the runtime writes the result back into state and advances to node 4, the worker process is killed. SIGKILL from the OOM-killer, a deploy bouncing the pod, or a network partition between the worker and the database. The last durable checkpoint still has send_email listed in pending_tools. The runtime resumes. It sees pending tools at node 3. It re-dispatches them. The email fires twice. Nothing in the default loop notices that step 3 was already done, because nothing in the default loop has the receipt — the email API responded, but that response is not in the checkpoint [ai-gs2026, §6.2; pa2025-checkpointing].
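For concreteness, here is a sketch of the checkpoint shape the paragraph above implies. The field names are illustrative, not LangGraph's actual serialization:

// Illustrative shape only -- mirrors the prose above, not LangGraph's
// real checkpoint schema.
type PendingToolCall = {
  name: string;    // e.g. "send_email"
  input: unknown;  // the arguments the model produced
};

type Checkpoint = {
  workflow_id: string;
  checkpoint_id: string;
  node_id: string;                   // node the run was at when saved
  state: Record<string, unknown>;    // accumulated workflow state
  pending_tools: PendingToolCall[];  // dispatched, result not yet recorded
};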

This chapter does not pretend to close the pre-commit window itself. If the process dies after the side effect fires but before the cache-plus-checkpoint transaction commits, replay re-executes — there is nothing the harness can do at that point, because the durable record of the side effect was lost with the process. Two-phase commit to external APIs is the only fully general solution, and external APIs almost never support it. What the design does close is the much larger window between “cache+checkpoint committed” and “next node advanced,” which is where almost all real worker-crash replays land.

Takeaway: The window between “side effect fired” and “side effect recorded in checkpoint” is non-zero and unavoidable. Replay safety is what you do inside that window.

Why the Default Is Wrong

Graph runtimes inherit a mental model from deterministic dataflow. Airflow, Dagster, plain-old workflow engines — they all assume a node is a pure function of its inputs, and they all assume re-running a node yields the same output. Replay-by-default is the right answer for that world: if the worker dies, replay the node, get the same result, move on. The agent world breaks both assumptions at the same time.

The first broken assumption is that tool calls have no side effects. In a typical agent harness, the tool inventory is roughly half network actions (send_email, charge_card, create_issue, post_slack), a quarter read-only lookups (select_from, fs.read, web_search), and a quarter local mutations (fs.write, git_commit, cache_put). Only the read-only quarter is genuinely safe to replay without thought. The other three quarters are either non-idempotent at the upstream API or only idempotent if the harness participates — by passing an idempotency key, by checking a dedup table, by structuring the input to be content-addressed.

The second broken assumption is that re-running yields the same result. Even when a tool is technically idempotent, the cost of re-running is not zero. A web search that re-runs at LLM-summary time burns the same token budget every replay. A read_file that returns 200KB of source is cheap on the local filesystem and expensive in the language model that has to re-summarize it on each replay-walk. The default replay strategy treats LLM token spend as if it were CPU cycles, and it is not — every token re-fed through the model is real money [pa2025-idempotency].

The runtime cannot fix this on its own. It cannot tell that acme.send_email (an unrecallable side effect at a third party) is a different animal from local.draft_email (a harmless local mutation). It cannot infer the difference from the tool name, from the schema, from the HTTP verb, or from whether the response carried a Stripe-style idempotency-key header. The harness has to tell it, per tool, what the replay rule is. That is the missing layer, and the harness is the only place in the stack where the knowledge exists.

Takeaway: The default treats tool calls as pure functions of their inputs. They are not. Side effects + token cost together force a per-tool replay rule, declared by the harness.

The Three-Class Taxonomy

Every tool registered with the harness carries a replay_class field. The taxonomy in the agent-infra reference is three named values [ai-gs2026, §6.2]:

type ReplayClass =
  | 'pure'                   // f(x) = f(x); deterministic, no side effects
  | 'idempotent_with_key'    // safe to retry IF caller passes the same idem key
  | 'unsafe_on_replay';      // side effect cannot be deduplicated by the upstream

type ToolDefinition = {
  name: string;
  replay_class: ReplayClass;
  idempotency_key_fn?: (input: unknown) => string;  // required iff idempotent_with_key
  // ... schema, policy tags, rate-limit class, etc.
};

pure is the read-only / deterministic class. SQL SELECT, fs.read, in-memory transformations, search APIs that are content-stable for the same query, anything where the output depends only on the input and no external state is mutated. The replay rule is permissive: re-running is correct by definition. The harness still records the result in the cache, because re-running is not free — it costs LLM tokens to re-summarize, and the point of caching is to skip the token cost on replay walks, not to prevent correctness errors.

idempotent_with_key is the network-action class that the upstream API has designed for retry. Stripe’s idempotency-key header is the canonical example: the same key on the same charge body returns the same charge object, not a new one. The harness’s job here is to compute a stable key from the tool input — idempotency_key_fn(input) — and pass it through to the upstream. The replay rule is: re-execute is allowed, provided the harness can guarantee the same key. If the key would change across replays (because it was derived from a timestamp or a random nonce), the tool was misclassified and the cache is the only safe path. See [pa2025-idempotency] for the key-generation rules.

unsafe_on_replay is the class where re-execution can never be allowed to fire blindly. Slack chat.postMessage without a thread_ts, an outbound webhook to a partner that does not honour idempotency, a git push, a payment to a counterparty that has no dedup contract. The runtime is allowed exactly two actions on replay: return the cached result if one exists, or throw ReplayUnsafeError. There is no third option that re-executes “carefully.” Carefulness is what the cache is for.

| Replay class | Typical examples | Replay action | Required harness contract |
|---|---|---|---|
| pure | select_from, fs.read, web_search, count_tokens | re-execute or cache-hit (cache for token savings) | none beyond schema |
| idempotent_with_key | stripe.charge, s3.put_object with If-None-Match, dynamo.put_item with conditional expression | re-execute with stable key, or cache-hit | idempotency_key_fn registered |
| unsafe_on_replay | slack.post (no thread), acme.send_email, partner webhook, git push | cache-hit if present, else throw ReplayUnsafeError | result cached pre-commit; HITL surface for unsafe-replay errors |

The classification is declared at registration time, not inferred at run time. The runtime cannot read the tool’s documentation and decide whether it is safe. The agent author makes the call when they register the tool and the harness enforces it on every dispatch. Misclassification is the failure mode and it is detectable — a CI invariant rejects any deployed graph that uses a tool without a replay_class field [ai-gs2026, §6.2].
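A registration sketch, building on the ToolDefinition type above. The registerTool helper and registry are assumptions, and sha256/canonicalJson are the hashing helpers sketched in the next section; the harness can additionally fold workflow_id and node_id into the key at dispatch time:

const registry = new Map<string, ToolDefinition>();

function registerTool(def: ToolDefinition) {
  // Mirror the CI invariant at runtime: idempotent_with_key without a
  // key function is a misclassification waiting to happen.
  if (def.replay_class === 'idempotent_with_key' && !def.idempotency_key_fn) {
    throw new Error(`${def.name}: idempotent_with_key requires idempotency_key_fn`);
  }
  registry.set(def.name, def);
}

registerTool({ name: 'fs.read', replay_class: 'pure' });

registerTool({
  name: 'stripe.charge',
  replay_class: 'idempotent_with_key',
  // Derived only from workflow-stable inputs -- never a timestamp or a
  // retry counter, which would change across replays.
  idempotency_key_fn: (input) => sha256(canonicalJson(input)),
});

registerTool({ name: 'acme.send_email', replay_class: 'unsafe_on_replay' });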

Takeaway: Three classes, declared at registration. Pure and idempotent-with-key may re-execute; unsafe-on-replay may not. The harness, not the runtime, owns the classification.

The tool_call_results Cache

The cache is one table. Composite-key design, indexed for resume-time lookup, schema small enough to fit on a slide [ai-gs2026, §6.2]:

CREATE TABLE tool_call_results (
  workflow_id     TEXT NOT NULL,
  node_id         TEXT NOT NULL,
  checkpoint_id   TEXT NOT NULL,
  tool_name       TEXT NOT NULL,
  input_hash      TEXT NOT NULL,    -- sha256 of normalized input
  result_json     TEXT NOT NULL,    -- cached result
  created_at      INTEGER NOT NULL,
  PRIMARY KEY (workflow_id, node_id, checkpoint_id, tool_name, input_hash)
);
CREATE INDEX tcr_workflow ON tool_call_results(workflow_id, created_at);

Five fields make up the primary key, and each one earns its place. workflow_id scopes the cache to a run so two parallel workflows cannot collide. node_id scopes within the workflow so the same tool called at two different graph nodes is treated as two distinct calls — this matters when an agent does, say, fs.read of the same file at the planning node and again at the verification node. checkpoint_id scopes by graph version: a workflow that has migrated mid-flight to a new graph version does not get stale results from the prior version. tool_name is the obvious axis. input_hash is sha256(canonical_json(input)) — canonicalisation matters; a tool result keyed on a non-canonicalised JSON blob will miss on every replay because key ordering is unstable. Use a canonicalisation spec the team can point to — RFC 8785 (JSON Canonicalization Scheme, “JCS”) is the obvious choice, with reference implementations like the canonicaljson family of libraries — rather than rolling a bespoke serializer per service.
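A minimal sketch of the hash path. Recursive key-sorting covers the key-ordering instability described above, but this is not a full RFC 8785 implementation — JCS also pins number formatting and string escaping, so use a vetted library in production:

import { createHash } from 'node:crypto';

// Sort object keys recursively so the same logical input always
// serializes to the same string. NOT full RFC 8785.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalJson).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([k, v]) => `${JSON.stringify(k)}:${canonicalJson(v)}`);
    return `{${entries.join(',')}}`;
  }
  return JSON.stringify(value); // string | number | boolean | null
}

function sha256(s: string): string {
  return createHash('sha256').update(s).digest('hex');
}

// Same logical input, any key order, same hash.
const input_hash = sha256(canonicalJson({ to: 'a@b.com', body: 'hi' }));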

The write is atomic with the checkpoint write. The invariant from the agent-infra reference is explicit: save() of WorkflowState and the tool_call_results index update happen in a single database transaction [ai-gs2026, §I2-INV-3]. If the cache write commits and the checkpoint advance fails, on replay the cache is consulted and the tool returns the cached value — correctness preserved. If neither commits, on replay the tool re-executes — also correct, modulo the pre-commit window that this design cannot close. The window the design does close is the one that occurs after the side effect has fired and before the next node has advanced.
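A sketch of the atomic write, assuming better-sqlite3 — any store with transactions works, and the checkpoints table here is a stand-in for the runtime's own save():

import Database from 'better-sqlite3';

const db = new Database('harness.db');

// Cache row and checkpoint advance commit together or not at all.
const commitToolResult = db.transaction(
  (cp: Checkpoint, call: PendingToolCall, resultJson: string) => {
    db.prepare(
      `INSERT INTO tool_call_results
         (workflow_id, node_id, checkpoint_id, tool_name, input_hash,
          result_json, created_at)
       VALUES (?, ?, ?, ?, ?, ?, ?)`
    ).run(cp.workflow_id, cp.node_id, cp.checkpoint_id, call.name,
          sha256(canonicalJson(call.input)), resultJson,
          Math.floor(Date.now() / 1000));
    // Stand-in for the runtime's save(); illustrative table, not a real API.
    db.prepare(
      `UPDATE checkpoints SET state_json = ? WHERE checkpoint_id = ?`
    ).run(JSON.stringify(cp.state), cp.checkpoint_id);
  }
);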

TTL is a deployment decision, not a hard schema constant. The cache must outlive the longest workflow plus the longest human-approval window plus the operator’s debug horizon. For a HITL workflow that waits up to 72 hours for human approval, a 14-day TTL is conservative. For pure read-only tools, the cache could be evicted within the same workflow’s lifetime — token savings disappear after the workflow completes anyway. A reasonable deployment pattern is to TTL by replay_class — long for unsafe_on_replay (audit requirements drive retention; the cached result is the receipt that a human signed off on a forward-recovery decision), shorter for pure (where the cache is just a weak token-savings layer). The reference is silent on TTL strategy; TTL is operator judgment.
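One way to implement class-scoped TTL as a periodic job, building on the registry and db sketches above — the day counts are the judgment calls just described, not values from the reference:

const TTL_DAYS: Record<ReplayClass, number> = {
  pure: 1,                   // token-savings layer only; cheap to evict
  idempotent_with_key: 14,   // must outlive the longest HITL pause
  unsafe_on_replay: 90,      // audit retention: the cached row is the receipt
};

function evictExpiredCacheRows() {
  const now = Math.floor(Date.now() / 1000);
  const del = db.prepare(
    'DELETE FROM tool_call_results WHERE tool_name = ? AND created_at < ?');
  for (const [name, tool] of registry) {
    del.run(name, now - TTL_DAYS[tool.replay_class] * 86_400);
  }
}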

Takeaway: One table, five-field primary key, atomic with checkpoint writes, TTL by replay class. The cache is small and the discipline around it is what makes it correct.

What “Unsafe on Replay” Actually Does

The unsafe-on-replay branch is two lines of code in the resume path and a whole afternoon of design discipline. The branch [ai-gs2026, §6.2]:

on resume(workflow_id, checkpoint_id):
  checkpoint = load(workflow_id, checkpoint_id)
  for each tool_call in checkpoint.pending_tools:
    tool   = registry[tool_call.name]
    cached = lookup(tool_call_results,
                    workflow_id, checkpoint.node_id, checkpoint_id,
                    tool_call.name,
                    sha256(canonical_json(tool_call.input)))
    if cached:
      return cached.result_json            # NO RE-CALL
    else if tool.replay_class == 'unsafe_on_replay':
      throw ReplayUnsafeError(tool_call)   # surface to operator
    else:
      result = execute(tool_call)
      write_cache(tool_call, result)
      return result

The runtime has exactly two legal actions for an unsafe tool: return the cached result, or throw — and on a cache miss, only the throw remains. No third action exists: the harness has no way, at resume time, to know whether the side effect fired or not — that is the information the crash deleted. Returning a fabricated success is a correctness bug — the next node will proceed as if the email had gone out and the workflow will silently skip a side effect that may or may not have happened. Returning a fabricated failure is worse: it causes a compensating action (refund, retract, apologize) that may itself be a side effect, and now the user has been emailed twice and gets a “please ignore previous email” follow-up — two side effects to clean up instead of one. The only correctness-preserving move is to refuse and surface; the cost of refusing is one human deciding, the cost of guessing is unbounded.

ReplayUnsafeError should not be a silent retry-with-backoff. It is an operator-visible signal. The reference says only “surface to operator”; routing to the HITL queue with the original tool call attached is the cheapest implementation on a harness that already has a human-review path. A human decides whether the side effect happened — by checking the upstream system, the email provider’s dashboard, the partner’s audit log — and resolves the workflow forward. Uncomfortable but correct. The alternative — auto-retrying an unsafe tool — is a customer-visible failure every time the recovery decision was wrong.
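A sketch of the forward-recovery path once the operator has checked the upstream. resolveReplayUnsafe is hypothetical wiring, execute stands in for the harness's real dispatcher, and both verdicts end as cache rows so the next resume() is deterministic:

type OperatorVerdict =
  | { fired: true; result_json: string }  // effect happened: record the receipt
  | { fired: false };                     // effect did not happen: safe to run once

async function resolveReplayUnsafe(
  cp: Checkpoint, call: PendingToolCall, verdict: OperatorVerdict
): Promise<void> {
  if (verdict.fired) {
    // Write the receipt; the pending resume() becomes a cache hit.
    commitToolResult(cp, call, verdict.result_json);
  } else {
    // Human confirmed nothing fired; execute once, then cache.
    const result = await execute(call);
    commitToolResult(cp, call, JSON.stringify(result));
  }
}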

The discipline around classification matters most here. A tool that the team marked idempotent_with_key but whose upstream does not actually honour the key on the second call is, in practice, an unsafe_on_replay tool wearing a costume. The way to catch this is the CI replay test (next section) plus a smoke-test deployment that intentionally bounces the worker mid-call and inspects the upstream for duplicates. If duplicates appear, the tool is reclassified.
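That smoke test can be a few lines — a hedged sketch in which callUpstream and listUpstreamRecords are hypothetical probes against the upstream system:

// Fire the same call twice with the same key, then count what the
// upstream actually recorded. Two records => the upstream ignores the
// key => reclassify the tool as unsafe_on_replay.
async function smokeTestIdempotency(tool: ToolDefinition, input: unknown) {
  const key = tool.idempotency_key_fn!(input);
  await callUpstream(tool.name, input, key);
  await callUpstream(tool.name, input, key); // simulated replay
  const records = await listUpstreamRecords(tool.name, input);
  if (records.length > 1) {
    throw new Error(
      `${tool.name}: upstream ignored the idempotency key; ` +
      `reclassify as unsafe_on_replay`);
  }
}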

The pre-commit window is worth seeing on a timeline. The cache write and the checkpoint advance are inside one transaction; the side effect upstream is not. The unsafe zone is the span between “side effect fired at the upstream” and “the transaction commits locally”:

PRE-COMMIT WINDOW (where replay safety stops being recoverable)

time ──────────────────────────────────────────────────────────▶

t0            t1                t2                  t3
│             │                 │                   │
dispatch      side effect       cache write +       next node
tool_call     fires at          checkpoint          advances
              upstream API      advance in one TX

[........UNSAFE ZONE.........][...........SAFE ZONE...........]
crash here ⇒ replay re-fires    crash here ⇒ replay hits cache
(no receipt anywhere)           (receipt is the cache row)

Width of UNSAFE ZONE = network RTT to upstream + local DB latency

Mitigation in scope:     shrink t1→t2 (close the window, never to zero)
Mitigation out of scope: two-phase commit to external APIs

REPLAY DECISION TREE (per pending tool call on resume)

         pending tool call on resume
                     │
                     ▼
            compute input_hash
                     │
                     ▼
     ┌───────────────────────────────┐
     │ lookup(workflow_id, node_id,  │
     │  checkpoint_id, name, hash)   │
     └───────────────────────────────┘
                     │
          ┌──────────┴───────────┐
          ▼                      ▼
      cache hit              cache miss
          │                      │
          ▼                      ▼
   RETURN cached           replay_class ?
   (no side effect)              │
              ┌──────────────────┼───────────────────┐
              ▼                  ▼                   ▼
            pure           idempotent_         unsafe_on_
              │            with_key            replay
              ▼                  │                   │
         execute +               ▼                   ▼
         cache             execute w/           THROW
                           stable key +         ReplayUnsafeError
                           cache                (HITL surface)
Takeaway: Cache-hit or throw — never re-execute an unsafe tool on cache miss. The throw is a feature, not a flake.

Checkpointer Schema Migration (Issue #536)

The class of bugs around graph-runtime checkpointer behaviour shows up in production any time a minor-version deploy changes the on-disk serialization of a saved state. The agent-infra reference documents this pattern, attributing the canonical failure to the LangGraph community’s tracker as issue #536 — no built-in migration tool, serialization shifts between minor versions, in-flight runs that wrote state on Friday cannot be replayed on Monday because the deserialiser has changed underneath them [ai-gs2026, pitfall #1; §6.5]. The shape generalises beyond LangGraph: any checkpointer that pickles or json-serialises a runtime’s in-memory state will eventually break replay across a runtime upgrade.

The mitigation is a five-line playbook from the agent-infra reference [ai-gs2026, §6.5]:

  1. Versioned graph IDs. graph_id = "<workflow-name>@<semver>". Old runs stay on the version they started on. New runs get the new version. The checkpointer’s load() rejects a checkpoint whose graph_id does not match the currently deployed version unless a migrate() has been recorded.
  2. Blue/green deploy of the runtime. The new runtime version deploys alongside the old. The old keeps draining in-flight runs to completion. New traffic routes to the new version. No in-place upgrades. No “let’s just bump the dependency and ship.”
  3. Explicit stage mapping for cross-version migration. When a workflow must be migrated forward (because the old version is being decommissioned), the migration is an admin-gated operation with an explicit stage_mapping: {old_node_id: new_node_id} and an audit row recording who migrated what.
  4. Never modify a deployed graph in place. Graph changes always create a new version. The graph_id semver bump is mechanical and CI-enforced.
  5. CI test: replay last 50 production checkpoints. Every PR runs the new graph version against the most recent 50 production checkpoints. Any checkpoint that cannot load — for any reason — fails the build.

The fifth rule is the load-bearing one. Without it, the other four are guidelines on paper that get violated under deploy pressure. With it, the discipline is mechanical: if a developer changes the state schema in a way that breaks replay, CI catches it before the deploy, not after.
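Rules 1 and 3 reduce to a small guard in the checkpointer's load path. A sketch, assuming a migrations table that records audited migrate() calls and reusing the db handle from earlier:

const DEPLOYED_GRAPH_ID = 'hitl-email@2.3.0'; // "<workflow-name>@<semver>"

function loadCheckpoint(checkpointId: string): Checkpoint {
  const row = db.prepare(
    `SELECT graph_id, state_json FROM checkpoints WHERE checkpoint_id = ?`
  ).get(checkpointId) as { graph_id: string; state_json: string } | undefined;
  if (!row) throw new Error(`no checkpoint ${checkpointId}`);

  if (row.graph_id !== DEPLOYED_GRAPH_ID) {
    // Only a recorded, admin-gated migrate() makes a cross-version load legal.
    const migrated = db.prepare(
      `SELECT 1 FROM migrations
        WHERE checkpoint_id = ? AND to_graph_id = ?`
    ).get(checkpointId, DEPLOYED_GRAPH_ID);
    if (!migrated) {
      throw new Error(
        `checkpoint ${checkpointId} is ${row.graph_id}, deployed is ` +
        `${DEPLOYED_GRAPH_ID}; run migrate() with an explicit stage_mapping`);
    }
  }
  return JSON.parse(row.state_json) as Checkpoint;
}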

| Migration risk | Without playbook | With playbook |
|---|---|---|
| State serialization drift between runtime minor versions | Silent in-flight breakage on deploy | Old runs keep draining on old runtime; new runs use new graph_id |
| Pickled lambda or closure becomes unloadable | Resume fails with cryptic error mid-prod | CI replay catches at PR time |
| Node renamed without explicit mapping | Old checkpoints reference dead node IDs | migrate() requires stage_mapping; CI rejects unmapped renames |
| Cross-version tool_call_results cache stale | Resume hits an entry from a prior graph version, returns wrong result | checkpoint_id in cache key is graph-version-scoped |

Takeaway: Versioned graph IDs, blue/green runtime deploy, audited migrations, replay-50 CI test. The five-line playbook is one rule per failure mode.

The CI Pattern: Replay Last 50 Production Checkpoints

The CI replay test is the single highest-leverage piece of automation in a replay-safe harness, because it converts an entire class of integration bugs — replay correctness — into a build failure. The pattern from the reference is small and reusable [ai-gs2026, §6.5]:

# .github/workflows/checkpoint-replay.yml — pseudo-shape
- name: Pull last 50 production checkpoints
  run: ./scripts/fetch-checkpoints.sh --limit 50 --env prod \
       --out ./tests/fixtures/checkpoints/

- name: Replay against PR graph version
  run: pnpm test:checkpoint-replay
       # for each checkpoint:
       #   - load into new graph_id
       #   - execute resume() against a mock tool runner
       #   - assert no ReplayUnsafeError + no schema-load failure
       #   - assert pending_tools dispatch matches recorded cache entries

The mock tool runner is the trick that makes the test cheap. It does not call real APIs. It looks up every tool dispatch in tool_call_results and returns the recorded result if present, throws ReplayUnsafeError if absent and class is unsafe, otherwise returns a class-aware stub. The point of the test is not to validate the upstream APIs; it is to validate that the harness still knows how to resume a checkpoint after the PR’s code changes. If the PR renamed a node, mapped it incorrectly, or changed a tool’s schema in a non-backward-compatible way, the test fails on at least one of the 50 checkpoints and prints the offending one for triage.
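A sketch of the mock runner's dispatch, reusing the earlier sketches; the class-aware stub at the end is the assumption:

class ReplayUnsafeError extends Error {
  constructor(public call: PendingToolCall) {
    super(`unsafe tool ${call.name} has no cached result; refusing to replay`);
  }
}

// Replays dispatches against recorded results only; never touches real APIs.
function mockDispatch(cp: Checkpoint, call: PendingToolCall): string {
  const cached = db.prepare(
    `SELECT result_json FROM tool_call_results
      WHERE workflow_id = ? AND node_id = ? AND checkpoint_id = ?
        AND tool_name = ? AND input_hash = ?`
  ).get(cp.workflow_id, cp.node_id, cp.checkpoint_id, call.name,
        sha256(canonicalJson(call.input))) as
    { result_json: string } | undefined;
  if (cached) return cached.result_json;

  const tool = registry.get(call.name);
  if (!tool) throw new Error(`unknown tool ${call.name}`); // schema drift
  if (tool.replay_class === 'unsafe_on_replay') {
    throw new ReplayUnsafeError(call); // build fails; fixture dumped for triage
  }
  return JSON.stringify({ stub: true, tool: call.name }); // class-aware stub
}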

Fifty is the value the reference uses [ai-gs2026, §6.5]; the trade-off behind it is real even though the source does not show its work. With N=50 captured checkpoints sampled across the last 24 hours of production runs, the replay test has reasonable coverage of workflow shapes, tool-call patterns, and graph nodes hit in practice. With N=10, the test would miss the rare-path replay bugs. With N=500, the test would be too slow to run on every PR and would degrade to a nightly batch. Refresh the fixtures weekly so the test continues to catch regressions against the current production traffic shape rather than last quarter’s.

A regression caught by this test is usually one of three categories: (1) a tool’s schema changed and the cached input_hash no longer matches the new canonicalisation; (2) a graph node was renamed without a stage_mapping; (3) a state field’s serialization drifted (Pydantic version bump, Postgres JSONB encoder change). All three are fast to diagnose because the failing checkpoint is dumped to a fixture and the test re-runs deterministically against it.

Takeaway: 50 production checkpoints, replayed on every PR, against a mock tool runner. Build fails when replay correctness regresses. Weekly fixture refresh keeps the test honest.

Do This, Not That

| Pattern | Naive | Correct | Why |
|---|---|---|---|
| Resume after crash | re-dispatch pending tools blindly | look up tool_call_results, throw on unsafe miss | The window between side-effect and checkpoint-commit is the bug; the cache is the fix |
| Classifying a new tool | “we’ll figure it out at runtime” | declare replay_class at registration; CI rejects missing | Misclassification is the failure mode and it has to be visible in code review |
| Caching a pure tool | skip the cache for read-only tools | cache anyway for LLM token savings on replay walks | Pure ≠ cheap when the result is fed back through the model |
| idempotent_with_key key derivation | hash includes timestamp or retry count | hash uses only workflow-stable inputs (workflow_id, node_id, content) | A key that changes across retries is not an idem key |
| unsafe_on_replay cache miss | retry with backoff, log a warning | throw ReplayUnsafeError, surface to HITL | Auto-retry of an unsafe tool is the duplicate-email bug |
| Checkpointer migration | edit the deployed graph in place | new graph_id semver bump + blue/green + stage_mapping | In-place edits are the documented LangGraph #536 failure mode |
| Replay-test fixture set | hand-curated synthetic checkpoints | last 50 from production, refreshed weekly | Synthetic fixtures miss the rare-path bugs; production traffic does not |
| Tool result cache TTL | one global TTL | TTL by replay_class (long for unsafe/audit, short for pure) | The audit retention story differs from the token-savings story |
| tool_call_results schema | one row per call | composite key (workflow_id, node_id, checkpoint_id, tool_name, input_hash) | Each axis prevents a documented collision class |
| Atomicity of cache + checkpoint | write cache, then write checkpoint | single transaction across both | Two-write design leaves a window that replay cannot reason about |

Takeaway: Declare class at registration, cache everything, throw on unsafe cache miss, version your graphs, replay-50 in CI.

Gotchas

| Symptom | Cause | Fix |
|---|---|---|
| Cache miss on replay even though the result is recorded; tool re-runs or throws unexpectedly | input_hash computed on non-canonicalised JSON | Canonicalise JSON before hashing — sort keys, normalise number formats, strip whitespace; pin to RFC 8785 (JCS) |
| Duplicate side effects in prod despite the harness “doing the right thing” | Tool marked idempotent_with_key but upstream ignores the key | Smoke-test the upstream with two calls + same key; if duplicates appear, reclassify to unsafe_on_replay |
| Workflow resumes after a 4-day approval delay; cache has expired; unsafe tool throws | Cache TTL shorter than longest HITL approval window | TTL must exceed the longest workflow + longest pause + audit horizon — 14 days is a sane floor for HITL |
| Unsafe tool returns attacker-controlled JSON on replay even though the upstream never saw the call | Attacker (or buggy job) writes directly into tool_call_results; on replay, the cache hit wins over re-execution | Treat tool_call_results as a privileged store — guard writes with a PolicyGate, restrict the DB role used by replay-cache writes, and audit every row’s created_at against the originating span |
| Workflow appears to hang or flake; operator does not see the error surface | ReplayUnsafeError auto-retried by an upstream queue runner | Errors of class ReplayUnsafeError are non-retryable; mark them in the queue’s retry policy explicitly |
| Resume fails after a runtime version bump that changed module paths | Pickled closures or lambdas in checkpoint state | Forbid non-data state fields; checkpoint state is plain data only, executable code lives in the graph definition |
| Old checkpoints reference dead node IDs; resume crashes | Graph node renamed without stage_mapping | migrate() requires explicit mapping; CI rejects unmapped renames at PR time |
| Resume on a migrated workflow returns a stale result from prior graph version | tool_call_results not partitioned by checkpoint_id for graph-version scope | checkpoint_id is part of the primary key and is graph-version-scoped; do not collapse the axis to “save space” |
| One workflow returns the other’s cached result | Two parallel workflows with the same logical key collide | workflow_id is first in the primary key; do not drop it even when “the user only has one workflow at a time” |

Takeaway: Most gotchas reduce to: do not collapse the cache key, do not auto-retry unsafe errors, do not let TTL be shorter than the longest pause.

What Replay Safety Teaches About the Rest of the Series

Coordinator mode (chapter 04) defers the worker-crash-mid-task failure to this chapter [cci2026]; the answer is the same shape — a per-spawn result envelope plus a cache plus a throw-on-unsafe-replay branch, scaled from the single-process case to the multi-worker case. Session memory (chapter 08) only works once replay is safe: a memory layer that records observations across sessions has to be able to tell whether an observation was newly produced or just replayed, and the tool_call_results cache is the receipt that makes the distinction tractable.

Takeaway: Replay safety is the precondition for the next four mechanics in this series. Build it first, on day one rather than day fifty; nothing else the harness promises can be trusted while this layer is missing.

References

  1. [ai-gs2026] tacit-web/research/agent-infra/03-gold-seams.md and tacit-web/research/agent-infra/PLAN.md — Agent-infra gold-seams synthesis, 2026-04-27. Sections cited: §6.2 Tool Side-Effect Classification (replay_class taxonomy, tool_call_results cache schema, resume-time dispatch); §6.5 Schema Migration Playbook for LangGraph Checkpointer (versioned graph IDs, blue/green deploy, CI checkpoint-replay test); §I2 Checkpointer invariants INV-1 through INV-4 (atomic save, version rejection, audited migrate). Pitfall #1 (checkpointer schema migration breaks in-flight runs) and pitfall #2 (tool replay duplicates side effects) are the source for the failure-mode framing.
  2. [pa2025-idempotency] This site — Production Agents Deep Dive, Part 01: Idempotency & Safe Retries. Companion chapter on the Stripe pattern, key-generation rules, and error classification — the foundation that the idempotent_with_key class builds on.
  3. [pa2025-checkpointing] This site — Production Agents Deep Dive, Part 02: State Persistence & Checkpointing. Companion chapter on checkpoint shape, durable state, and the “shifts with no memory” framing from Anthropic, November 2025.
  4. [cci2026] tacit-web/research/cc-internals/src-analysis-05-agents-coordination.md — Cited from chapter 04 of this series for the worker-crash-mid-task forward reference; the failure mode coordinator mode defers is the one this chapter resolves.

Next chapter: 06 — Skills as Information Architecture, Not Features

One question for the reader: For every tool in your current harness, can you name its replay_class from memory? If “I would have to check the code,” the harness does not have the discipline yet — and the next time a worker dies mid-workflow, you will find out which class each tool actually belonged to.
