I/D/E · harness-engineering

The Session-Memory Feedback Loop (ACE + Codified Context)

Summary

Your agent learns nothing between sessions, and the 80% continuity ceiling is your fault — not the model's. Two independent research lines (ACE, Codified Context) plus a LangChain receipt describe the same loop: generator → reflector → curator. +10.6% on benches. 24.2% knowledge-to-code ratio. Here is how to build it.

Prerequisite: Part 8 of the Harness Engineering deep dive. Bridges into Ch10 — Org-Context Moat.

Sessions → patterns → durable memory → better sessions

Two independent research lines plus one industry receipt converge on the same loop. Generator produces; reflector extracts; curator promotes. The loop closes back into the generator's context.

Why This Matters

Most public writing about agent memory — including the LangChain memory blog and the Letta/MemGPT framing — treats it as a storage problem. Pick a vector DB, bolt on a RAG layer, configure a persistence backend, ship. That framing is wrong in a specific and operationally costly way: storage is the easy half. The receipt comes from the loop, not the substrate. Two independent research lines plus a LangChain industry receipt converge on the same architecture. None of them are about where the bytes live. They are about how sessions feed back into the prefix the next session reads from [boh-p4].

The ceiling in production today is around 80%. The most sophisticated hybrid approach the source surveyed — MCP memory plus session replay via git plus selective CLAUDE.md notes — achieves only about 80% continuity between sessions [boh-p4]. The missing 20% is not bytes that fell off the disk. It is the rejected alternatives the developer considered and discarded, the reasoning chains that produced this week’s directory layout, the “we tried X and it broke under Y” that the commit message did not preserve. That 20% is what makes a senior engineer expensive, and it is also the part that evaporates fastest when a session ends.

The cost is bigger than most teams budget for. Anthropic frames the problem as “a software project staffed by engineers working in shifts, where each new engineer arrives with no memory of what happened on the previous shift” [boh-p4]. Each context rebuild costs 10–15 minutes; multiple switches per day compound across the team; the aggregate productivity waste per enterprise is on the order of $4.5M per year [boh-p4]. The agent rebuilds; the developer rebuilds with it; the same dead ends get re-explored on Wednesday that were closed on Monday. The shift problem is not a memory-tooling gap, it is a feedback-loop gap.

This chapter is about the loop. Two research lines describe it formally — ACE [ace-arxiv] and Codified Context [boh-p4] — and LangChain’s Trace-Driven harness work delivers the same shape as an industry receipt (harness-only changes moved one bench from 52.8% to 66.5%) [boh-p4]. They use different vocabularies, run different experiments, target different deltas, and converge on the same three-stage cycle. Generator produces session output. Reflector extracts what generalized. Curator promotes the surviving patterns into durable memory. Next session’s generator inherits the upgraded prefix. The loop closes.

Takeaway: Memory is not storage. Storage is the easy half; the loop is the receipt — generator → reflector → curator — closing back into the next session’s prefix. The rest of this chapter is how to build it.

The 80% Continuity Ceiling

The 80% number is the upper bound the source surveyed across hybrid approaches in production [boh-p4]. The breakdown matters because the missing 20% is not noise — it is the structurally hardest part to preserve, and it is also the part that decides whether the next session repeats the previous one’s dead ends.

The 80% that does carry over is the artifact layer. Code, configuration, schemas, the final state of the workspace. A session replay against the git history reconstructs what the codebase looks like now; a CLAUDE.md captures the always-loaded preamble; an MCP memory server stores extracted facts. All three of those tools store outputs. Outputs are easy. The model can re-derive the next move from an output if the move is shallow.

The 20% that doesn’t is the reasoning layer. Three categories recur in the source [boh-p4]:

  • Rejected alternatives. The developer considered three approaches and picked one. The commit message says “use X.” It does not say “Y was rejected because of the read-amplification on the hot path, and Z was rejected because of the lock contention under concurrent writes.” Next session re-considers Y, re-discovers the read-amplification problem, and the team has paid for the same discovery twice.
  • Constraint chains. “We can’t use the obvious schema because the upstream service emits this denormalized form, and the downstream service needs ordering by the field that is missing from the obvious schema.” The code reflects the workaround. The chain that made the workaround necessary is in nobody’s head after the session.
  • Compaction-discarded reasoning. Most harnesses summarize when the context window fills. Claude Code's compactor, per the source, preserves architectural decisions, unresolved bugs, and implementation details, but discards redundant tool outputs [boh-p4]. The trade-off is principled, but the casualties include nuanced reasoning, the inference chain that connected two distant facts, and the alternatives the model considered before settling on the chosen approach.

This is what the Lore paper called the “decision shadow” [boh-p4]. Every commit is the visible output of an invisible process; AI agents are now both consumers of code seeking to reconstruct intent and generators of commits that only summarize diffs. Codebases grow faster while institutional-knowledge density decreases. The 20% gap is not a measurement of how much storage you bought, it is a measurement of how much of the process the storage captured.

Takeaway: The 80% that carries over is artifact-shaped — code, config, summaries. The 20% that doesn’t is reasoning-shaped — rejected alternatives, constraint chains, inference between distant facts. No amount of additional storage closes the gap. The loop closes the gap.

The 30-Day Cliff and the Shift Problem

Claude Code deletes local session files — the .jsonl files in ~/.claude/projects/ — after 30 days by default. The setting is cleanupPeriodDays in ~/.claude/settings.json, and most users discover the deletion after the fact [boh-p4]. Thirty days is short relative to the half-life of a non-trivial codebase decision. A constraint added in week one shows up as a mysterious workaround in week six, with no transcript to explain it. The cleanup is a default chosen for disk-budget sanity; the half-life of the knowledge it deletes is set by the project, not by the default.
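
The retention half of the fix is one key in the settings file. A minimal ~/.claude/settings.json with the window widened (cleanupPeriodDays is the setting the source names; 365 is an illustrative value, not a recommendation):

  {
    "cleanupPeriodDays": 365
  }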

Three failure modes share the cliff as their root cause [boh-p4]:

  • The Shift Problem. Anthropic frames it directly: “A software project staffed by engineers working in shifts, where each new engineer arrives with no memory of what happened on the previous shift.” Each new session is a new shift; the prior shift’s notes are not on the new shift’s desk.
  • The 10–15 minute rebuild tax. Each context rebuild — re-explaining the project, re-stating the conventions, re-walking the architecture — costs 10–15 minutes per session start, and compounds across the day [boh-p4]. Five sessions a day, three rebuilds saved is ~45 minutes of senior time recovered. Across a 20-person team, that is most of a headcount paid back by closing the loop.
  • The Solo-Developer Gap. Teams generate decision artifacts as a byproduct of collaboration — Slack threads, PR discussions, meeting notes. Solo developers using an agent do not. As one claude-code issue puts it: “the conversation IS the design meeting, and it evaporates when context clears” [boh-p4]. The shift problem is amplified at N=1, because there is no team-substrate accidentally catching the reasoning.

The operator move is mechanical: extend cleanupPeriodDays, externalize the high-value parts of the transcript before the cliff fires, and treat the conversation as a first-class artifact rather than an ephemeral side channel. The deeper move is to stop treating the transcript as the substrate at all — the loop’s curator stage should be promoting reasoning into a durable layer long before the 30-day clock matters.
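
The externalization half is a cron-sized script. A minimal sketch, assuming the ~/.claude/projects/*.jsonl layout described above; the archive location and the 21-day threshold are illustrative choices, not values from the source:

  # Copy session transcripts out of Claude Code's local store before
  # the 30-day cleanup deletes them. Paths follow the layout the text
  # describes; ARCHIVE_DIR and MAX_AGE_DAYS are assumptions.
  import shutil
  import time
  from pathlib import Path

  SESSIONS = Path.home() / ".claude" / "projects"
  ARCHIVE_DIR = Path.home() / "session-archive"   # hypothetical location
  MAX_AGE_DAYS = 21                               # beat the 30-day cliff

  def externalize() -> None:
      ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
      cutoff = time.time() - MAX_AGE_DAYS * 86400
      for transcript in SESSIONS.glob("*/*.jsonl"):
          if transcript.stat().st_mtime < cutoff:
              # Copy, not move: the harness may still read from here.
              dest = ARCHIVE_DIR / f"{transcript.parent.name}--{transcript.name}"
              shutil.copy2(transcript, dest)

  if __name__ == "__main__":
      externalize()

This is the stopgap, not the loop: it preserves bytes, not reasoning. The curator stage below is what makes the archive worth reading.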

Takeaway: 30-day default cleanup deletes session files; 10–15 minutes per rebuild compounds across switches; solo developers have no Slack thread to catch the reasoning. The fix is the curator stage, not a longer retention window.

ACE: Generator → Reflector → Curator

The Agentic Context Engineering paper (arXiv 2510.04618) is the most rigorous formalization of the loop the rest of this chapter describes [ace-arxiv][boh-p4]. The three stages are named explicitly: Generator produces session output, Reflector extracts what generalized from that output, Curator decides what gets promoted into durable memory. The cycle repeats with the curator’s output flowing into the next session’s prefix.

The numbers are the receipt. ACE reports +10.6% on agent benchmarks and +8.6% on finance tasks against a baseline that runs the same model with no feedback loop in place [boh-p4]. The model did not change. The instructions for each individual session did not change. The only variable is whether the prior sessions’ patterns were promoted back into the prefix for subsequent runs. That is the same shape as the 29-to-95 receipt from skills (Ch07) and the 50–70K-token cache-stability receipt (Ch08): the architecture moved the number, not the model.

The three stages map onto operator-side roles cleanly:

  • Generator is the working agent on a given task. Its output is the session transcript plus the artifacts the session produced. The generator is the only stage that touches the user’s real problem; the other two stages are pure metadata.
  • Reflector reads the generator’s session and asks: what would have helped the next instance of this kind of task? What pattern recurred? What rejected alternative cost the most time? What constraint surfaced late? The reflector’s output is candidate patterns, not yet promoted.
  • Curator decides which candidate patterns become durable. This is the gate. The curator is what stops the always-on prefix from drifting into an unstructured pile of every observation the reflector ever generated. Promotion costs prefix-budget; demotion costs nothing.

The reflector-curator split is the load-bearing decision. A naive implementation collapses the two: extract a pattern, append it to CLAUDE.md, ship. That implementation reproduces the 80% ceiling exactly, because there is no gate between “the reflector noticed something” and “the prefix grew by another paragraph.” The prefix accumulates until it crosses the cache-stability threshold (Ch08) or the skill-shelf knee (Ch07), and the receipt collapses.

The curator’s job is editorial. It accepts patterns the reflector surfaced, rejects ones that did not generalize, and edits the survivors into the durable substrate. Editing matters because patterns the reflector emits are session-shaped — they reference the file the generator was working on, the variable name the user typed, the specific symptom that triggered the insight. The curator generalizes: same pattern, restated so it applies to the next instance of the same kind of problem, not just the instance that produced it.
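
The split is concrete enough to sketch in harness code. A minimal version: the stage names follow ACE's vocabulary, while the Candidate fields, the promotion test, and the budget default are illustrative assumptions, not the paper's specification:

  from dataclasses import dataclass

  @dataclass
  class Candidate:
      pattern: str         # session-shaped observation from the reflector
      generalized: str     # same pattern, restated without session names
      evidence_count: int  # distinct tasks the pattern actually helped

  def curate(candidates: list[Candidate],
             prefix: list[str],
             budget_lines: int = 660) -> list[str]:
      """Curator: the gate between 'the reflector noticed something'
      and 'the prefix grew by another paragraph'. The default budget
      mirrors the ~660-line hot layer discussed later in this chapter."""
      for c in sorted(candidates, key=lambda c: -c.evidence_count):
          if c.evidence_count < 2:
              continue                  # did not generalize: reject
          if len(prefix) >= budget_lines:
              break                     # promotion costs prefix budget
          prefix.append(c.generalized)  # deliberate edit, not append-all
      return prefix

  # The reflector (a model call over the transcript, prompted for
  # rejected alternatives and constraint chains) produces the
  # Candidate list; the generator never sees either stage directly.

The gate is the point: rejection is free, while promotion spends budget that every subsequent turn pays.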

ACE LOOP — GENERATOR / REFLECTOR / CURATOR

  session N                                             session N+1

  +-----------+                  +-----------+             +---------+
  | GENERATOR |   transcript +   | REFLECTOR |  candidate  | CURATOR |
  | (working  |----artifacts---->| (extracts |--patterns-->| (gate + |
  |  agent)   |                  |  patterns)|             |  editor)|
  +-----------+                  +-----------+             +---------+
        ^                                                       |
        | durable memory (prefix) feeds session N+1's GENERATOR |
        +-------------------------------------------------------+

Receipt (ACE, arXiv 2510.04618):
  +10.6% on agent benchmarks
  +8.6% on finance tasks
  same model both runs; the loop is the variable.

Failure mode: collapse REFLECTOR + CURATOR into one step.
              The prefix bloats; the 80% ceiling holds.

Takeaway: ACE names the three stages and ships the receipt — +10.6% / +8.6%, same model. The reflector-curator split is the load-bearing decision; collapse them and the prefix bloats until cache stability or shelf-knee dynamics eat the gain.

Codified Context: 24.2% Knowledge-to-Code Ratio

Codified Context (arXiv 2602.20478) studies the same loop empirically on a single project [boh-p4]: 283 sessions across 70 days on a 108K-line C# codebase. It reports a knowledge-to-code ratio of 24.2% — for every four lines of code, the project produced roughly one line of codified reasoning. The ratio is the budget, and it is high enough that operators who have never measured it usually under-budget by an order of magnitude.

The paper names three stages of its own, a different cut through the same loop ACE formalizes:

  • Detection. Agent confusion signals a missing spec. The generator stalls, asks for clarification, or produces a candidate that violates an unwritten constraint. The signal is the trigger to codify, not the developer’s hunch; a minimal detector is sketched after this list.
  • Codification. Knowledge is documented in the same session as the implementation. This is the discipline that produces the 24.2% ratio — the codifier writes while the reasoning is fresh, not in a separate “documentation sprint” that never happens.
  • Maintenance. Biweekly 30–45 minute review passes audit the codified context against the codebase. Stale entries are pruned, drifted ones are corrected, gaps surfaced by recent confusion signals are filled. The cadence is calendared because the alternative — audit when convenient — collapses to never.
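
Detection can be mechanical. A minimal sketch; the signal phrases are stand-ins for whatever stall patterns a given harness's agent actually produces:

  # Scan a session transcript for confusion signals that should
  # trigger same-session codification. The phrase list is illustrative;
  # a real harness would match its own agent's stall patterns.
  CONFUSION_SIGNALS = (
      "which of these did you mean",   # stall: asks for clarification
      "could not find a spec",         # missing-spec admission
      "violates the constraint",       # verifier output, if one exists
  )

  def needs_codification(transcript: str) -> bool:
      text = transcript.lower()
      return any(signal in text for signal in CONFUSION_SIGNALS)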

The receipt that anchors the 24.2% ratio is the save-system result [boh-p4]. After documenting the save-system specification once, that document was referenced in 74 sessions and 12 agent conversations, enabling consistent application across all of them with zero persistence-related bugs. One pass through the codification stage; 86 downstream sessions inheriting the work; zero regressions of the class the document was written to prevent. That is the compounding the loop produces when the curator stage fires.

Three operator implications follow:

  • The 24.2% ratio is the budget, not the upper bound. Teams that have never measured the ratio of codified reasoning to shipping code typically run at single digits — 2–5% — and they pay for it as the 80% ceiling. The 24.2% is what a well-run codified-context discipline produces on a 108K-line codebase over 70 days; it is the steady state of the loop when the discipline holds.
  • Drift detection lives in the maintenance stage. Codified Context describes a context-drift detector that flags source changes without corresponding spec updates [boh-p4]. The detector is the only thing that prevents the codified layer from rotting silently. It is the analog of the cache-break telemetry from Ch08: instrument the architectural invariant, alert on the delta. A sketch of the detector follows this list.
  • Codification happens in-session, not after. The single biggest predictor of whether codification ever happens is whether it is treated as part of the implementation or as a separate task. Same-session codification produces the 24.2% ratio; deferred codification produces zero.
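
A sketch of the drift detector, assuming specs live in a specs/ directory with a one-spec-per-module naming convention (both are assumptions; the paper describes the behavior, not this layout):

  # Flag specs whose corresponding source moved after the spec last
  # changed: drift as a diff, not a feeling. Requires a git checkout.
  import subprocess
  from pathlib import Path

  def last_commit_epoch(path: str) -> int:
      out = subprocess.run(
          ["git", "log", "-1", "--format=%ct", "--", path],
          capture_output=True, text=True, check=True,
      ).stdout.strip()
      return int(out) if out else 0

  def drifted(source_dir: str = "src", spec_dir: str = "specs") -> list[str]:
      """Return specs whose source moved after the spec last changed."""
      stale = []
      for spec in Path(spec_dir).glob("*.md"):
          module = Path(source_dir) / spec.stem  # specs/save-system.md <-> src/save-system
          if last_commit_epoch(str(module)) > last_commit_epoch(str(spec)):
              stale.append(str(spec))
      return stale

Wire the output into the biweekly pass: the calendar supplies the cadence, the detector supplies the worklist.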

CODIFIED CONTEXT — DETECTION / CODIFICATION / MAINTENANCE

  session in flight

  generator stalls / asks / violates constraint
        |
        v  (Detection: confusion is the trigger)

  +------------------------------------------------+
  | CODIFICATION (same session)                    |
  |  - write the missing spec while it is fresh    |
  |  - bind to the code that motivated it          |
  +------------------------------------------------+
        |
        v

  +------------------------------------------------+
  | MAINTENANCE (biweekly, 30–45 min)              |
  |  - audit codified context vs current code      |
  |  - drift detector flags un-updated specs       |
  |  - prune stale, correct drifted, fill new gaps |
  +------------------------------------------------+

Receipt (Codified Context, arXiv 2602.20478):
  283 sessions, 70 days, 108K-line C# codebase
  knowledge-to-code ratio: 24.2%
  one save-system spec → 74 sessions + 12 agent convos
  persistence-related bugs in the reuse window: 0

Operator number to track: % of sessions that produced a codified
artifact in the same session. Below 20% → the loop is not running.

Takeaway: 24.2% knowledge-to-code is the budget the loop needs. Detection on confusion, codification in-session, maintenance on the calendar, drift detector on the source. One save-system spec compounds across 74 sessions with zero regressions — same shape as ACE, measured on a real codebase.

Memory as Filesystem: The Shape Both Lines Imply

Both ACE and Codified Context describe the architecture in their own vocabulary, but the durable substrate they each presuppose is the same: a filesystem-shaped memory the agent can list, read, and verify. The agent has structured paths; the harness gates retrieval; the namespace is browsable the way a project directory is browsable. Neither paper names this property as its centerpiece, but the load-bearing artifact in both — ACE’s “Curator” output, Codified Context’s “specification documents” — sits on disk as named files the agent can address.

The reason filesystem-shaped memory works is the reason every other index-in-context, body-on-demand primitive in this series works. A filesystem-shaped memory has three properties:

  • A small, retrievable index — directory listings, file names, summaries the agent reads first.
  • Large, on-demand bodies — the full content of any specific note, fetched only when the agent asks for it by name.
  • A gate — the harness, which decides what is in the index, what is on disk, and what the agent is allowed to read in this scope.

That is the same architecture as the skills layer (Ch07), restated for memory. It is also the same architecture as the prompt cache (Ch08), restated for the durable substrate rather than the prefix. The reason it works at the memory layer is the reason it works at every other layer: the candidate set the agent reasons over is the index, not the body, so the working set stays small at any moment while the catalog can grow large.
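
A minimal sketch of the three properties as an API, assuming memory lives in a .memory/ directory whose top-level subdirectories are scopes (the layout is an assumption; the papers describe the properties, not a path scheme):

  from pathlib import Path

  class MemoryFS:
      """Index-in-context, body-on-demand, harness-as-gate."""

      def __init__(self, root: Path, allowed_scopes: set[str]):
          self.root = root               # e.g. Path(".memory")
          self.allowed = allowed_scopes  # the harness decides, per task

      def index(self) -> list[str]:
          """Small, retrievable index: names only, read every turn."""
          return sorted(
              str(p.relative_to(self.root))
              for scope in self.allowed
              for p in (self.root / scope).glob("**/*.md")
          )

      def read(self, name: str) -> str:
          """Large, on-demand body: fetched only when asked by name."""
          if name.split("/")[0] not in self.allowed:
              raise PermissionError(name)  # gate: out of scope for this task
          path = (self.root / name).resolve()
          if self.root.resolve() not in path.parents:
              raise PermissionError(name)  # gate: no escaping the namespace
          return path.read_text()

The agent lists, then reads by name; a note that does not exist is a visible gap the agent can reason around, not a similarity-retrieval guess.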

The contrast with the alternatives is what makes the architecture load-bearing:

  • Opaque memory blobs (vector DB, monolithic CLAUDE.md, single-row MCP key-value) put the body in context every time the index is consulted. The agent pays attention cost on bytes it did not request and may not need. This is the eager-loading failure mode from skills, restated [chroma-rot].
  • Pure RAG retrieves chunks by embedding similarity without an agent-visible index. The agent cannot list what is available, cannot browse, cannot verify that a particular note exists. Retrieval becomes a guess; the agent has no way to fail-soft to “no, this note is not in scope, I will reason without it.”
  • Filesystem-shaped memory lets the agent list, read targeted files, and verify gaps. The retrieval is the agent’s own decision over a visible namespace. The harness is the gate, but the namespace is browsable.

The other property a filesystem-shaped memory inherits naturally is versioning. If memory lives in git-trailer commits, in dated spec files, or in ~/.claude/projects/*.jsonl snapshots, the agent (and the auditor) can see when a note changed, why, and what the prior version said. Versioned memory is what makes Codified Context’s maintenance stage auditable — drift is a diff, not a feeling.
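
Git's trailer mechanism is one concrete substrate for the decision layer. The trailer token and the commit message here are illustrative, not a convention from the source:

  git commit -m "route writes through the queue" \
    --trailer "Decision: rejected in-place locking; contention under concurrent writes"

git log --format="%(trailers:key=Decision)" then recovers the decision history as data rather than archaeology.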

The shape matters specifically for multi-step tasks, which is the regime where the doom loop (next section) is most expensive. A multi-step task has internal state the agent needs to carry across steps; if the state lives only in the context window, the agent re-derives it after every compaction. If the state lives in a filesystem-shaped memory the agent can read, the state survives.

Takeaway: Memory-as-filesystem is the same index-in-context, body-on-demand, harness-as-gate primitive from Ch07 (skills) and Ch08 (cache prefix) — applied to the durable substrate. Versioned, browsable, gated. The architecture is what makes the agent’s memory survive multi-step tasks.

Where the Two Lines Agree

ACE and Codified Context were written by different teams, target different deltas, and do not cite each other. They converge on the same architecture anyway, which is the strongest evidence in the source that the loop is real rather than an artifact of one team’s writeup. LangChain’s Trace-Driven harness work — published as an industry receipt rather than as a framework paper — ships the same shape and the same kind of result (a harness-only change moved one bench from 52.8% to 66.5%) [boh-p4].

Two axes of agreement matter operationally:

The three-stage cycle. ACE names generator / reflector / curator [ace-arxiv]. Codified Context names detection / codification / maintenance [boh-p4]. The vocabulary is different; the shape is identical. A team building this once does not need to pick one — the two are overlapping cuts through the same architecture, and LangChain’s trace-analyzer-skill implementation is the working version of the same loop applied to error traces rather than to specifications [boh-p4].

Three layers of memory. The source synthesizes the architecture into three explicit tiers [boh-p4]:

  • Hot memory (always loaded). CLAUDE.md, system prompts, constitution files. Hardened, verified patterns. For the 108K-line C# codebase the source studied, this layer was ~660 lines.
  • Warm memory (on-demand). Specialized agent specs, domain expert configs, recent session summaries. Triggered by file patterns or explicit queries.
  • Cold memory (searchable archive). Full session transcripts, git history with decision trailers, specification documents. Accessed through retrieval services.

The 660:108,000 ratio is the load-bearing budget — the always-loaded layer is roughly 0.6% of the codebase, not 6% or 60%. Hot memory pays the cache-stability price (Ch08) and the skill-shelf knee price (Ch07) on every turn, so it has to stay small. Warm memory is the filesystem-shaped substrate the previous section described — large, browsable, gated. Cold memory is the substrate the maintenance stage of Codified Context audits against.
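
The hot-memory budget is checkable in CI. A minimal sketch; the file location and the failure mechanism are assumptions, while the 0.006 ratio is the source's 660:108,000:

  from pathlib import Path

  HOT_BUDGET_RATIO = 0.006  # ~660 lines against a 108K-line codebase

  def hot_memory_over_budget(hot_file: Path, codebase_lines: int) -> bool:
      """True when the always-loaded layer outgrows its share."""
      hot_lines = sum(1 for _ in hot_file.open())
      return hot_lines > codebase_lines * HOT_BUDGET_RATIO

  # e.g. fail the build: hot_memory_over_budget(Path("CLAUDE.md"), 108_000)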

What each line adds the other does not:

  • ACE adds the formal cycle and the benchmark deltas. It is the cleanest statement of the loop as an abstract architecture and the only one of the two with bench numbers against a no-loop baseline.
  • Codified Context adds the in-session discipline and the empirical ratio. The 24.2% knowledge-to-code ratio and the save-system result anchor the loop in a real codebase over 70 days, which the more abstract treatment does not.

ACE tells you the stages and the abstract receipt. Codified Context tells you the discipline and the ratio. LangChain’s harness work tells you it ships against a production bench. Build the stages, hold the discipline, and the loop closes against your own bench rather than a paper’s.

Takeaway: Generator/reflector/curator (ACE) = detection/codification/maintenance (Codified Context). Three tiers — hot ~0.6% of codebase, warm on-demand, cold archived. Two frameworks plus one industry receipt are cuts through one architecture, not three architectures.

Do This, Not That

  • Memory substrate.
    Naive: vector DB or a monolithic CLAUDE.md, treated as storage.
    Correct: three tiers — hot (~0.6% of codebase) always-loaded, warm filesystem on-demand, cold archived with retrieval.
    Why: storage is the easy half; the receipt is from the loop closing back into the prefix [boh-p4].

  • Reflector and curator.
    Naive: collapse into one step (“extract a pattern, append to CLAUDE.md”).
    Correct: two distinct stages — the reflector emits candidates, the curator gates promotion and edits for generality.
    Why: collapsing produces unbounded prefix growth and reproduces the 80% ceiling [ace-arxiv][boh-p4].

  • When to codify.
    Naive: a documentation sprint, after the feature ships.
    Correct: in the same session as the implementation, triggered by detection signals (confusion, stalls, constraint violations).
    Why: same-session codification produces the 24.2% knowledge-to-code ratio; deferred codification produces zero [boh-p4].

  • Audit cadence for codified context.
    Naive: quarterly review or “when we have time.”
    Correct: a biweekly 30–45 minute pass plus an automated drift detector on source changes.
    Why: the detector catches silent rot; the calendar cadence ensures the audit fires [boh-p4].

  • Memory API surface.
    Naive: opaque get/set with a key.
    Correct: filesystem semantics — list, read, write, version, scope. When copying breaks: if the agent class below GPT-4o/Sonnet 4.5 cannot reliably traverse a multi-path namespace, flatten to a single tier and treat the gain as conditional on model class.
    Why: list-read-write-version semantics let the agent browse and verify the namespace rather than guessing, which is what Codified Context’s maintenance stage and ACE’s curator stage both presuppose [boh-p4].

  • Session retention.
    Naive: accept the 30-day cleanup default.
    Correct: extend cleanupPeriodDays and externalize high-value reasoning before the cliff via the curator stage.
    Why: the cliff is fine if the curator already promoted what mattered; a pure retention extension defers the problem, it does not solve it [boh-p4].

  • Compaction.
    Naive: trust the default summarizer to preserve reasoning.
    Correct: treat compaction as lossy on reasoning chains and route rejected alternatives and constraint chains through the reflector before the window fills.
    Why: the compactor preserves architectural decisions and bugs; it discards inference between distant facts [boh-p4].

  • What goes in hot memory.
    Naive: “anything the team thinks is important.”
    Correct: hardened, verified patterns only; budget capped near 0.6% of codebase.
    Why: hot memory pays the cache-stability price every turn; bloat there bypasses every gain from Ch08.

  • Solo developer workflow.
    Naive: conversation = design meeting, evaporates on clear.
    Correct: the reflector runs at session end; the conversation transcript is a first-class artifact, not a side channel.
    Why: the solo gap is the absence of an accidental team-substrate; the loop has to be explicit because there is no Slack thread catching it [boh-p4].

Takeaway: Three tiers, reflector/curator split, in-session codification, biweekly audit with drift detector, filesystem semantics on the substrate. The matrix is what separates a harness running the loop from one paying the 80% ceiling.

Gotchas

  • Symptom: hot memory grows every week; cache hit rate drops.
    Cause: Auto Memory–style promotion with no curator gate; reflector candidates promoted unconditionally.
    Fix: insert a curator stage that rejects candidates that did not generalize; cap the hot-memory budget near 0.6% of codebase; treat promotion as a deliberate edit, not an append [boh-p4].

  • Symptom: the agent iterates 10+ times on the same approach without progress.
    Cause: the doom loop — the agent lacks implicit memory of its action history within the session [boh-p4].
    Fix: instrument per-file edit counts in the harness; surface the count back to the agent after N edits; route into a verification pass before the next attempt (a sketch follows this list).

  • Symptom: /remember-style auto-extraction “captures” things but the agent doesn’t improve.
    Cause: auto-memory captures surface-level corrections (“use camelCase”), not deeper understanding (“this validator runs before the auth check because the JWT decode happens at the gateway”) [boh-p4].
    Fix: author reflector-stage prompts that surface reasoning chains and rejected alternatives, not just the deltas the user typed; the captured signal has to match the shape of the missing knowledge.

  • Symptom: compaction fires and the next message loses the “we already considered Y” context.
    Cause: the default compactor preserves architectural decisions, unresolved bugs, and implementation details, but discards redundant tool outputs and nuanced reasoning chains [boh-p4].
    Fix: pre-empt the compactor — route rejected alternatives and constraint chains through the reflector into warm memory before the window fills, so the durable layer carries the reasoning the compactor will drop.

  • Symptom: a solo developer rebuilds the same context every session.
    Cause: the conversation was the design meeting, and it evaporated on context clear [boh-p4].
    Fix: run an explicit reflector at session end (or a /diary-style invocation); the solo gap is the absence of a team-substrate, and the only fix is to instantiate one deliberately.

  • Symptom: the 200K context window fills faster than expected.
    Cause: roughly ten files plus a 30-minute conversation can exceed 100K tokens; older information gets pushed out silently [boh-p4].
    Fix: treat the context-window budget as a cache-stability problem (Ch08) plus a memory-tier problem — move warm-memory contents out of the always-loaded prefix; surface them via filesystem semantics on demand.

  • Symptom: the codified spec says use library X; the codebase already migrated to Y.
    Cause: no drift detector; the maintenance cadence is the calendar, not the source [boh-p4].
    Fix: run a drift detector that flags source changes without spec updates; tie the audit to the commit that changed the underlying convention, the way the skills audit ties to the convention’s commit (Ch07).

  • Symptom: the knowledge-to-code ratio is in single digits.
    Cause: codification is treated as a separate task and never gets prioritized.
    Fix: codify in the same session as implementation; track the ratio as a leading indicator; below 20% means the loop is not running [boh-p4].
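
The doom-loop fix above is concrete enough to sketch. The threshold, the nudge wording, and where the harness calls record() are all illustrative:

  from collections import Counter

  EDIT_LIMIT = 5  # illustrative; the source reports 10+ iteration loops

  class EditTracker:
      """Per-file edit counts, surfaced back to the agent past a limit."""

      def __init__(self) -> None:
          self.edits: Counter[str] = Counter()

      def record(self, path: str) -> str | None:
          """Call on every file edit; returns a nudge when the limit trips."""
          self.edits[path] += 1
          if self.edits[path] == EDIT_LIMIT:
              return (f"You have edited {path} {EDIT_LIMIT} times this "
                      "session. Run the verification pass and restate the "
                      "approach before editing this file again.")
          return None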

Takeaway: Auto-memory ≠ learning; doom loops are an action-history gap; compaction is lossy on reasoning chains; the solo gap requires an explicit reflector; context-window overflow is faster than intuition. Most gotchas reduce to “the loop is not running, and the symptoms only show up in the bill or the bench.”

What Session-Memory Teaches About the Rest of the Series

The loop named here is the same architectural primitive Ch07 and Ch08 describe, restated at the durable-substrate layer. One pattern, three layers, three different receipts (29→95 pass rate from skills, 50–70K-token bill from a single cache break, +10.6%/+8.6% bench deltas from the closed loop). Ch10 — Org-Context Moat picks up the cold-memory tier as the team’s durable asset and asks what compounds when the loop runs across the team rather than across one developer’s sessions.

Takeaway: Index-in-context, body-on-demand, harness-as-gate is the load-bearing primitive across the series. Skills apply it to descriptions, cache applies it to prefix bytes, session memory applies it to the durable substrate. Ch10 picks up the moat thread.

One question for the reader: Could you point to (a) the reflector that runs at session-end in your harness, (b) the curator gate that decides what gets promoted to hot memory, (c) the drift detector that fires when a source change does not have a matching codified update, and (d) the line that caps hot-memory budget near 0.6% of the codebase? If any of the four is missing, the harness is treating memory as storage rather than as a loop — and the 80% ceiling is the receipt.

References

  1. [boh-p4] tacit-web/research/building-org-harness/phase4-session-memory.md — Phase 4 source map, dated 2026-05. Primary source for: the 30-day cleanup default and cleanupPeriodDays setting (§1, line 5); the Shift Problem framing from Anthropic (§1, line 9); the 10–15 minute per-rebuild cost (§1, line 11); the Auto Memory ≠ Learning finding from Brent W. Peterson (§3, line 37); the Solo-Developer Gap framing from claude-code issue #15222 (§3, line 41); the 80% continuity ceiling on hybrid approaches (§3, line 45); ACE’s framework formalization and +10.6%/+8.6% deltas (§5, lines 85–86); Codified Context’s three-tier system and 24.2% knowledge-to-code ratio across 283 sessions / 70 days / 108K-line C# codebase (§5, lines 88–95); LangChain’s Trace-Driven 52.8%→66.5% receipt (§5, line 98); the $4.5M/year enterprise productivity waste (§7, line 123); the save-system result — 74 sessions + 12 agent conversations with zero persistence-related bugs (§7, line 131); the doom-loop mechanics and 10+ identical-approach iterations (§8, line 137); the three-layer memory architecture synthesis — hot ~660 lines for 108K-line codebase, warm on-demand, cold archived (Key Synthesis, lines 156–160).
  2. [ace-arxiv] Agentic Context Engineering (ACE), arXiv 2510.04618. https://arxiv.org/abs/2510.04618 — Generator / Reflector / Curator formalization. Cited inline for the three-stage cycle and the +10.6% agent-benchmarks / +8.6% finance-tasks deltas as the cleanest abstract statement of the loop.
  3. [chroma-rot] Chroma Research, “Context Rot.” https://research.trychroma.com/context-rot — Documents long-context retrieval degradation and the candidate-set effect on natural-language selection. Referenced in the filesystem-memory framing as the broader context-window evidence that motivates browsable substrate over always-loaded blobs.

Next chapter: 10 — The Org-Context Moat

Harness-engineering Ch 9/13
  1. Harness Engineering — What This Series Is, and Why You Should Read It in Order (12m)
  2. What a Harness Actually Is (and What It Is Not) (20m)
  3. The Four Primitives Every Working Agent System Has (28m)
  4. The Reasoning Sandwich: Why More Thinking Made My Agent Worse (18m)
  5. Coordinator Mode: A Working Multi-Agent System, From the Source (32m)
  6. Replay Safety: The Bug That Breaks Every HITL Workflow (26m)
  7. Skills as Information Architecture, Not Features (22m)
  8. Prompt Cache Is Architecture: Designing Around the 50K-Token Mistake (22m)
  9. The Session-Memory Feedback Loop (ACE + Codified Context) (26m)
  10. The Org-Harness Thesis: Why Context Does Not Transfer (26m)
  11. The Numbers That Killed the 'Wait for Better Models' Excuse (14m)
  12. Build Your Own Harness: A 6-Week Plan for a 3-Person Team (30m)
  13. The Ten Pitfalls (and How to See Them Coming) (20m)