
Prompt Cache Is Architecture: Designing Around the 50K-Token Mistake

Summary

Each accidental prompt-cache break costs 50–70K tokens, silently. Claude Code treats cache stability as a load-bearing design constraint — explicit boundary markers, memoized session date, monthly-granularity tool prompts. Here is what that looks like, and why your harness should copy it.

Prerequisite: Part 7 of the Harness Engineering deep dive.

Stable prefix, volatile tail — and the four guardrails preventing accidental breaks

Cache stability is a design constraint, not a runtime concern. Each accidental break costs 50–70K tokens of redundant prefix recomputation.

Why This Matters

Most public writing about prompt caching treats it as a billing optimization — an off-by-default knob you flip on at the API layer once your bill gets big enough to notice. Claude Code treats cache stability as a load-bearing design constraint with code-level guardrails, because each accidental cache break costs ~50–70K wasted tokens of prefix recomputation, silently, every time it happens [cci2026-gems, §2].

That is the framing the open web mostly gets wrong. Treating the cache as a runtime concern — “we’ll see how it looks in the bill” — produces a harness that leaks the prefix on every turn that touches the system prompt and has no signal that anything is wrong. The bill is a lagging indicator. By the time the cost shows up, the code has already shipped, the cache-busting section has already become idiomatic, and the team is paying for the same 50–70K-token prefix recompute on every long-running conversation. Worse, the cache-write tier on Anthropic’s pricing is 1.25× the base input cost; the cache-read tier is 0.1×. The break does not just lose the 0.1× discount, it pays the 1.25× write premium on the very same tokens [anthropic-pricing]. Bill arithmetic that ignores the write premium understates the damage: a broken turn pays 12.5× what a cache hit on the same prefix would have cost.

Claude Code’s source treats this as architecture. There is an explicit byte-level boundary in the system prompt — SYSTEM_PROMPT_DYNAMIC_BOUNDARY — that splits the cacheable prefix from the volatile tail. There is a hostile-by-design helper, DANGEROUS_uncachedSystemPromptSection(), that forces a written justification on any section that recomputes every turn. The session date is memoized at process start so it cannot drift across midnight and bust the prefix. Tool prompts change at monthly granularity, not per-deploy. And tengu_prompt_cache_break telemetry tracks per-tool schema hashes, beta headers, and fast-mode state so the team can see a regression the day it lands [cci2026-gems, §2]. None of this is runtime polish. It is the shape of the code, and the shape exists because each accidental break has a fixed five-digit token price.

Takeaway: Cache stability is not a billing optimization, it is a design constraint. Each accidental break costs 50–70K tokens at the write-premium tier; the right place to enforce stability is in the code, not in the invoice.

The 50–70K-Token Hidden Bill

The number in the source is direct: “~50-70K wasted tokens per break” [cci2026-gems, §2]. To see why this number is the bill rather than a one-time penalty, the three pricing tiers have to be named explicitly. Anthropic’s prompt-caching pricing has three tiers: a cache-read tier at roughly 0.1× the base input cost, a cache-write tier at roughly 1.25× the base input cost, and a no-cache tier at the base 1.0× input cost [anthropic-pricing]. The relevant comparison is not “with cache vs without”; it is “stable cache vs broken cache, on the same prefix, across the same conversation.”

A break has two parts. First, the prefix that was previously served at 0.1× must be re-written at 1.25×, because the next request after the break is a cache-write request rather than a cache-read request — a roughly 12.5× swing on the same bytes. Second, every turn between the break and the moment the new cache entry reheats pays the no-cache or cache-write price on that prefix instead of the 0.1× read price; once the new entry is warm, the discount returns. For a 50–70K-token prefix, the immediate damage is the recomputation at 1.25× the base price; the trailing damage is the lost 0.1× discount across however many turns the prefix spends un-cached. A long conversation with a verbose system prompt amplifies the trailing damage linearly with turn count.
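To make the arithmetic concrete, here is a minimal sketch of the two-part cost model, in TypeScript to match the codebase under discussion. The tier multipliers are the published ones [anthropic-pricing]; the base per-token price, the prefix size, and the reheat window are illustrative assumptions, not measured values.

```typescript
// Hedged sketch of the break-cost model. READ/WRITE/NO_CACHE are the tier
// multipliers from [anthropic-pricing]; BASE_RATE, the prefix size, and the
// reheat window below are illustrative assumptions.
const BASE_RATE = 3 / 1_000_000; // assumed $/token (a $3-per-MTok model class)
const READ = 0.1;
const WRITE = 1.25;
const NO_CACHE = 1.0;

function breakCost(prefixTokens: number, turnsUntilReheat: number): number {
  // Immediate damage: the prefix is re-written at the 1.25x tier instead of
  // being read at the 0.1x tier -- the ~12.5x swing on the same bytes.
  const immediate = prefixTokens * BASE_RATE * (WRITE - READ);
  // Trailing damage: each un-cached turn until reheat pays base price on the
  // prefix instead of the 0.1x read price.
  const trailing =
    turnsUntilReheat * prefixTokens * BASE_RATE * (NO_CACHE - READ);
  return immediate + trailing;
}

// One break on a 60K-token prefix, three turns before the new entry is warm:
console.log(`$${breakCost(60_000, 3).toFixed(2)} per break`);
```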

This is also why the cache-stability discipline lives at the prefix level and not at the tool-call level. A harness can be efficient with tool I/O, careful with context-window growth, aggressive with summarization, and still leak the prefix on every turn because one section of the system prompt was authored to recompute. The leak does not show up as a high tool count or a verbose response. It shows up as a higher-than-expected input-token bill on every cached conversation, and the only way to find it is to instrument cache-break events themselves [cci2026-gems, §2]. Anthropic’s effective-context-engineering guide makes the same point one layer up — write/select/compress/isolate are context-engineering disciplines, but they all presuppose a stable prefix to operate against [anthropic-context2025].

Manus, writing publicly about their own context-engineering practice, names cache stability as one of the load-bearing disciplines for any long-running agent — not because it improves the bill marginally, but because the agent’s effective context window is much smaller in practice than its nominal context window if the prefix is recomputed each turn [manus2025]. The implication for an operator is structural rather than tactical. If the harness lets the prefix drift, the whole context-engineering stack on top of it is paying full price on every interaction.

WHAT A CACHE BREAK ACTUALLY COSTS — THREE-TIER PRICING
Turn N (stable cache)               Turn N+1 (after break)
                 
prefix:  50–70K tokens               prefix:  50–70K tokens
       × 0.1× input cost                    × 1.25× input cost
       = cache READ                         = cache WRITE
                                            (forced recompute)

       delta vs. cache-hit on the same prefix:
       the break charges the 1.25× write tier on tokens that
       would have been served at 0.1× — a ~12.5× swing on the
       same bytes, charged once at break + lost 0.1× discount
       on every turn until the new cache entry reheats.

Trailing damage  =  cost (no-cache or new-write tier) × turns until reheat
Immediate damage =  one recomputation at the write-premium tier

Takeaway: A break is the 1.25× write premium on 50–70K tokens, plus the 0.1× read discount lost on every subsequent turn until the new entry reheats. The bill is the lagging signal; the code is where you stop the leak.

SYSTEM_PROMPT_DYNAMIC_BOUNDARY

The Claude Code source contains an explicit boundary marker, SYSTEM_PROMPT_DYNAMIC_BOUNDARY, that splits the static cacheable prefix from any dynamic content that must be assembled per turn — defined in constants/systemPromptSections.ts and referenced through constants/prompts.ts [cci2026-gems, §2]. The marker carries a warning in the source: “Do not remove or reorder without updating cache logic.” The marker lives in the code rather than in a coding convention because conventions are the first thing a refactor breaks. A marker is grep-able, lint-able, and impossible to “tidy up” without surfacing the dependency it represents.

The boundary expresses a simple invariant. Everything before the marker has to be byte-identical across turns within a session if the cache is to hit. Everything after the marker is allowed to vary — it pays full input cost on every turn, but it does not invalidate the prefix the way a single byte-level change earlier in the prompt would. Operator-side, the rule is mechanical: if a new section needs to be added to the system prompt, decide which side of the boundary it lives on before writing a line. If the answer is “I don’t know yet,” the answer is “after the boundary, until I can prove it is stable.”
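A minimal sketch of that invariant in code, assuming a simple section list: SYSTEM_PROMPT_DYNAMIC_BOUNDARY is the attested symbol name, while the section type and the assembly function around it are illustrative, not the actual Claude Code implementation.

```typescript
// Everything before the marker must render byte-identically across turns
// within a session; everything after it may vary per turn. The marker is a
// named, grep-able symbol, not a comment.
// Do not remove or reorder without updating cache logic.
const SYSTEM_PROMPT_DYNAMIC_BOUNDARY = "<<<SYSTEM_PROMPT_DYNAMIC_BOUNDARY>>>";

interface PromptSection {
  render(): string;
}

function assembleSystemPrompt(
  staticSections: PromptSection[], // cacheable prefix: byte-identical per session
  dynamicSections: PromptSection[], // volatile tail: full input cost every turn
): string {
  return [
    ...staticSections.map((s) => s.render()),
    SYSTEM_PROMPT_DYNAMIC_BOUNDARY,
    ...dynamicSections.map((s) => s.render()),
  ].join("\n\n");
}
```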

This is why a boundary marker is more powerful than a style guide. A style guide tells the author what good practice looks like; a marker forces the author to make the cache-stability decision visible in the diff. Any two reviewers reading the same patch will see the boundary, and both will see immediately whether the patch sits in the cacheable prefix or in the volatile tail. The marker is the load-bearing artifact; the convention follows from the marker, not the other way around.

The Manus team makes the same point in their public write-up from a different angle — they argue that cache stability is a discipline rather than a feature, and that the discipline gets enforced wherever the code makes the boundary visible to the team [manus2025]. Claude Code’s boundary marker is the strongest form of that discipline because it is a named symbol in the source, not a comment.

SYSTEM PROMPT LAYOUT — BOUNDARY AS A CODE-LEVEL ARTIFACT

  STATIC CACHEABLE PREFIX                                         
                                         
  - core instructions                                             
  - tool prompts (monthly granularity)                            
  - memoized session date (frozen at process start)               
  - all sections NOT wrapped in DANGEROUS_uncachedSystemPromptSection
                                                                  
  Property: byte-identical across turns within a session          
  Pricing: cache-read tier (0.1× base input)                      

   SYSTEM_PROMPT_DYNAMIC_BOUNDARY                           
      "Do not remove or reorder without updating cache logic."    

  VOLATILE TAIL                                                   
                                                    
  - per-turn dynamic context                                      
  - DANGEROUS_uncachedSystemPromptSection(_reason) blocks         
                                                                  
  Property: allowed to vary across turns                          
  Pricing: full input cost on every turn                          


edit before the marker  →  prefix changes  →  CACHE BREAK
edit after the marker   →  prefix stable   →  cache holds

Takeaway: A boundary marker is a code-level artifact, not a convention. It makes the cache decision visible in every diff; reviewers see immediately whether a patch sits in the prefix or the tail.

DANGEROUS_uncachedSystemPromptSection(_reason)

The companion helper to the boundary marker is DANGEROUS_uncachedSystemPromptSection(). It is the only sanctioned way to add a section that recomputes every turn, and it requires a string _reason argument explaining why [cci2026-gems, §2]. The naming pattern itself is the architecture: the function is DANGEROUS_ prefixed, so every call site is grep-able and the cost is visible in code review; the _reason parameter is the documentation that a future reader needs to evaluate whether the danger is still warranted.

Three patterns are doing work in the name. First, the DANGEROUS_ prefix signals to the author at the moment of writing — there is no way to call this helper without acknowledging that the call is a deliberate exception to the cache-stability default. Second, the _reason argument is structural, not advisory; a string is required, and the function cannot be called without one. Third, the underscore on _reason marks it as an intentionally-unused-at-runtime parameter — its job is documentation, not behavior. The compiler does not enforce that the reason is good, but the code reviewer can read the call site and decide.
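A hedged sketch of the helper’s shape. The name, the required _reason string, and the underscore convention are attested [cci2026-gems, §2]; the signature, the return type, and the readGitStatus call in the usage example are assumptions for illustration.

```typescript
// Stand-in for a real per-turn read; hypothetical.
declare function readGitStatus(): string;

// The _reason parameter is required by the signature but unused at runtime;
// its job is documentation at the call site, which the underscore makes
// explicit. The hostile name keeps every call site grep-able.
function DANGEROUS_uncachedSystemPromptSection(
  _reason: string, // why this section must recompute every turn
  render: () => string, // re-evaluated on every request
): { render: () => string } {
  return { render };
}

// A hypothetical call site: the justification ships with the code forever.
const gitStatusSection = DANGEROUS_uncachedSystemPromptSection(
  "git status must reflect edits made earlier in this same turn",
  () => `Current git status:\n${readGitStatus()}`,
);
```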

This pattern generalizes well beyond Claude Code. Any always-on, easy-to-misuse feature of a harness — a tool that bypasses the policy gate, a context-window override, a retry counter that resets — benefits from the DANGEROUS_-with-_reason naming. The point is not the keyword DANGEROUS_; it is that the operation requires the author to write a justification at the call site, and that justification stays in the source for the next maintainer to evaluate. The replay-safety chapter (Ch05) established the same discipline at a different layer — replay_class is declared at tool registration and enforced at runtime [ai-gs2026, §6.2]. Cache stability is the same shape: a per-section declaration that the code enforces and the reviewer audits.

The opposite pattern — adding a section that recomputes every turn with no comment, no marker, and no audit trail — is the failure mode this naming prevents. Without the helper, the cost of a cache-busting section is invisible at the patch level; it shows up only in the bill, weeks later. With the helper, the cost is visible the first time the patch is read.

Takeaway: DANGEROUS_ prefix + required _reason argument is naming as architecture. The function’s signature forces every cache-busting section to carry its own justification, visible at the call site forever.

The Memoized Session Date

getSessionStartDate = memoize(getLocalISODate) is the canonical example of the discipline. The session date is captured once at process start and never recomputed for the lifetime of the conversation [cci2026-gems, §2]. The rationale comment in the source is direct: “Stale date after midnight vs ~entire-conversation cache bust — stale wins.”

The trade-off this comment names is worth holding in mind, because it inverts the default UX intuition. A naive implementation would call getLocalISODate() every turn, on the reasonable-sounding theory that the model should always see the correct current date. The Claude Code source takes the opposite view: a stale date for a conversation that runs across midnight is a smaller correctness problem than a cache bust on every long-running conversation. A stale date is a one-line edge case the model can recover from; an entire-conversation cache bust is a 50–70K-token recurring tax on every long conversation.

The architectural decision is “freshness loses to stability when the cost-of-freshness is a prefix recompute.” That is a strong claim and worth examining. It works because the date is rarely load-bearing for the agent’s correctness — most tasks do not depend on the precise current date, and the few that do can read it from a tool call rather than from the system prompt. By contrast, the cache prefix is load-bearing on every turn — every interaction depends on the prefix being byte-identical to the cached version. Trading a rarely-load-bearing field for an always-load-bearing one is the right call, and the memoization makes the decision physical.

The same shape — “memoize anything date-like or environment-like that would otherwise be recomputed per turn” — generalizes to every section of the system prompt. Project paths, user time zones, environment variables, model version strings: anything that is captured at process start and remains stable for the lifetime of the session is a candidate for memoization. The pattern is one line of code, and the savings compound across the conversation. The memoization is not a performance optimization at the function-call level; it is a cache-stability primitive at the prompt level.
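A minimal sketch of the pattern: the getSessionStartDate = memoize(getLocalISODate) line is attested [cci2026-gems, §2], while the memoize implementation and the extra environment examples are illustrative.

```typescript
// Generic once-only memoization: the wrapped function runs at most once per
// process, so the value cannot drift across turns within a session.
function memoize<T>(fn: () => T): () => T {
  let cached: T | undefined;
  return () => (cached ??= fn());
}

// en-CA formatting yields YYYY-MM-DD in local time.
const getLocalISODate = (): string => new Date().toLocaleDateString("en-CA");

// Attested shape: captured once at process start, never recomputed.
const getSessionStartDate = memoize(getLocalISODate);

// The same one-line pattern for other session-stable values (illustrative):
const getProjectRoot = memoize(() => process.cwd());
const getTimeZone = memoize(
  () => Intl.DateTimeFormat().resolvedOptions().timeZone,
);
```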

MEMOIZED SESSION DATE — STABILITY WINS OVER FRESHNESS

NAIVE (recompute per turn)              CLAUDE CODE (memoize once)

turn 1, 23:45  date = 2026-05-12        turn 1, 23:45  date = 2026-05-12
turn 2, 23:55  date = 2026-05-12        turn 2, 23:55  date = 2026-05-12
turn 3, 00:05  date = 2026-05-13 ⚡      turn 3, 00:05  date = 2026-05-12
               (one byte changed)       turn 4, 00:15  date = 2026-05-12
turn 4, 00:15  CACHE BREAK
               + 50–70K tokens          date captured at process start,
                 at 1.25× tier          never recomputed — no break

Rationale (from source):
“Stale date after midnight vs ~entire-conversation cache bust — stale wins.”

Takeaway: Memoize anything date-like or environment-like that would otherwise drift across a session. Freshness loses to stability when the cost-of-freshness is a 50–70K-token prefix recompute.

Monthly Granularity in Tool Prompts

Tool prompts in Claude Code change at monthly granularity, not daily and not per-deploy. The source comment is explicit about the rationale: “Changes monthly, not daily — minimizes cache busting” [cci2026-gems, §2]. The cadence is chosen for cache hit rate, not for prompt accuracy.

This is the move that is hardest to copy without internalizing the cost arithmetic, because the surface intuition runs the wrong way. Most teams would naturally update tool prompts as soon as they discover an improvement — a clearer phrasing, a new edge case, a fixed example. That instinct optimizes for tool-prompt quality. Claude Code’s instinct optimizes for cache hit rate across the user base. A weekly tool-prompt update is a weekly cache reset; a monthly cadence is twelve cache resets per year, not fifty-two. The difference is roughly a 4× reduction in cache-bust frequency for the same quality trajectory if the improvements are batched.

The discipline implied by monthly granularity is batching. Improvements to the tool prompts queue up; they ship together on a known cadence rather than as the developer notices them. The batching gate is mechanical — the system has a release cadence and tool-prompt changes are scoped to it — and the gate is enforced because every off-cadence change carries a visible cost (the bill spike) and a documented one (the tengu_prompt_cache_break telemetry event for the affected tool’s schema hash). Without the gate, the natural-but-wrong rhythm is “tweak when you notice”; with the gate, the natural rhythm is “batch and ship.”
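One way to make the batching gate physical, sketched here as an assumption rather than anything from the Claude Code source: pin a content hash of each tool prompt to the current batch stamp, and fail CI when a prompt changes without the stamp moving in the same commit.

```typescript
import { createHash } from "node:crypto";

// Hypothetical batching gate. TOOL_PROMPT_BATCH bumps on the monthly cadence;
// PINNED_HASHES is regenerated in the same commit that bumps the stamp.
const TOOL_PROMPT_BATCH = "2026-05";

const PINNED_HASHES: Record<string, string> = {
  Bash: "3f9a1c27e4b0", // placeholder values recorded when the batch shipped
  Read: "a81d44c09e72",
};

function assertPromptBatched(toolName: string, prompt: string): void {
  const hash = createHash("sha256").update(prompt).digest("hex").slice(0, 12);
  if (PINNED_HASHES[toolName] !== hash) {
    throw new Error(
      `${toolName} prompt changed outside batch ${TOOL_PROMPT_BATCH}: ` +
        `queue the edit for the next batch or bump the stamp deliberately.`,
    );
  }
}
```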

Tool prompts are also the place where the cache-vs-quality trade-off is most often resolved in the wrong direction by inexperienced harness teams. A team that prizes tool-prompt accuracy will iterate constantly; a team that prizes cache stability will discover that the marginal accuracy gain from this week’s tweak is dominated by the bill increase from the cache reset. The right calibration comes from measuring the cost ratio, and that measurement is exactly what per-tool schema hashes in cache-break telemetry provide — visible in the source via tengu_prompt_cache_break [cci2026-gems, §2].

Takeaway: Monthly granularity is a batching discipline, not a freshness ceiling. The cadence is chosen for cache hit rate; tool-prompt improvements queue and ship together on a known clock.

Cache-Break Telemetry: tengu_prompt_cache_break

The detection system is tengu_prompt_cache_break — a dedicated telemetry channel implemented in services/api/promptCacheBreakDetection.ts that tracks per-tool schema hashes, beta headers, and fast-mode state on every cache-break event [cci2026-gems, §2]. (tengu_ is Claude Code’s internal analytics prefix; every telemetry event in the codebase is tengu_*-namespaced.) Telemetry is the only way to know whether the four upstream guardrails — boundary marker, dangerous-section helper, memoized date, monthly tool prompts — are holding in production. Code review catches the obvious cache breaks; only telemetry catches the subtle ones.

The three axes the telemetry tracks each correspond to a class of regression. Per-tool schema hashes catch the case where a tool’s input or output schema changed in a way the team did not realize was prefix-affecting — a renamed parameter, a reordered enum, a new optional field. Beta header changes catch the case where an API feature flag was flipped and silently re-keyed the cache. Fast-mode state catches the case where a per-request mode flip alters the served prompt without anyone explicitly changing the prompt source [cci2026-gems, §2].

The pattern that makes this telemetry usable is that the regression signal is the delta from steady-state, not the absolute break count. Every session has some baseline of unavoidable cache misses — the first turn of a new conversation, the first request after a deploy, the first request after a TTL expiry. The signal that something is wrong is a step change in the break rate per tool-schema-hash bucket, or a sudden appearance of a new bucket the team did not create. A dashboard that simply shows “cache breaks per hour” is too noisy; a dashboard that shows “cache breaks per (tool, schema_hash) cell over time” surfaces the regression on the day it lands.
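A sketch of the per-bucket accounting this implies. The three axes mirror what the source names (per-tool schema hash, beta headers, fast-mode state) [cci2026-gems, §2]; the event shape and the alerting logic are illustrative.

```typescript
// Illustrative event shape; the axes mirror the source's telemetry.
interface CacheBreakEvent {
  tool: string;
  schemaHash: string; // fingerprint of the tool's input/output schema
  betaHeaders: string; // sorted, joined beta-header list
  fastMode: boolean;
}

// (tool, schemaHash) -> break count: the dashboard axis from the text.
const breakCounts = new Map<string, number>();

function recordCacheBreak(e: CacheBreakEvent): void {
  const bucket = `${e.tool}:${e.schemaHash}`;
  const seen = breakCounts.get(bucket) ?? 0;
  breakCounts.set(bucket, seen + 1);
  // The signal is the delta from steady-state, not the raw count: a bucket
  // nobody created deliberately is the day-one regression flag.
  if (seen === 0) {
    console.warn(`new cache-break bucket: ${bucket}`, e);
  }
}
```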

Without telemetry, the architecture rots. The boundary marker can be edged around by a well-meaning refactor; the dangerous-section helper can be wrapped in a benign-looking abstraction that hides the call site; the memoized date can be re-derived by a different code path that calls getLocalISODate() directly. Each of these regressions is invisible in code review unless the reviewer was already looking for it. The telemetry channel is what makes the architecture self-auditing — the day a new schema hash starts appearing in the break events, the team has the receipt before the bill arrives. The replay-safety chapter (Ch05) applied the same shape at a different layer: a CI test that replays recent production checkpoints catches the regression at the gate rather than in front of a paying customer [ai-gs2026, §6.5]. Telemetry plays the same role for cache stability.

Takeaway: Telemetry is the only way to keep the architecture honest. Per-tool schema hashes plus a step-change alert on the per-bucket break rate catches the subtle regressions the boundary marker and naming conventions cannot catch alone.

What Generalizes Beyond Claude Code

The four guardrails — boundary marker, dangerous-section helper, memoized environment, monthly batching — are not Claude Code specific. They are the structural moves any agent harness with a long system prompt and a long-running session needs to make.

The boundary marker generalizes to any system prompt with both static and dynamic sections. Place an explicit, named symbol in the source between them. The marker should be grep-able, lint-able, and obvious in a code review. The name should communicate the constraint, not the implementation — SYSTEM_PROMPT_DYNAMIC_BOUNDARY rather than CACHE_SPLIT_POINT. A team that reads “do not remove or reorder without updating cache logic” knows what is at stake.

The DANGEROUS_-with-_reason helper generalizes to any sanctioned-exception path. Anything that bypasses a default — a tool that skips the policy gate, a checkpoint that bypasses validation, a section that bypasses the cache — should be added through a helper whose name is hostile and whose signature requires a justification string. The pattern is not specific to caches; the discipline of “exceptions must explain themselves at the call site” is the architecture, and the cache is one application.

Memoizing date-like and environment-like sources generalizes to anything that is computed once and used many times across a session. The candidate list is short and concrete: session start date, project paths, user time zone, model version, environment fingerprint, run-id. Any of these computed per turn is a cache-bust waiting to happen. Memoize at process start; expose a tool call for the rare case where freshness is genuinely load-bearing.

Monthly batching of tool-prompt updates generalizes to any change that affects the cacheable prefix. The cadence does not have to be monthly — it has to be predictable, visible, and gated. A team that ships prompt changes whenever a developer notices something is paying for the cache resets continuously. A team that ships them on a known clock pays a small handful of times per year and budgets accordingly. The batching is what makes the cost predictable; predictability is what makes it manageable.

Telemetry generalizes to any architectural invariant the team wants to keep honest. The pattern is “instrument the boundary, alert on the delta, dashboard by the most diagnostic axis.” For caches, the diagnostic axis is the per-tool schema hash. For replay safety, the diagnostic axis is the unsafe-on-replay error rate. For skills retrieval, the diagnostic axis is the per-turn description shelf and the model’s invocation choice. The shape is the same: instrument the architectural decision, not the symptom.

Takeaway: Boundary marker, hostile-helper-with-reason, memoized environment, monthly batching, instrument-the-architecture. Five primitives, one architecture; all of it applies to any harness running a long system prompt against a long-running session.

Do This, Not That

Where to put cache-stability logic
  Naive:   Runtime feature flag, post-hoc
  Correct: Code-level boundary marker + naming conventions reviewers can grep
  Why:     The bill is a lagging indicator; the patch diff is the leading one

Adding a per-turn dynamic section
  Naive:   Insert anywhere in the system prompt
  Correct: Wrap in DANGEROUS_uncachedSystemPromptSection(_reason: "...") and place after the boundary
  Why:     The reason argument is the audit trail the next maintainer needs

Current date in the system prompt
  Naive:   Recompute per turn with getLocalISODate()
  Correct: Memoize once at process start: getSessionStartDate = memoize(getLocalISODate)
  Why:     A stale date is cheaper than an entire-conversation cache bust [cci2026-gems, §2]

Tool-prompt updates
  Naive:   Tweak when a developer notices an improvement
  Correct: Batch on a monthly cadence; ship improvements together
  Why:     12 cache resets a year beats 52 for the same quality trajectory. The one sanctioned cadence break: a security- or correctness-critical tool-prompt fix ships immediately; cache cost is never worth trading against unsafe or wrong behavior

Cache-break detection
  Naive:   Watch the input-token bill
  Correct: Telemetry per (tool, schema_hash) cell with step-change alerting
  Why:     The bill is noisy and lagging; the per-bucket delta surfaces the regression on the day it lands

Mental model of the cache
  Naive:   “Optional billing optimization”
  Correct: “Load-bearing prefix that the architecture enforces byte-identical across turns”
  Why:     The 50–70K-token break lands on every long-running conversation, not on outliers

Pricing arithmetic
  Naive:   “Cache reads are 10× cheaper; breaks lose the discount”
  Correct: Name all three tiers — read 0.1×, write 1.25×, no-cache 1.0×; a break pays the write premium and loses the read discount [anthropic-pricing]
  Why:     The write premium is the part naive arithmetic misses

Where a new system-prompt section goes
  Naive:   After the most-related existing section
  Correct: Before vs after SYSTEM_PROMPT_DYNAMIC_BOUNDARY, decided in the patch
  Why:     If the answer is “I don’t know,” the default is “after the boundary”

Boundary symbol in the source
  Naive:   A comment or a style-guide rule
  Correct: A named symbol (SYSTEM_PROMPT_DYNAMIC_BOUNDARY) with a warning attached
  Why:     Comments rot; named symbols are grep-able and lint-able

Cache-busting exception path
  Naive:   An ordinary helper named dynamicSection()
  Correct: A hostile helper named DANGEROUS_uncachedSystemPromptSection() requiring _reason
  Why:     Hostile names force the author to acknowledge the cost at the call site

Takeaway: Move every cache decision out of the runtime and into the source. Named boundary, hostile helper, memoized environment, batched updates, instrumented breaks. Five rules, all enforceable in code review.

Gotchas

Symptom: Input-token bill spikes after a “minor” system-prompt edit
Cause:   The edit landed before SYSTEM_PROMPT_DYNAMIC_BOUNDARY and changed the cacheable prefix by a single byte
Fix:     Re-locate the edit to after the boundary, or absorb it into the next monthly tool-prompt batch

Symptom: A long conversation across midnight suddenly gets expensive
Cause:   The date in the system prompt is recomputed per turn rather than memoized
Fix:     Memoize the session date at process start; treat date freshness as a tool-call concern, not a prompt field

Symptom: Per-tool schema-hash buckets multiplying in cache-break telemetry
Cause:   A tool’s schema is re-emitted with different canonicalization or field ordering across deploys
Fix:     Canonicalize the tool-schema serialization (sorted keys, normalized types) before it enters the prefix; pin one canonical form per tool

Symptom: The bill increases linearly with turn count even though the prefix “did not change”
Cause:   A dynamic section was added inside the cacheable prefix without being wrapped in DANGEROUS_uncachedSystemPromptSection
Fix:     Audit the prefix for non-deterministic content (timestamps, request IDs, env-derived strings); wrap or move it past the boundary

Symptom: Cache breaks rise sharply after an unrelated API beta-flag flip
Cause:   Beta headers are part of the cache key; the flip silently re-keyed the prefix
Fix:     Track beta headers in tengu_prompt_cache_break-style telemetry as a first-class axis; coordinate beta-flag changes with the cache-batching cadence

Symptom: A fast-mode toggle inflates input-token cost
Cause:   Fast-mode state alters the served prompt and re-keys the cache
Fix:     Treat fast-mode flips as cache-busting events; include the fast-mode flag in telemetry and budget for the warm-up cost

Symptom: A refactor “simplified” the prompt-assembly code and the bill quietly doubled
Cause:   The refactor preserved the rendered prompt but reordered the byte-level layout (e.g., moved a whitespace block)
Fix:     Cache hits are byte-identical, not semantically identical; re-render the pre- and post-refactor prompts and diff them at the byte level before merging

Symptom: Telemetry shows no breaks but the bill is high anyway
Cause:   The cache is hitting but the prefix itself is too large — the cost is the read tier × 50–70K tokens × N turns
Fix:     The cache is doing its job; the next move is to shrink the prefix (see the context-engineering disciplines in [anthropic-context2025])
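The last two gotchas share one diagnostic, sketched below under the same illustrative assumptions as the earlier snippets: render the system prompt twice with fixed inputs and byte-diff the cacheable prefix, because cache hits are byte-identical rather than semantically identical.

```typescript
// Assumed to exist per the earlier sketch; any sentinel string works.
declare const SYSTEM_PROMPT_DYNAMIC_BOUNDARY: string;

function cacheablePrefix(prompt: string): string {
  const i = prompt.indexOf(SYSTEM_PROMPT_DYNAMIC_BOUNDARY);
  return i === -1 ? prompt : prompt.slice(0, i);
}

// Fails a refactor that preserves the rendered meaning but moves bytes
// (a whitespace block, a section reorder) inside the cacheable prefix.
function assertPrefixStable(
  renderBefore: () => string,
  renderAfter: () => string,
): void {
  if (cacheablePrefix(renderBefore()) !== cacheablePrefix(renderAfter())) {
    throw new Error("cacheable prefix drifted: byte-diff the renders before merging");
  }
}
```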

Takeaway: Most gotchas reduce to: hidden non-determinism in the cacheable prefix, ordering-level changes that pass code review, and second-order axes (beta headers, fast mode) that re-key the cache without anyone editing the prompt source.

What Cache-as-Architecture Teaches About the Rest of the Series

The discipline this chapter names — make architectural invariants visible in the source, enforce them with naming, instrument the delta from steady-state — is the same discipline that shows up in the next two chapters. Session memory (see Ch08) is the same shape one layer up: the conversation accumulates observations that should be reflected back into a durable substrate without busting the prefix the next session reads from, which only works if the cache-stability primitive from this chapter is already in place. Build-your-own (see Ch11) treats the boundary-marker and dangerous-helper patterns as load-bearing for any harness that aspires to a long-running prefix. The skills chapter (Ch06) makes the same architectural argument at the retrieval layer — index in context, body on disk, harness as the gate — and the cache is the prefix-side analog of that same pattern.

Takeaway: Cache stability is a foundational invariant the rest of the series builds on. Session memory presupposes it; the build-your-own playbook makes it load-bearing; the skills retrieval gate is the same architectural shape one layer up.

References

  1. [cci2026-gems] tacit-web/research/cc-internals/src-analysis-07-hidden-gems.md, §2 “PROMPT CACHE STABILITY — Obsessive Engineering,” 2026-04-01. Direct source analysis of /Users/ketankhairnar/Downloads/claude-code-src/. Files: constants/prompts.ts, constants/systemPromptSections.ts, services/api/promptCacheBreakDetection.ts. Primary source for: ~50–70K wasted tokens per break, SYSTEM_PROMPT_DYNAMIC_BOUNDARY marker with “do not remove or reorder” warning, DANGEROUS_uncachedSystemPromptSection(_reason) helper, getSessionStartDate = memoize(getLocalISODate) with “stale wins” rationale, monthly granularity in tool prompts, and tengu_prompt_cache_break telemetry tracking per-tool schema hashes, beta headers, and fast-mode state.
  2. [anthropic-pricing] Anthropic, “Prompt caching pricing.” https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching (retrieved 2026-04). Three tiers cited inline: cache-read ≈ 0.1× base input cost, cache-write ≈ 1.25× base input cost, no-cache 1.0× base input cost. Used inline for the break-cost arithmetic in §“The 50–70K-Token Hidden Bill” and the pricing row of the Do-this-not-that matrix.
  3. [anthropic-context2025] Anthropic, “Effective context engineering for AI agents,” September 2025. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents — Cited inline as the public framing for write/select/compress/isolate as context-engineering disciplines that presuppose a stable prefix, and in the closing Gotchas row on prefix-size reduction.
  4. [manus2025] Manus, “Context Engineering for AI Agents: Lessons from Building Manus.” https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus — Public framing of cache stability as a discipline; cited inline in §“Why This Matters” and §“SYSTEM_PROMPT_DYNAMIC_BOUNDARY.”
  5. [ai-gs2026] tacit-web/research/agent-infra/03-gold-seams.md — Gold-seams source map for production-agent harness patterns. §6.2 (replay-class declaration at tool registration) and §6.5 (CI checkpoint-replay regression gate). Cited inline in §“DANGEROUS_uncachedSystemPromptSection” and §“Cache-Break Telemetry” as the cross-layer precedent for declaration-at-registration plus instrumentation-as-self-audit.

Next chapter: 08 — The Session-Memory Loop

One question for the reader: If a colleague asked you to point to (a) the boundary marker in your harness’s source, (b) the list of DANGEROUS_-wrapped sections and their _reason strings, (c) the line that memoizes your session-stable environment, and (d) the cache-break telemetry dashboard, could you? If any of the four is missing, the harness is treating the cache as a runtime concern, not as architecture — and the 50–70K-token bill is landing on every long-running conversation, silently.

Harness-engineering Ch 8/13
  1. Harness Engineering — What This Series Is, and Why You Should Read It in Order (12m)
  2. What a Harness Actually Is (and What It Is Not) (20m)
  3. The Four Primitives Every Working Agent System Has (28m)
  4. The Reasoning Sandwich: Why More Thinking Made My Agent Worse (18m)
  5. Coordinator Mode: A Working Multi-Agent System, From the Source (32m)
  6. Replay Safety: The Bug That Breaks Every HITL Workflow (26m)
  7. Skills as Information Architecture, Not Features (22m)
  8. Prompt Cache Is Architecture: Designing Around the 50K-Token Mistake (22m)
  9. The Session-Memory Feedback Loop (ACE + Codified Context) (26m)
  10. The Org-Harness Thesis: Why Context Does Not Transfer (26m)
  11. The Numbers That Killed the 'Wait for Better Models' Excuse (14m)
  12. Build Your Own Harness: A 6-Week Plan for a 3-Person Team (30m)
  13. The Ten Pitfalls (and How to See Them Coming) (20m)