Prerequisite: Part 6 of the Harness Engineering deep dive. Companion piece on the content of skills (what to put inside one): Encoding the Senior Engineer in the Room — a Design Memo for Tacit Skills. This chapter is about the retrieval mechanics (when and how the harness loads them).
Closed cabinet: descriptions only in context. Open drawer: one full body loaded on demand. The architecture is the lever.
Why This Matters
Most public writing about “skills” treats them as a feature you enable. Click a checkbox, drop a markdown file in a folder, ship a starter pack. The framing carries an implicit assumption: more skills available means a more capable agent. On the public benchmark that established skills as a primitive, the same Claude Code with the same task suite moved from a 29% pass rate to a 95% pass rate based on nothing but which markdown files were in the context, and when [lch-skills2026]. The model did not change. The instructions did not change. The architecture of how the instructions were retrieved changed.
Three framings dominate the open web, all wrong. The first is “skills are tools by another name.” This conflates two distinct mechanisms — tools have a calling convention and a return value, skills are instructional prose loaded into the system prompt — and obscures the only reason skills work, which is that they exploit both lazy loading and natural-language retrieval where tools exploit neither. The second is “skills are just system-prompt fragments.” This collapses the two-stage retrieval (description-in-context, body-on-demand) into a single bucket and treats the architecture as if it were an organizational nicety rather than a load-bearing performance lever. The third is “more skills means a more capable agent.” This is the marketing framing — count the skills, list them in a feature comparison — and it inverts the actual mechanic, because a context full of eagerly-loaded skill bodies degrades retrieval, raises cost, and produces the worse of the two numbers in the receipt.
Skills are an information-architecture layer of the harness. They sit between the static system prompt and the dynamic context window. They have a public surface — the description field — and a private surface — the full body. The harness owns the gate that decides when to open the second one. Treat skills as architecture, not features, and the 29-to-95 receipt becomes operationally obvious. Treat them as features, and you can ship a hundred of them while leaving most of the gain on the table.
Takeaway: Skills are not a feature. They are the information-architecture layer of the harness. The retrieval mechanic — description-in-context, body-on-demand — is what produces the receipt.
The 29% → 95% Number, Decoded
The number comes from LangChain’s “Skills” write-up, published March 2026 [lch-skills2026]. Same Claude Code build, same defined task suite, same instruction content — the only variable is whether the skill files were loaded eagerly into the system prompt or kept in the file system with only their descriptions surfaced to the model. Eager-load run: 29%. Lazy-load run: 95%. The delta is the largest single receipt on the harness-engineering side of the public bench landscape [hwc2026, Finding 4]; chapter 10 of this series treats it as the largest of six receipts and uses it as the high-water mark for “harness change, model held constant” [see also /deep-dives/harness-engineering/10_numbers_that_prove_it].
The decomposition matters because the surface comparison is unintuitive. Two reasonable observers can both read “Claude Code went from 29 to 95” and infer that the new build added a new capability — a tool, a model upgrade, a new system prompt. None of that is what happened. The capability was already present in both runs. The instructions for using the capability were already written. The only change was whether those instructions sat inside the context window from the first token or arrived in the context window only when the model asked for them by name.
That distinction is the entire mechanic. Eager loading pays the cost of every skill on every turn — token cost on the input bill, attention cost when the model has to skim past skill bodies that have nothing to do with the current task, retrieval cost when the relevant fragment is buried inside a hundred kilobytes of irrelevant prose. Lazy loading pays the cost only when the skill is needed. The model sees a small set of one-paragraph descriptions, picks the one whose description matches the task, asks the harness to open the body, then operates inside the body. The harness, not the model, owns the gate.
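The gate can be sketched in a few lines. This is a minimal illustration of the mechanic, not Claude Code's implementation (which is not public); `Skill`, `SkillGate`, and `read_file` are hypothetical names:

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str          # retrieval key the model invokes by
    description: str   # always in context (the shelf)
    body_path: str     # on disk; read only on invocation

class SkillGate:
    """Harness-side gate: descriptions are eager, bodies are lazy."""

    def __init__(self, skills):
        self.skills = {s.name: s for s in skills}
        self.loaded = {}  # bodies opened during the current task

    def shelf(self):
        """What the model sees on every turn: names and descriptions only."""
        return "\n".join(
            f"- {s.name}: {s.description}" for s in self.skills.values()
        )

    def invoke(self, name, read_file):
        """Runs only when the model asks for a skill by name."""
        if name not in self.skills:
            return None  # discovery gap: log it rather than guess
        if name not in self.loaded:  # load the body at most once per task
            self.loaded[name] = read_file(self.skills[name].body_path)
        return self.loaded[name]
```

The point of the sketch is the asymmetry: `shelf()` is called every turn and stays cheap; `invoke()` touches the filesystem only on an explicit request by name.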
A second-order detail explains why the magnitude is so large. The skills surface is a discovery layer. The model is not searching a vector database, it is reading short prose that the harness has placed where the model expects instructions to be. The retrieval is the model’s own decision, made in natural language, against a small enough set of candidates that the retrieval is almost always correct. The instant the candidate set grows large enough that “almost always correct” becomes “frequently wrong,” the receipt would collapse. The architecture is what keeps the candidate set small at any moment in time.
Takeaway: Same model, same task suite, same instruction content. The receipt is from where and when the instructions lived. Eager load: 29. Lazy load: 95.
Progressive Disclosure as a Primitive
Progressive disclosure is borrowed from UX research — Nielsen Norman’s original framing — and it names a specific pattern: surface the smallest amount of information that allows a user to make the next decision, defer everything else until the user requests it [boh-p1, §3]. The Claude Code skills implementation is the canonical application of that pattern to agent harnesses. Anthropic’s own authoring guide describes a three-level disclosure model with author-side budgets at each level: metadata kept tight enough that many descriptions can sit in context at once (the guide targets ~100 tokens per skill description, always in context), full instructions kept compact enough to fit comfortably in a single turn (the guide targets under 5,000 tokens, loaded when invoked), and bundled resources — referenced files, code, templates — loaded only if the body asks for them [anth-skills-bp]. The numbers are author targets, not runtime-enforced ceilings; they exist to keep the shelf cheap and the bodies focused.
The harness owns the gate at every level. At level one, the harness scans a known set of paths — user, project, plugin-provided, built-in — and assembles the descriptions into the system prompt. The model never sees the body at this stage. At level two, the model produces a request — by name — for a specific skill. The harness reads the file, places the body into the next context window, and the model continues. At level three, the body’s instructions reference auxiliary files (a regex pattern reference, a code template, an example output); those load only if the model’s working trajectory requires them.
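The three levels can be sketched against an in-memory stand-in for the skills directory. The layout assumed here (a `skills/<name>/SKILL.md` file with a `resources/` folder beside it) is an illustration of the disclosure model, not a spec for any particular harness:

```python
# In-memory stand-in for the on-disk skills tree; paths are hypothetical.
FS = {
    "skills/draft-rfc/SKILL.md": (
        "Draft an RFC from a one-paragraph feature brief.\n"
        "...full instruction body, kept under ~5,000 tokens...\n"
        "Template: resources/template.md\n"
    ),
    "skills/draft-rfc/resources/template.md": "# RFC-NNN: <title> ...",
}

def level1_shelf(fs):
    """Level 1, metadata only: first line of each SKILL.md, always in context."""
    return {path.split("/")[1]: text.splitlines()[0]
            for path, text in fs.items() if path.endswith("SKILL.md")}

def level2_body(fs, name):
    """Level 2: full instructions, loaded when the model invokes by name."""
    return fs[f"skills/{name}/SKILL.md"]

def level3_resource(fs, name, rel_path):
    """Level 3: bundled resource, loaded only if the body references it."""
    return fs[f"skills/{name}/{rel_path}"]
```

Each level is a separate read with a separate trigger: level one runs at assembly time, level two on invocation, level three only when the body's trajectory reaches the reference.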
The mental model that fails for this is “the system prompt is one big string.” Under that mental model, you measure a system prompt by its token count and you decide what to put in it by what is “important enough.” Progressive disclosure inverts that. The system prompt is two layers: a small, always-loaded index of capabilities, and a much larger body that is reachable from the index but not present in it. The index lives in context. The bodies live in the filesystem. The harness is the retrieval mechanism between them.
```
time ────────────────────────────────────────────────────▶

   turn 1              turn 2                 turn 3
     │                    │                      │
     ▼                    ▼                      ▼
┌──────────┐      ┌──────────────┐        ┌──────────┐
│  model   │      │ model picks  │        │  model   │
│  reads   │ ───▶ │ skill by name│ ─────▶ │ executes │
│  INDEX   │      │   harness    │        │  inside  │
│ (~200T)  │      │  loads BODY  │        │   BODY   │
└──────────┘      └──────────────┘        └──────────┘
                         │
                         │  on next unrelated turn:
                         ▼
                 ┌──────────────────┐
                 │ BODY drops out;  │
                 │ INDEX stays      │
                 └──────────────────┘

INDEX = description shelf (always in context, cheap)
BODY  = full instruction body (in filesystem, loaded on demand)
GATE  = the harness (decides when to open a body)
```
The disclosure pattern is what makes the harness, not the model, the architectural unit. A model with a 200,000-token context window does not save you from eager loading: it just lets you make the mistake at a larger scale. The harness is the only place in the stack where the retrieval gate lives, because the harness is the only component that knows what is on disk, what is currently in context, and what the model just asked for.
Takeaway: Progressive disclosure is not a sizing decision. It is a primitive — index in context, body in filesystem, harness as the gate. Anything that puts the body in context defeats the primitive.
Description-as-API-Surface
The description field of a skill is the public interface. It is what the model sees when scanning for capability. It is what the model uses to decide whether a skill applies to the current task. If the description does not match the situation, the skill might as well not exist — the model will never request the body. The description, not the body, is the API surface, and the implication for operators is direct: write descriptions for retrieval, not for humans.
This makes a skill description structurally analogous to a function signature in a typed API. A function signature tells the caller what the function does without showing them the implementation. The caller picks the right function by reading signatures. They do not skim implementations and pick by appearance. The same applies to skills: the model picks skills by reading descriptions, and it does not skim bodies first. A description that buries the action in a paragraph of context is a function whose name does not tell you what it does — present but invisible.
The first sentence of a description does most of the work. The model scans descriptions densely; the first sentence is what gets sampled. A description whose first sentence reads “Used for analyzing data” might match any skill that touches data. A description whose first sentence reads “Convert a SQL query into a pandas dataframe pipeline” is unambiguous about the situations where it applies. The second sentence can carry preconditions, scope, anti-patterns; the model uses those to disambiguate between similar skills. Beyond two or three sentences, the description starts paying the same eager-loading tax that the body would. The discipline is a paragraph at most, ideally tight enough to read in a single beat.
Naming is the other half of the surface. The skill name is what the model invokes by; the description is what it picks by. Together they form the retrieval key. Verb-first names work because tasks are verbs (“draft”, “validate”, “summarize”, “translate”). Noun-only names like “Postgres” or “API” force the model to read the description to know what the skill does, which costs more tokens and produces noisier retrieval. The companion piece on the content of skills [tacit-skills] argues the same point about questioning discipline inside a skill: structure produces the result, not prompt length. Description-as-API-surface is the structural rule that applies to the boundary.
Takeaway: The description is the API. Verb-first names, single-paragraph descriptions, first-sentence summary, written for retrieval not for humans. Treat the description like a function signature, and the body like the implementation behind it.
Why “Fewer Tools Beat Many Tools” Generalizes to Skills
“Fewer tools beat many tools” is not a universal law — it is an empirically anchored heuristic about decision cost. The mechanism is the candidate set: every additional tool in scope on a turn adds a small disambiguation tax to every other tool the model is choosing between. Anthropic’s effective-context-engineering guide argues the same point as a context-engineering principle, not as a hard rule — keep the active tool set small and sharply defined, because overlapping tools force disambiguation, rarely-used tools pay the always-on context tax for nothing, and ambiguous schemas degrade decision accuracy on adjacent tools [anth-ctx2025]. The harness-engineering SSR cites this as Finding 4 and treats it as a generalizable mechanism rather than a fixed prescription [hwc2026, Finding 4]. The receipt that anchors the same mechanism applied to skills is the 29-to-95 number; the LangChain write-up frames it explicitly as the tools heuristic restated for instructions [lch-skills2026]. Two surfaces, same candidate-set dynamic, same receipt shape.
The mechanism is the same in both cases. A model deciding between two options pays a small attention cost. A model deciding between twenty options pays an attention cost that grows non-linearly: each additional option degrades the model’s certainty on every other option, because the candidate set itself is one of the things the model is reasoning about. Tools have schemas; the model reads the schema and decides. Skills have descriptions; the model reads the description and decides. The number-of-candidates dynamic is identical, and so is the failure mode — too many candidates, none of them clearly the right one, model picks wrong.
What makes the skills case more forgiving than the tools case is the lazy-loading floor. With tools, every tool’s schema sits in context for every turn; there is no analog to “keep only the description, defer the body.” Skills decouple the candidate-set size from the working-set size. You can have fifty skills available on disk and still keep only the five or ten descriptions in context that are relevant to the current task — because the harness, not the model, decides which descriptions to surface. The decoupling is the architectural payoff.
This is also why “more skills available” is a misleading metric. If the harness keeps all fifty descriptions in context all the time, you have re-invented the eager-loading problem at one layer up. The right metric is the working-set: how many descriptions are in context at the moment the model is deciding, and how tightly are those filtered to the task. A harness that surfaces descriptions conditionally — by project, by file type, by the user’s invocation pattern — beats a harness that surfaces all of them and a harness that surfaces none. The receipt comes from filtered descriptions plus lazy bodies, not from the lazy bodies alone.
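One way to sketch the filtering layer. The keyword-overlap scoring below is a deliberately toy stand-in for the conditional surfacing described above (by project, file type, invocation pattern); `working_set` and `cap` are hypothetical names:

```python
def working_set(shelf, task_context, cap=10):
    """Filter the description shelf before it reaches the model.

    shelf: list of (name, description) pairs available on disk.
    Returns at most `cap` descriptions with any relevance to the task.
    """
    task_words = set(task_context.lower().split())
    scored = []
    for name, description in shelf:
        overlap = len(task_words & set(description.lower().split()))
        if overlap > 0:  # drop descriptions with no relevance signal at all
            scored.append((overlap, name, description))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(name, d) for _, name, d in scored[:cap]]
```

The catalog can grow without bound; only what `working_set` returns pays the context tax on this turn.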
```
Ten skills available. Each body ~3,500 tokens. Each description ~150 tokens.

EAGER LOADING (anti-pattern)
────────────────────────────
┌─────────────────────────────────────────────────────────────┐
│ [skill body × 10]                   ~35,000 tokens consumed │
│ ████████████████████████████████████████████████████████░░░░│
└─────────────────────────────────────────────────────────────┘
→ no headroom for task content; receipt regresses

LAZY LOADING (the receipt)
──────────────────────────
┌─────────────────────────────────────────────────────────────┐
│ [descriptions × 10]                           ~1,500 tokens │
│ [active body × 1]                             ~3,500 tokens │
│ ████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │
│ ~5,000 tokens consumed; headroom preserved for task content │
└─────────────────────────────────────────────────────────────┘
→ 7× cheaper, retrieval correct, receipt holds

Working set is what counts, not catalog size.
```
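The arithmetic behind the diagram, with the same assumed sizes (ten skills, ~3,500-token bodies, ~150-token descriptions):

```python
N_SKILLS, BODY_TOKENS, DESC_TOKENS = 10, 3_500, 150

eager = N_SKILLS * BODY_TOKENS                   # every body, every turn
lazy = N_SKILLS * DESC_TOKENS + 1 * BODY_TOKENS  # full shelf + one active body

print(f"eager={eager:,} lazy={lazy:,} ratio={eager / lazy:.0f}x")
```

The ratio scales with catalog size: each added skill costs a full body under eager loading but only a description under lazy loading.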
Takeaway: Fewer-tools-beat-many is a context-budget principle. Skills inherit it. Lazy loading lets you have a large catalog and a small working set at the same time — which is the only way to avoid re-inventing eager loading at the skills layer.
Naming for Retrievability
The skill is retrieved by name and description. Naming is therefore one half of the public surface, and getting it wrong silently breaks the receipt. The pattern that works is a verb-first name that compresses to a short phrase, paired with a description whose first sentence states the action and the typical input. The pattern that fails is a noun name that the model has to translate into an action.
Verb-first names are read as commands by the model: summarize-pr, validate-schema, draft-rfc, audit-permissions. The model reads the name and immediately knows it can be invoked when the current task is a PR summary, a schema validation, an RFC draft, or a permissions audit. Noun-only names like pull-request or schema force the model to first identify the implied action, then map that action to the situation — an extra inference hop on every retrieval. Over a turn, that overhead is invisible; over a thousand turns, it produces measurable retrieval drift.
The single-paragraph description discipline matters because the model is reading the description, not embedding it. There is no vector similarity step; there is no reranker. The model parses prose and decides. A description that reads “Use this skill to do thoughtful work on documents. Considers the user’s intent and applies appropriate analysis.” is structurally invisible — it could fit any task. A description that reads “Convert a SQL query into a pandas dataframe pipeline. Use when the user provides a query string and asks for analysis in Python. Not for declarative reporting or BI dashboards.” names the action, the input, the situation, and the boundary. The same number of tokens; the second one is retrievable, the first is not.
Examples of when to use a skill should appear in the description, not only in the body. The model uses examples to disambiguate between similar skills. Two adjacent skills — analyze-logs and parse-logs — are distinguished by their examples more than by their names. The description for analyze-logs cites “a stack trace with a partial root cause” as the canonical input; parse-logs cites “a raw log file that needs to be tokenized into structured rows.” The examples are doing retrieval work that the names cannot do alone.
Anti-patterns recur. A description that opens with “This skill helps the user with…” wastes the first sentence on a frame the model does not need. A description that lists every supported input format in the first paragraph drowns the action. A skill named after the file format it consumes (pdf, csv) tells the model nothing about what to do with the format. A description that ends with “Please use this skill responsibly” is the LLM-tell of a generated description and produces nothing the model can retrieve on. Cut all four.
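Because the anti-patterns are mechanical, they can be linted for. A sketch with a hypothetical starter set of checks; `lint_description` and `FORMAT_ONLY_NAMES` are illustrative names, and a real linter would grow its rule list from review findings:

```python
# File-format nouns that say nothing about the action; extend as needed.
FORMAT_ONLY_NAMES = {"pdf", "csv", "json", "xlsx"}

def lint_description(name, description):
    """Return a list of retrieval problems; an empty list means the checks pass."""
    problems = []
    text = description.strip().lower()
    if text.startswith("this skill helps"):
        problems.append("first sentence wasted on a frame")
    if "use this skill responsibly" in text:
        problems.append("generated-description tell")
    if name.lower() in FORMAT_ONLY_NAMES:
        problems.append("name is a file format, not an action")
    return problems
```

Run it over the whole catalog in CI so a new skill cannot land with a structurally invisible description.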
Takeaway: Verb-first name. First sentence states action + input. Examples in the description carry the disambiguation work. Anti-patterns are mechanical and recur — read your own descriptions out loud and cut anything that does not narrow retrieval.
The Three Skill Failure Modes
Three distinct failure modes recur in production. Each has a different symptom and a different fix. Conflating them is the most common operator mistake.
Wrong skill loaded (description mismatch). The model invokes a skill that is adjacent to the task but not the right one — parse-logs when the user asked for analyze-logs, validate-schema when the user asked for draft-schema. The symptom is a body that the model has to reinterpret on the fly, with output that bears the shape of the wrong skill’s structure. The fix is at the description, not the body: the two descriptions overlap, and the disambiguator is missing. Add examples to both descriptions that distinguish the canonical inputs.
No skill loaded when one applies (discovery gap). The model produces a from-scratch answer when a skill in the filesystem would have produced a better one. The symptom is silent — the output is worse than it should be, and the operator does not see that a skill was bypassed unless the harness logs which skills were considered. The fix is two-step. First, instrument the harness to log the description-shelf contents and the model’s invocation choice on every turn (this is the analog of the cache-break telemetry from chapter 07). Second, if the right skill was on the shelf and not picked, the description is mis-keyed for the situation — rewrite the first sentence. If the right skill was not on the shelf at all, the filtering layer is excluding it — relax the filter.
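The instrumentation in the first step is cheap to add. A sketch of a per-turn log line; the field names are assumptions, and `sink` stands in for whatever append-only channel the harness already has:

```python
import json
import time

def log_retrieval_turn(sink, shelf, invoked, task_hint=""):
    """Emit one JSON line per turn: shelf contents plus the model's pick.

    sink: any callable that accepts a string (file.write, list.append, ...).
    invoked: skill name the model requested, or None if no skill was picked.
    """
    sink(json.dumps({
        "ts": time.time(),
        "shelf": [name for name, _ in shelf],
        "invoked": invoked,           # None is the discovery-gap signal
        "task_hint": task_hint[:80],  # truncated; keep full prompts out of logs
    }))
```

A turn where the right skill was on the shelf and `invoked` is None is exactly the mis-keyed-description case; a turn where it was absent from `shelf` points at the filter.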
Right skill loaded but stale or contradictory. The model picks the right skill, opens the body, and the body contains instructions that no longer match the project’s reality. The symptom is output that follows the skill but produces a regression — the skill said to use library X, the codebase has moved to Y. The fix is operational, not architectural: skills are durable artifacts, and the harness that hosts them needs an audit cadence. Tie the audit to the artifact, not to the calendar — when the codebase changes the convention the skill encodes, the skill goes stale on that commit. Chapter 09, on org-context, makes the case that this audit cadence is the moat-side of the same architecture.
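The audit-on-commit discipline can be wired into CI with a small mapping from each skill to the paths whose conventions it encodes. The mapping and names below are hypothetical; the mechanic is the point:

```python
# Each skill declares the repo paths whose conventions it encodes; a commit
# touching any of them flags the skill for human audit.
SKILL_WATCHES = {
    "validate-schema": ["db/migrations/", "schema/"],
    "draft-rfc": ["docs/rfcs/"],
}

def stale_on_commit(changed_paths, watches=SKILL_WATCHES):
    """Return the skills a commit may have made stale."""
    return sorted(
        skill
        for skill, prefixes in watches.items()
        if any(path.startswith(prefix)
               for path in changed_paths for prefix in prefixes)
    )
```

Feed it the changed-file list from the CI event and open an audit task for every skill it returns; staleness becomes an event, not a quarterly chore.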
Two upstream modes round out the picture. The first is model class: below the GPT-4o / Sonnet 4.5 tier, natural-language retrieval is unreliable and the model picks by surface similarity. The receipt was generated against Claude Code, which sits at the top of that class; downgrading the model without rewriting descriptions tighter is the failure mode behind a lot of “skills did not work for us” reports. The second is shelf size. There is a knee somewhere in the mid-tens of descriptions where even a top-class model stops doing reliable action-matching and starts surface-matching — the same dynamic that context-rot research catalogues for retrieval over long context [chroma-rot]. The fix is the working-set discipline already named: filter descriptions before they reach the shelf, push less-relevant skills behind a coarser router. The receipt is silent above the knee; the operator playbook is to instrument and stay below it.
Takeaway: Wrong skill, no skill, stale skill. Three failure modes, three different fixes. Description mismatch, discovery gap, audit cadence. Conflating them is the most common operator mistake — the wrong fix on the wrong mode wastes a sprint.
What Generalizes Beyond Skills
The information-architecture principle that makes skills work generalizes to every other always-on component of the harness. The pattern is the same in each case: keep a small, retrievable index in context, defer the body to a filesystem-like substrate, let the harness gate the retrieval.
Tools follow the same pattern. The four-primitives chapter argues that the active tool set should be small, schema-tight, and disambiguated by name [see /deep-dives/harness-engineering/02_four_primitives]. The receipt anchoring “fewer tools beats many tools” is the tools-side analog of the 29-to-95 number. Same mechanism, different surface. The right move at the tools layer is not to ship a single mega-tool, it is to keep the candidate set small at any moment in time — exactly the principle the skills layer expresses through its description shelf.
Sub-agents follow the same pattern. The coordinator-mode chapter shows that a coordinator does not see its workers’ transcripts — it sees a description on spawn and a summary on return [see /deep-dives/harness-engineering/04_coordinator_mode]. The full body of the worker’s reasoning lives in the worker’s isolated context; the coordinator gets the description. Same architecture: index in context, body in another scope, harness as the gate.
Memory follows the same pattern. The session-memory chapter argues for memory-as-filesystem: a hierarchy of files where the agent sees the index — a directory listing, summaries, headers — and opens the body only when a specific note is required [see /deep-dives/harness-engineering/08_session_memory_loop]. This is the GCC paper’s contribution made architecturally explicit. The 29-to-95 number lives at the skills layer; the +29 percentage points on SWE-Bench-Lite from GCC lives at the memory layer; the mechanism behind both is the same.
Prompts themselves follow the same pattern at a smaller scale. A system prompt can be modular — common preamble always present, situational sections gated by routing. CLAUDE.md files in a Claude Code project work this way: the top of the file is always in context, deep sections are referenced and loaded conditionally. The companion blog post on tacit skills [tacit-skills] argues the structural point: every degree of freedom you remove improves the output. Information architecture is the discipline that removes degrees of freedom on retrieval without removing them on capability.
Takeaway: The IA principle applies to tools, sub-agents, memory, and modular prompts. Index-in-context, body-on-demand, harness-as-gate. Every layer of the harness is either honoring that pattern or paying the eager-loading tax.
Do This, Not That
| Pattern | Naive | Correct | Why |
|---|---|---|---|
| Where the skill body lives | Eager-load all bodies into the system prompt | Body on disk, description in context, harness opens body on invocation | The 29-to-95 receipt is exactly this swap [lch-skills2026] |
| What to put in the description | Whatever helps a human understand the skill | First sentence states action + canonical input; one paragraph max; examples for disambiguation | The description is the API surface; humans are not the audience |
| Skill naming | Noun after the domain (postgres, api) | Verb-first action (audit-permissions, draft-rfc) | The model reads names as commands; verbs are read in one hop |
| Skill catalog growth | Add a skill for every new use case | Add a skill only when the description is unique and the body is reusable | More skills means a larger candidate set unless filtering is also tightened |
| Which descriptions are surfaced | All of them, always | Filtered by project, file type, or invocation context | Catalog size is not the same as working-set size |
| Audit cadence for stale skills | Quarterly review on the calendar | Audit on the commit that changes the convention the skill encodes | Skills are durable artifacts; staleness is event-driven |
| Discovery gap diagnosis | “The model is not smart enough” | Instrument the harness — log description shelf + invocation choice per turn | Discovery gap is silent without telemetry; symptom is “worse output,” not “error” |
| When to add an example to a description | Once the body is final | Whenever two adjacent skills share a verb or domain | Examples carry the disambiguation work names cannot do |
| Sub-skill resources (level 3) | Inline into the body | Reference by path; load only when the body’s trajectory requires it | The three-level disclosure model exists for a reason [anth-skills-bp] |
| Bench claims for skills | “Skills make the agent better” | “Lazy-loaded skills produced N → M on suite S, same model” | The mechanism is the architecture, not the catalog |
Takeaway: Description-is-API, verb-first naming, working-set ≠ catalog, audit on the commit not the calendar. The table is the operator playbook.
Gotchas
| Symptom | Cause | Fix |
|---|---|---|
| Skills “did not move the number” on our bench | All skill bodies eagerly loaded; the harness has no body-on-demand gate | Move bodies to disk; surface descriptions only; load the body inside the turn the model invokes the skill |
| Model picks the adjacent skill, not the right one | Two descriptions share their first sentence; examples are missing | Rewrite first sentence of both to name the disambiguating input; add one canonical example per description |
| Right skill loaded; output follows a stale convention | Audit cadence is the calendar, not the commit | Tie the skill’s audit to the commit that changes its underlying convention; treat staleness as an event, not a schedule |
| Output regressed after we added more skills | Description shelf grew unbounded; filtering layer absent | Filter descriptions by project / file type / invocation context; the working set, not the catalog, is what the model sees |
| Skills “work in dev, fail in prod” | Dev instance loads a small set; prod loads a curated list of fifty | Re-test the receipt on the prod working set; if the prod set is 50 descriptions, the candidate-set dynamic dominates |
| First-sentence summaries are all generic (“This skill helps with…”) | Descriptions were authored by humans, for humans, not for retrieval | Strip frames; first sentence is action + input; examples carry the situation; cut anti-patterns mechanically |
| Discovery gap invisible — output is bad but no error | Harness does not log the description shelf or invocation choice | Instrument: every turn, log the shelf contents and which skill the model picked (or chose to skip) |
| Retrieval quality cliffs after the shelf passes some N | Shelf-size knee — past a working-set threshold (mid-tens of descriptions for the Sonnet/GPT-4o class on the public bench evidence), natural-language retrieval starts surface-matching rather than action-matching; context-rot research describes the same degradation for the broader context window [chroma-rot] | Stop adding descriptions to the shelf at the project tier. Push less-relevant skills to a deeper tier and route to them by a coarser filter (file-type, invocation pattern) before they reach the model |
| A new “harmless” skill silently re-routes existing tasks | Description injection — a description authored to match too many situations (or, adversarially, to claim the model’s attention) wins retrievals it should not | Treat every description as part of the public surface area. Code-review descriptions the same way you review tool schemas. Log routing decisions and watch for drift after any description add |
Takeaway: Most gotchas reduce to: bodies on disk, descriptions filtered, audit on the commit, telemetry on the shelf, shelf-size kept below the knee, descriptions code-reviewed like schemas. The architecture is the lever; the operator discipline keeps it sharp.
What Skills-as-IA Teaches About the Rest of the Series
The retrieval gate that makes skills work is the same gate that makes the prompt cache work — both depend on a stable prefix and a deterministic decision about what to load when [see /deep-dives/harness-engineering/07_prompt_cache_as_architecture]. The skill catalog is also a durable artifact in the org-harness sense — every skill is a piece of codified context the team owns and the model does not [see /deep-dives/harness-engineering/08_session_memory_loop]. Chapter 09 picks up the moat-side of that thread.
Takeaway: Skills are the smallest unit of org-harness context that has a published mechanic and a public receipt. Get the architecture right here, and the same primitive carries into cache, memory, and the moat.
References
- [lch-skills2026] LangChain, “Skills,” March 2026. https://blog.langchain.com/langchain-skills/ — Claude Code task pass rate 29% → 95% with progressive-disclosure Skills loaded; same model across both runs. Source #5 in the harness-engineering SSR source map; primary anchor for the 29-to-95 receipt cited inline.
- [hwc2026] Harness-engineering SSR — tacit-web/research/harness-engineering-deep-agents-ssr.md, Phase 4 Finding 4 (progressive disclosure as transformative; “fewer tools outperform many tools” finding applied to instructions). Internal research synthesis dated 2026-03-10. Used inline as the cross-domain framing source.
- [anth-skills-bp] Anthropic, “Skill authoring best practices.” https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-practices — Three-level progressive disclosure: metadata (~100 tokens per skill, always in context), full instructions (under 5,000 tokens, loaded on invocation), bundled resources (loaded if the body requires them). Used inline for the disclosure-model framing and as the official source for the per-skill token budgets.
- [anth-ctx2025] Anthropic, “Effective context engineering for AI agents,” September 2025. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents — Write/Select/Compress/Isolate framework and the principle of small, sharply-defined toolsets; cited as the public source for the fewer-tools-beat-many framing this chapter generalizes to skills.
- [tacit-skills] This site — Encoding the Senior Engineer in the Room — a Design Memo for Tacit Skills. Companion piece on the content of skills (structure, questioning discipline, output format). This chapter is the retrieval mechanics companion; both ship side by side.
- [boh-p1] tacit-web/research/building-org-harness/phase1-frameworks-tools.md — Source map for the skills literature, including the Nielsen Norman UX origin of progressive disclosure (via the SwirlAI write-up cited in §3 of phase 1).
- [chroma-rot] Chroma Research, “Context Rot.” https://research.trychroma.com/context-rot — Documents the degradation curve of long-context retrieval and the candidate-set effect on natural-language selection. Cited inline as the empirical analog for the shelf-size knee described in the Gotchas table.
Next chapter: 07 — Prompt Cache Is Architecture: Designing Around the 50K-Token Mistake
One question for the reader: If a colleague asked you to demonstrate the skills layer in your harness, could you point to (a) the on-disk path that holds the bodies, (b) the description shelf the model actually sees on a given turn, and (c) the log line that records which skill was invoked? If any of the three is missing, the harness is treating skills as a feature, not as architecture — and the receipt is leaking on every turn.