Flue Under the Hood: Why This Agent Harness Holds
Senior engineers have seen enough frameworks die to mistrust the launch sentence. The interesting question is not “what does this framework do?” The interesting question is: what bet does it make that will still make sense after the API churn?
Flue calls itself “The Agent Harness Framework.” Its README says it is like Claude Code, but headless and programmable: no TUI, no GUI, just TypeScript. That sentence is a good hook, but it is not the reason Flue is interesting.
The reason is the boundary it chooses:
Flue owns the harness layer and rents the model loop.
That is the design decision that makes the framework worth studying. Flue
does not try to become another model SDK. It lets @mariozechner/pi-ai
and @mariozechner/pi-agent-core carry provider metadata, streaming
semantics, tool-call execution, model catalogs, and the lower-level agent
loop. Flue builds above that: sessions, skills, roles, sandboxes,
compaction, run records, deployment targets, and the public HTTP surface.
That split is what turns “Claude Code, but headless” from a slogan into an architecture.
The model loop is the rented backend. The sticky framework surface is the harness control plane around it.
The model loop decides the next step. The harness owns the durable runtime behavior around that loop.
The Snapshot
I checked the repo through the GitHub API on 2026-05-16. The current
main head I used is dbaa9eff, the merge commit for
PR #130, “Add run registry,
admin API, and OpenAPI specs.” The runtime package is
@flue/runtime at version 0.5.3; @flue/sdk remains as a separate
client/migration surface. The README still warns that Flue is
experimental and APIs may change.
So treat every file path here as a versioned observation, not a permanent contract. The learning goal is the method: read the runtime boundary, then reason about features from that boundary.
The files that matter for the mental model:
.flue/agents/*.ts
│ handler(ctx) calls ctx.init(...)
▼
┌────────────────────────────────────────────────────────┐
│ Harness │
│ packages/runtime/src/harness.ts │
│ owns sessions, fs, env, open child sessions │
└──────────────┬─────────────────────────────────────────┘
│ session()
▼
┌────────────────────────────────────────────────────────┐
│ Session │
│ packages/runtime/src/session.ts │
│ owns SessionHistory, prompt(), skill(), task(), shell() │
└───────┬─────────────────────┬──────────────────────────┘
│ wraps │ uses
▼ ▼
┌─────────────────┐ ┌────────────────────────────────┐
│ pi-agent-core │ │ compaction.ts │
│ Agent │ │ threshold + overflow recovery │
└────────┬────────┘ └────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ agent.ts + sandbox.ts │
│ built-in tools, task tool, SessionEnv, SandboxApi │
└────────────────────────────────────────────────────────┘
deploy surface:
┌────────────────────────────────────────────────────────┐
│ cli build plugins + runtime/flue-app.ts │
│ Node, Cloudflare, run registry, /runs/:runId, OpenAPI │
└────────────────────────────────────────────────────────┘
That is the article in one diagram. Your agent file is thin. The harness is where the product lives.
The Ubiquitous Language
The best way to teach Flue is not as “a pile of TypeScript files” but as a domain model for headless agent execution. Domain-driven design gives us the right move: define the language first, then read the code through that language.
Here is the vocabulary I would use across every Flue training, blog post, diagram, and deep dive:
| Term | Meaning in the Flue domain | Why it matters |
|---|---|---|
| Agent file | User-authored TypeScript handler under .flue/agents. | It declares the job, but should not own runtime mechanics. |
| Invocation | A request to run an agent with an identity, input payload, and runtime context. | It is the external command entering the domain. |
| Harness | The aggregate root for a run-capable agent environment: sessions, filesystem, env, sandbox, and child sessions. | It is where Flue owns behavior instead of delegating to the model SDK. |
| Session | A scoped execution conversation with tools, roles, model settings, and persisted history. | It is the unit of agent work. |
| Session history | The tree of entries behind a session, with an active path and leaf. | It makes replay, compaction, task sessions, and deletion explainable. |
| Active path | The path from root to current leaf that becomes the model-visible conversation. | It separates stored history from runnable context. |
| Tool contract | A schema plus runtime behavior that the model can rely on. | It is not documentation; it is a promise the harness must keep. |
| Session environment | The capability boundary exposed to tools: exec, file operations, sandbox behavior. | It is where agent intent meets runtime reality. |
| Provider seam | The boundary where Flue resolves models, provider settings, API keys, and payload overrides before pi-ai executes. | It lets Flue rent the model loop without surrendering harness semantics. |
| Compaction | A state transition that replaces old active-path context with a summary entry. | It is context management, replay safety, and failure recovery. |
| Run | A concrete execution record produced by an invocation. | It gives headless work an inspectable identity. |
| Run registry | The lookup surface that maps run IDs back to agent instances and stores. | It replaces the missing human watching a terminal. |
| Build target | The generated runtime shape for Node, Cloudflare, CI, or another host. | It makes deployment part of the framework contract. |
If a training uses these words consistently, the source code becomes much easier to read.
Agent Definition Context
  .flue/agents/*.ts
        │ declares intent
        ▼
Harness Context
  init(), sessions, fs, env, child sessions
        │ opens work scope
        ▼
Session Context
  SessionHistory, active path, roles, compaction
        │ asks for model/tool steps
        ├──────────────► Provider Context
        │                pi-ai, pi-agent-core, provider payloads
        │
        ├──────────────► Execution Context
        │                tools, SessionEnv, SandboxApi
        │
        └──────────────► Observation Context
                         run store, run registry, events, OpenAPI

Deployment Context wraps the whole system:
  Node, Cloudflare, CI, generated entries, bindings, config
This language matters because naming drift creates bad training. If “session” sometimes means transcript, sometimes run, and sometimes HTTP request, the reader cannot build a mental model. In Flue, those are different domain objects. A session owns conversation state. A run owns execution identity. A provider owns model transport. A sandbox owns runtime capabilities. The framework holds because those words do not collapse into one another.
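One way to keep those words from collapsing is to give each one its own type. The sketch below is illustrative only: every type, field, and helper name is hypothetical and none of it is taken from Flue's source. It shows a run pointing at a session without the two identities being interchangeable.

```typescript
// Hypothetical shapes for illustration; none of these names come from Flue.
type SessionId = string & { readonly __brand: 'SessionId' };
type RunId = string & { readonly __brand: 'RunId' };

// A session owns conversation state.
interface SessionRef { kind: 'session'; id: SessionId }
// A run owns execution identity and points at the session it ran in.
interface RunRef { kind: 'run'; id: RunId; sessionId: SessionId }

const asSessionId = (value: string): SessionId => value as SessionId;
const asRunId = (value: string): RunId => value as RunId;

function describeRef(ref: SessionRef | RunRef): string {
  switch (ref.kind) {
    case 'session':
      return `session ${ref.id}`;
    case 'run':
      return `run ${ref.id} (in session ${ref.sessionId})`;
  }
}

const exampleRun: RunRef = {
  kind: 'run',
  id: asRunId('r1'),
  sessionId: asSessionId('s1'),
};
const described = describeRef(exampleRun);
```

With branded IDs, passing a run ID where a session ID is expected becomes a compile-time error, which is the naming discipline expressed in code.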
The Bet: Own The Harness, Rent The Loop
Open packages/runtime/src/session.ts and the layering is explicit:
import { Agent } from '@mariozechner/pi-agent-core';
import type { Model, UserMessage, AssistantMessage } from '@mariozechner/pi-ai';
Later, Session constructs the lower-level Agent:
this.harness = new Agent({
  initialState: {
    systemPrompt,
    model: this.config.model,
    tools,
    messages: previousMessages,
    thinkingLevel: this.config.thinkingLevel ?? 'medium',
  },
  getApiKey: provider => this.getProviderApiKey(provider),
  onPayload: (payload, model) => this.applyProviderPayloadOverrides(payload, model),
  toolExecution: 'parallel',
  sessionId: options.affinityKey,
});
That code is the seam. Flue feeds pi-agent-core the current system prompt, model, tool set, message history, provider key resolver, and payload override hook. Then pi-agent-core runs the model loop.
What does Flue own around that loop?
| Layer | Owned by Flue | Rented from pi-ai / pi-agent-core |
|---|---|---|
| Model catalog | resolveModel, provider registration, Cloudflare binding attachment | Provider/model shape, stream payload semantics |
| Conversation state | SessionHistory, stores, active path, compaction entries | Message/content types consumed by providers |
| Tool surface | built-in read/write/edit/bash/grep/glob/task, custom tools, connector tools | Agent tool execution lifecycle |
| Runtime boundary | SessionEnv, SandboxApi, Node/Cloudflare/local sandbox adapters | The model deciding when to call a tool |
| Deployment surface | CLI build plugins, flue(), run store, run registry, OpenAPI | Nothing; this is Flue’s framework layer |
Flue can set transport options such as provider retention while keeping session memory as the harness-owned record.
This is the same build-vs-buy answer I argued for in the harness-engineering series: own the interfaces that encode your system’s semantics; rent the replaceable backends. Models change. Provider APIs change. But sessions, tools, sandboxes, compaction, and run history are the control plane. That is where product behavior accumulates.
The best example is the OpenAI Responses store flag. Flue can expose a
provider setting like storeResponses without pretending that hosted
provider memory is the same thing as Flue session memory. The payload hook
is deliberately at the pi-ai seam:
if (settings?.storeResponses === true) {
  return { ...(payload as Record<string, unknown>), store: true };
}
That is small code, but it teaches the boundary: provider retention is a transport setting; Flue’s session tree remains the harness record.
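To make the hook easier to reason about in isolation, here is a minimal, self-contained sketch of the same idea. Only storeResponses and the store payload field come from the snippet above; the ProviderSettings and Payload types and the function name are assumptions made for illustration, not Flue's actual signatures.

```typescript
// Hypothetical settings shape: only `storeResponses` appears in the article.
interface ProviderSettings {
  storeResponses?: boolean;
}

type Payload = Record<string, unknown>;

// Returns a new payload when an override applies, otherwise the original object.
function applyPayloadOverrides(payload: Payload, settings?: ProviderSettings): Payload {
  if (settings?.storeResponses === true) {
    return { ...payload, store: true };
  }
  return payload;
}

const basePayload: Payload = { model: 'example-model', input: 'hello' };
const withStore = applyPayloadOverrides(basePayload, { storeResponses: true });
const untouched = applyPayloadOverrides(basePayload, {});
```

The override copies the payload instead of mutating it, so the harness-side record of what was requested never changes because of a transport setting.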
Flue vs. A Claude-Code-Style Harness
Flue invites the comparison itself: “like Claude Code, but 100% headless and programmable.” The useful way to read that sentence is not “Flue is Claude Code.” It is: Flue extracted the harness pattern and made different product choices.
Shared harness primitives
┌───────────────────┬────────────────────────────────────────────┐
│ Markdown context  │ AGENTS.md, skills, roles, instructions     │
│ Filesystem        │ read/write/edit/shell as agent affordance  │
│ Delegation        │ task/subagent shape, child context         │
│ Tool loop         │ model chooses actions, runtime executes    │
└───────────────────┴────────────────────────────────────────────┘

Product split
┌─────────────────────────────┬──────────────────────────────────┐
│ Claude-Code-style harness   │ Flue                             │
├─────────────────────────────┼──────────────────────────────────┤
│ human operator in the loop  │ app/server/CI invokes it         │
│ terminal-first workflow     │ TypeScript-first runtime         │
│ local worktree default      │ Node, Cloudflare, CI, connectors │
│ conversation is the product │ agent endpoint is the product    │
│ inspect by watching         │ inspect through run APIs/logs    │
└─────────────────────────────┴──────────────────────────────────┘
The difference matters. Interactive coding agents can rely on a human operator as part of the runtime. Flue cannot. A support agent on Cloudflare, a CI triage agent, or a remote coding agent invoked over HTTP needs a different surface:
- A stable request route: /agents/:name/:id
- A stable run identity: /runs/:runId
- A run registry so the caller does not need to know the owning agent and instance for later lookup
- A public OpenAPI spec and a mountable admin API
- Sandboxes that can be virtual, local, Cloudflare-backed, or remote
Interactive tools can rely on a person watching the transcript. Flue has to make run lookup, events, and streams first-class.
That is why PR #130 matters. It is not just “an admin API.” It is Flue continuing the product split: headless agents need a runtime inspection plane because there is no TUI operator staring at the transcript.
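The registry idea is easy to picture as a lookup table from run ID to owner. The sketch below is an in-memory stand-in under stated assumptions: RunLocation, InMemoryRunRegistry, and their methods are hypothetical names, and a real registry would sit on a durable store rather than a Map.

```typescript
// Hypothetical types: the article names the run registry but not its shape.
interface RunLocation {
  agentName: string;
  instanceId: string;
}

class InMemoryRunRegistry {
  private readonly runs = new Map<string, RunLocation>();

  // Called when an invocation starts a run.
  register(runId: string, location: RunLocation): void {
    this.runs.set(runId, location);
  }

  // Callers pass only the run ID; the registry recovers the owner.
  lookup(runId: string): RunLocation | undefined {
    return this.runs.get(runId);
  }
}

const runRegistry = new InMemoryRunRegistry();
runRegistry.register('run-1', { agentName: 'triage', instanceId: 'pr-42' });

const runOwner = runRegistry.lookup('run-1');
const unknownRun = runRegistry.lookup('run-404');
```

The point of the shape is that a caller holding only run-1 can find the agent and instance later, which is what a route like /runs/:runId needs to work without extra path segments.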
Three Decisions That Make It Stick
The Flue codebase is young, but the durable design is already visible. Three choices are load-bearing.
1. A session is a tree, not a transcript
In SessionHistory, the current state is the active path to a leaf:
getLeafId(): string | null
getActivePath(): SessionEntry[]
getActivePathSince(afterLeafId: string | null): SessionEntry[]
That sounds like implementation detail until you ask what replay, compaction, and child tasks can safely do. A transcript is append-only. A tree gives you branches, active paths, compaction entries, task session metadata, and deletion of subtrees.
The user-facing lesson is simple: Flue does not merely store messages. It stores a navigable execution history. That is the difference between a chat wrapper and a harness.
A flat transcript can only grow. A tree lets Flue remove the unsafe tail, append a summary node, and retry from a safe active path.
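A toy version of that tree makes the operations concrete. This is a sketch, not Flue's SessionHistory: the Entry shape, the TreeHistory class, and removeLeaf are hypothetical, and only getActivePath borrows its name from the methods listed above.

```typescript
// Hypothetical entry shape and class; not Flue's SessionHistory.
interface Entry {
  id: string;
  parentId: string | null;
  kind: 'message' | 'summary';
  text: string;
}

class TreeHistory {
  private readonly entries = new Map<string, Entry>();
  private leafId: string | null = null;

  // New entries always attach under the current leaf.
  append(id: string, kind: Entry['kind'], text: string): void {
    this.entries.set(id, { id, parentId: this.leafId, kind, text });
    this.leafId = id;
  }

  // Drop the current leaf and move the leaf pointer back to its parent.
  removeLeaf(): void {
    if (this.leafId === null) return;
    const leaf = this.entries.get(this.leafId);
    if (!leaf) return;
    this.entries.delete(leaf.id);
    this.leafId = leaf.parentId;
  }

  // Walk parent links from the leaf to the root, then reverse.
  getActivePath(): Entry[] {
    const path: Entry[] = [];
    let cursor = this.leafId;
    while (cursor !== null) {
      const entry = this.entries.get(cursor);
      if (!entry) break;
      path.push(entry);
      cursor = entry.parentId;
    }
    return path.reverse();
  }
}

const sessionTree = new TreeHistory();
sessionTree.append('e1', 'message', 'user: fix the build');
sessionTree.append('e2', 'message', 'assistant: (failed attempt)');
sessionTree.removeLeaf(); // remove the unsafe tail
sessionTree.append('e3', 'summary', 'summary of earlier work');
const activeIds = sessionTree.getActivePath().map(entry => entry.id);
```

After the failed leaf is removed and a summary is appended, the active path is e1 then e3, so the model-visible conversation no longer contains the failed attempt.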
When the deep-dive series exists, this section should point to chapter 1:
session.ts, the active path, and replay safety.
2. Compaction is an algorithm, not a prompt trick
packages/runtime/src/compaction.ts splits the work into named stages:
deriveCompactionDefaults(...)
calculateContextTokens(...)
shouldCompact(...)
prepareCompaction(...)
compact(...)
The session calls it in two modes:
- threshold compaction, when context approaches contextWindow - reserve
- overflow recovery, when the provider reports a context overflow: Flue removes the failed assistant leaf, compacts, and retries
The important construct is not summarization alone. It is measuring, cutting, appending a summary entry, and recovering from overflow.
That second path is the important one. It means compaction is not just a cost optimization. It is failure recovery. The session tree makes that recovery possible because Flue can remove the failed leaf and re-derive the active context.
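Both modes can be sketched in a few lines. Everything here is assumed for illustration: the parameter lists, the strict greater-than comparison, the error-name convention, and the synchronous step function (a real model call would be async). Only the contextWindow - reserve threshold and the remove, compact, retry sequence come from the description above.

```typescript
// Hypothetical signatures; the article names shouldCompact(...) but not its parameters.
interface CompactionSettings {
  contextWindow: number; // model context size in tokens
  reserve: number;       // tokens kept free for the next response
}

// Threshold mode: compact once usage crosses contextWindow - reserve.
function shouldCompactSketch(contextTokens: number, settings: CompactionSettings): boolean {
  return contextTokens > settings.contextWindow - settings.reserve;
}

const OVERFLOW_NAME = 'ContextOverflow';
function overflowError(): Error {
  const err = new Error('context overflow');
  err.name = OVERFLOW_NAME;
  return err;
}

// Overflow mode: on an overflow error, compact once and retry the step.
function runWithOverflowRecovery<T>(step: () => T, compact: () => void): T {
  try {
    return step();
  } catch (err) {
    if (!(err instanceof Error) || err.name !== OVERFLOW_NAME) throw err;
    compact();
    return step();
  }
}

// Usage: the first attempt overflows, compaction shrinks the context, the retry succeeds.
const limits: CompactionSettings = { contextWindow: 100000, reserve: 10000 };
let usedTokens = 120000;
const recovered = runWithOverflowRecovery(
  () => {
    if (usedTokens > limits.contextWindow) throw overflowError();
    return `ok at ${usedTokens} tokens`;
  },
  () => { usedTokens = 20000; },
);
```

Errors that are not overflows are rethrown untouched, which keeps recovery from hiding unrelated provider failures.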
When the deep-dive series exists, this section should point to chapter 3:
compaction.ts, cut points, summarization cost, and overflow retry.
3. Deployment is part of the framework, not an example folder
Flue’s runtime is not only prompt(). It has build plugins, generated
entry points, default Hono apps, Cloudflare Durable Objects, run stores,
run registries, and OpenAPI documents. On current main,
packages/runtime/src/runtime/flue-app.ts mounts:
GET /openapi.json
POST /agents/:name/:id
GET /runs/:runId
GET /runs/:runId/events
GET /runs/:runId/stream
Deployment is part of the framework because the host changes what adapters, stores, and bindings have to exist.
This is what “framework” means. The framework is allowed to care about where the agent runs, how you inspect it later, and what stable protocol callers use. That is not incidental. It is the product.
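To see why those five routes form a protocol and not just a list, it helps to match paths against them. The route patterns below are copied from the list above; the matcher is a hypothetical stand-in written without any router library, so it says nothing about how Flue's Hono app actually dispatches.

```typescript
// Patterns copied from the article; the matcher itself is a hypothetical sketch.
const routeTable = [
  { method: 'GET', pattern: '/openapi.json' },
  { method: 'POST', pattern: '/agents/:name/:id' },
  { method: 'GET', pattern: '/runs/:runId' },
  { method: 'GET', pattern: '/runs/:runId/events' },
  { method: 'GET', pattern: '/runs/:runId/stream' },
];

interface RouteMatch {
  pattern: string;
  params: Record<string, string>;
}

function matchRoute(method: string, path: string): RouteMatch | null {
  const segments = path.split('/').filter(Boolean);
  for (const route of routeTable) {
    const patternSegments = route.pattern.split('/').filter(Boolean);
    if (route.method !== method || patternSegments.length !== segments.length) continue;
    const params: Record<string, string> = {};
    // A ":name" segment captures the value; any other segment must match exactly.
    const matched = patternSegments.every((segment, i) => {
      if (segment.startsWith(':')) {
        params[segment.slice(1)] = segments[i];
        return true;
      }
      return segment === segments[i];
    });
    if (matched) return { pattern: route.pattern, params };
  }
  return null;
}

const invoke = matchRoute('POST', '/agents/triage/pr-42');
const events = matchRoute('GET', '/runs/abc/events');
const wrongMethod = matchRoute('GET', '/agents/triage/pr-42');
```

Invocation needs an agent name and an instance ID, while every inspection route needs only a run ID. That asymmetry is the run registry showing up in the URL design.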
When the deep-dive series exists, this section should point to chapter 5: the run store, run registry, public API, admin API, and reconnectable run streams.
The Twelve-Factor Reading
The original Twelve-Factor App was written for software-as-a-service applications, not agent harnesses. Still, the lens is useful if we translate it carefully. Flue is not just modeling an agent loop. It is modeling a deployable, inspectable, configurable runtime.
| Factor | Flue training narration | Construct to study |
|---|---|---|
| Codebase | One agent source should build into multiple runtime shapes. | Agent files, runtime package, CLI package, generated entries |
| Dependencies | Rented layers stay explicit: pi-ai and pi-agent-core are dependencies, not hidden copies. | @flue/runtime, @flue/cli, package boundaries |
| Config | Provider keys, bindings, target choices, and model settings belong at runtime boundaries. | provider registration, getApiKey, Cloudflare bindings, app composition |
| Backing services | Models, stores, sandboxes, registries, and remote connectors are attached resources. | provider seam, stores, SandboxApi, run registry |
| Build, release, run | Building an agent is not the same thing as invoking an agent. | CLI build plugins, generated Node/Cloudflare entries, runtime app |
| Processes | A running handler should not be the only memory of the work. | session stores, run stores, run IDs, active paths |
| Port binding | Headless agents need a service surface, not a terminal transcript. | Hono app, /agents/:name/:id, /runs/:runId, OpenAPI |
| Concurrency | Parallel tool execution and multiple runs must preserve scoped state. | toolExecution: 'parallel', sessions, run identity |
| Disposability | Long-running agents need timeout, retry, and recovery semantics. | bash timeout propagation, stream handling, overflow compaction |
| Dev/prod parity | Local, CI, Node, and Cloudflare should teach the same harness concepts. | build targets, adapters, local and remote sandbox modes |
| Logs | A headless run needs event streams and replayable history. | run events, stream routes, run store, run registry |
| Admin processes | Inspection and maintenance should be first-class runtime operations. | admin API, one-off runs, registry lookup, generated SDK surface |
The more headless the agent becomes, the more the harness has to behave like a cloud application.
This is the richer narration: Flue’s constructs are not accidental. They are the agent-harness version of cloud-native pressure. Config has to move out of the code path. Services have to be attachable. Runs need durable identity. Logs have to become event streams. Admin inspection cannot be an afterthought. The more headless the agent becomes, the more twelve-factor the harness has to feel.
The Patches Were Evidence, Not The Spine
The previous version of this draft centered my three merged patches. That was honest, but structurally wrong. The patches are not the story. The story is the boundary. The patches are evidence that the boundary is real.
| PR | What it fixed | Boundary it exposed |
|---|---|---|
| #25 | The built-in bash tool advertised timeout but dropped it before SessionEnv.exec. | A tool schema is a promise to the model. If the runtime ignores it, the model cannot know the promise was broken. |
| #71 | Long flue run sessions hit Node/undici’s 300s idle timeout path. | “Long-running agent” is not one feature. It is every timeout between caller, server, stream, and tool. |
| #102 | flue.config.ts added project config and model registration; later design moved provider registration toward runtime app code. | The lifecycle phase where config runs is part of the contract. Build-time and runtime secrets are different worlds. |
| #121 | Split @flue/runtime from @flue/cli, leaving @flue/sdk as a migration/client surface. | Package boundaries should match ownership boundaries: runtime code separate from build/dev tooling. |
| #130 | Added run registry and public/admin OpenAPI specs. | Headless agents need an inspection API because there is no human operator watching a TUI. |
The PRs are useful because each one shows where framework behavior has to hold across a runtime boundary.
That table is the source-level version of The Agent Loop Is a Lie. The tidy loop diagram hides the boundary work. Flue’s bugs and PRs are exactly where the tidy diagram stops being useful.
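The first row is the easiest one to reproduce in miniature. The sketch below is not Flue's bash tool: ExecEnv, BashArgs, both function names, and the seconds-to-milliseconds conversion are assumptions chosen to show the shape of the bug, where a parameter the schema advertises never reaches the execution boundary.

```typescript
// Hypothetical names throughout; this is not Flue's bash tool or SessionEnv.
interface ExecOptions { timeoutMs?: number }
interface ExecEnv { exec(command: string, options: ExecOptions): string }

// What the model is told it may send. The unit (seconds) is an assumption.
interface BashArgs { command: string; timeout?: number }

// Broken: the schema advertises `timeout`, but it never reaches exec().
function runBashDroppingTimeout(env: ExecEnv, args: BashArgs): string {
  return env.exec(args.command, {});
}

// Fixed: the advertised parameter is forwarded across the boundary.
function runBashHonoringTimeout(env: ExecEnv, args: BashArgs): string {
  const options: ExecOptions =
    args.timeout === undefined ? {} : { timeoutMs: args.timeout * 1000 };
  return env.exec(args.command, options);
}

// A recording fake makes the broken promise visible without running a shell.
const recordedOptions: ExecOptions[] = [];
const fakeEnv: ExecEnv = {
  exec(_command: string, options: ExecOptions): string {
    recordedOptions.push(options);
    return '';
  },
};

runBashDroppingTimeout(fakeEnv, { command: 'sleep 60', timeout: 5 });
runBashHonoringTimeout(fakeEnv, { command: 'sleep 60', timeout: 5 });
```

From the model's side both calls look identical, which is why this class of bug only shows up when you test at the runtime boundary instead of at the tool schema.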
How To Read Flue Yourself
If you want to learn Flue deeply, do not start with the quickstart and stop there. The quickstart teaches the API. The source teaches the runtime.
Use this order:
- Read README.md for the product contract. It tells you the intended mental model: Claude-Code-like, headless, TypeScript, runtime-agnostic.
- Read packages/runtime/src/harness.ts. Find what init() returns. Whatever owns sessions, fs, and the sandbox is the harness.
- Read packages/runtime/src/session.ts. This is the core. Watch how SessionHistory becomes pi-agent-core messages, how roles/models/tools are scoped per call, and how child tasks are created.
- Read packages/runtime/src/agent.ts with sandbox.ts beside it. Tool schemas live in one file; their runtime promises cross into the sandbox in the other.
- Read compaction.ts only after the session tree makes sense. Otherwise compaction looks like summarization. It is actually tree surgery plus failure recovery.
- Read runtime/flue-app.ts and the CLI build plugins last. That is where “library” becomes “framework”: HTTP routes, run identity, Cloudflare/Node differences, OpenAPI, and generated entries.
README
  │ product promise
  ▼
harness.ts
  │ what init() returns
  ▼
session.ts
  │ tree, roles, model scoping, task sessions
  ▼
agent.ts + sandbox.ts
  │ tool contracts crossing into runtime reality
  ▼
compaction.ts
  │ threshold + overflow recovery
  ▼
flue-app.ts + build plugins
  │ deployment protocol, run identity, OpenAPI
  ▼
PR history
  where the abstractions were stress-tested
The Deep-Dive Series Split
This post should stay the hub. It answers a senior-engineer question: why is Flue shaped this way, and what makes the shape durable?
The deep-dive series should answer a different question: could you safely modify this subsystem?
That means this post should not explain every line of session.ts,
compaction.ts, sandbox.ts, or flue-app.ts. It should point to the
chapters that do. The draft series now lives in
src/content/deepDives/flue-framework/.
The split:
| Topic | Hub post job | Deep-dive chapter job |
|---|---|---|
| pi-ai layering | Name the bet: own harness, rent loop | Trace the call from Session.prompt() into new Agent(...) and provider resolution |
| Session tree | Explain why a tree is sticky | Walk SessionHistory methods, stores, task session metadata, and deletion |
| Compaction | Explain why it is architecture | Walk threshold, overflow, cut point selection, summary entries, usage accounting |
| Sandbox/tools | Explain schema-as-promise | Walk each built-in tool and how SessionEnv/SandboxApi enforce or reject it |
| Runtime API | Explain why headless needs inspection | Walk run store, run registry, /runs/:runId, OpenAPI, admin routes |
| Build targets | Explain framework, not library | Walk Node vs Cloudflare generated entries and deployment constraints |
The hub gives judgment. Each deep dive should earn one diagram, one source trace, and one failure mode.
That hub/spoke pattern is the same shape as the harness-engineering series and the production-agents series. The hub gives judgment. The series gives mechanics.
Key Takeaways
- Flue is interesting because of the seam. It rents pi-ai’s model loop and owns the harness layer around it.
- The domain language is the curriculum. Harness, session, active path, tool contract, provider seam, run, registry, sandbox, and build target should mean the same thing in every Flue training.
- The harness layer is the product. Sessions, tools, sandboxes, compaction, run identity, OpenAPI, and deployment targets are where production behavior accumulates.
- Twelve-factor pressure explains the constructs. Config, backing services, disposability, logs, admin processes, and build/release/run are why a headless agent framework needs more than a model loop.
- A session tree beats a transcript. Replay, compaction, task sessions, and deletion all become more legible when state is a tree with an active path.
- Headless agents need inspection APIs. PR #130 is a framework move: run IDs, registries, public/admin OpenAPI, and SDK scaffolding replace the missing human operator.
- Patches are evidence. PRs #25, #71, #102, #121, and #130 show the same lesson from different sides: features hold only when the runtime boundary holds.
Flue is young. That is good for learning. The code is still small enough to read, the boundaries are visible, and the PR history shows the design being pressure-tested in public. If your job is to build serious agents, that is the rare moment to study a framework: early enough to see the decisions, real enough that the decisions have consequences.
Sources
- withastro/flue README
- The Twelve-Factor App
- PR #25: honor bash tool timeout parameter end-to-end
- PR #71: keep SSE streams alive past Node 300s timeout
- PR #102: flue.config.ts with target / setup / models map
- PR #121: rename @flue/sdk to @flue/runtime and move build/dev tooling into @flue/cli
- PR #130: run registry, admin API, and OpenAPI specs