Flue Under the Hood: Why This Agent Harness Holds

Senior engineers have seen enough frameworks die to mistrust the launch sentence. The interesting question is not “what does this framework do?” The interesting question is: what bet does it make that will still make sense after the API churn?

Flue calls itself “The Agent Harness Framework.” Its README says it is like Claude Code, but headless and programmable: no TUI, no GUI, just TypeScript. That sentence is a good hook, but it is not the reason Flue is interesting.

The reason is the boundary it chooses:

Flue owns the harness layer and rents the model loop.

That is the design decision that makes the framework worth studying. Flue does not try to become another model SDK. It lets @mariozechner/pi-ai and @mariozechner/pi-agent-core carry provider metadata, streaming semantics, tool-call execution, model catalogs, and the lower-level agent loop. Flue builds above that: sessions, skills, roles, sandboxes, compaction, run records, deployment targets, and the public HTTP surface.

That split is what turns “Claude Code, but headless” from a slogan into an architecture.

Animated Excalidraw-style architecture sketch showing a thin agent file flowing into the Flue harness control plane, where sessions, tools, sandboxing, compaction, run identity, OpenAPI, and deployment targets are owned while the model loop is delegated to pi-ai and pi-agent-core. — Flue owns the harness layer and rents the model loop

Animated diagram showing the model loop inside the Flue harness: prompt, model, decision, tool execution, observation, and result. — The loop is only one part of the framework

The Snapshot

I checked the repo through the GitHub API on 2026-05-16. The current main head I used is dbaa9eff, the merge commit for PR #130, “Add run registry, admin API, and OpenAPI specs.” The runtime package is @flue/runtime at version 0.5.3; @flue/sdk remains as a separate client/migration surface. The README still warns that Flue is experimental and APIs may change.

So treat every file path here as a versioned observation, not a permanent contract. The learning goal is the method: read the runtime boundary, then reason about features from that boundary.

The files that matter for the mental model:

FLUE RUNTIME MAP

.flue/agents/*.ts
     │  handler(ctx) calls ctx.init(...)
     ▼
┌────────────────────────────────────────────────────────┐
│ Harness                                                │
│ packages/runtime/src/harness.ts                        │
│ owns sessions, fs, env, open child sessions            │
└──────────────┬─────────────────────────────────────────┘
               │ session()
               ▼
┌────────────────────────────────────────────────────────┐
│ Session                                                │
│ packages/runtime/src/session.ts                        │
│ owns SessionHistory, prompt(), skill(), task(), shell() │
└───────┬─────────────────────┬──────────────────────────┘
        │ wraps               │ uses
        ▼                     ▼
┌─────────────────┐     ┌────────────────────────────────┐
│ pi-agent-core   │     │ compaction.ts                   │
│ Agent           │     │ threshold + overflow recovery   │
└────────┬────────┘     └────────────────────────────────┘
         │
         ▼
┌────────────────────────────────────────────────────────┐
│ agent.ts + sandbox.ts                                  │
│ built-in tools, task tool, SessionEnv, SandboxApi       │
└────────────────────────────────────────────────────────┘

deploy surface:
┌────────────────────────────────────────────────────────┐
│ cli build plugins + runtime/flue-app.ts                 │
│ Node, Cloudflare, run registry, /runs/:runId, OpenAPI   │
└────────────────────────────────────────────────────────┘

That is the article in one diagram. Your agent file is thin. The harness is where the product lives.

The Ubiquitous Language

The best way to teach Flue is not as “a pile of TypeScript files.” It is a domain model for headless agent execution. Domain-driven design gives us the right move: define the language first, then read the code through that language.

Here is the vocabulary I would use across every Flue training, blog post, diagram, and deep dive:

| Term | Meaning in the Flue domain | Why it matters |
| --- | --- | --- |
| Agent file | User-authored TypeScript handler under `.flue/agents`. | It declares the job, but should not own runtime mechanics. |
| Invocation | A request to run an agent with an identity, input payload, and runtime context. | It is the external command entering the domain. |
| Harness | The aggregate root for a run-capable agent environment: sessions, filesystem, env, sandbox, and child sessions. | It is where Flue owns behavior instead of delegating to the model SDK. |
| Session | A scoped execution conversation with tools, roles, model settings, and persisted history. | It is the unit of agent work. |
| Session history | The tree of entries behind a session, with an active path and leaf. | It makes replay, compaction, task sessions, and deletion explainable. |
| Active path | The path from root to current leaf that becomes the model-visible conversation. | It separates stored history from runnable context. |
| Tool contract | A schema plus runtime behavior that the model can rely on. | It is not documentation; it is a promise the harness must keep. |
| Session environment | The capability boundary exposed to tools: exec, file operations, sandbox behavior. | It is where agent intent meets runtime reality. |
| Provider seam | The boundary where Flue resolves models, provider settings, API keys, and payload overrides before pi-ai executes. | It lets Flue rent the model loop without surrendering harness semantics. |
| Compaction | A state transition that replaces old active-path context with a summary entry. | It is context management, replay safety, and failure recovery. |
| Run | A concrete execution record produced by an invocation. | It gives headless work an inspectable identity. |
| Run registry | The lookup surface that maps run IDs back to agent instances and stores. | It replaces the missing human watching a terminal. |
| Build target | The generated runtime shape for Node, Cloudflare, CI, or another host. | It makes deployment part of the framework contract. |

Bounded-context map showing Flue domain terms: invocation, harness, session, history, tools, environment, provider seam, and run registry. — The language is the domain model

FLUE CONTEXT MAP

Agent Definition Context
.flue/agents/*.ts
     │ declares intent
     ▼
Harness Context
init(), sessions, fs, env, child sessions
     │ opens work scope
     ▼
Session Context
SessionHistory, active path, roles, compaction
     │ asks for model/tool steps
     ├──────────────► Provider Context
     │                pi-ai, pi-agent-core, provider payloads
     │
     ├──────────────► Execution Context
     │                tools, SessionEnv, SandboxApi
     │
     └──────────────► Observation Context
                      run store, run registry, events, OpenAPI

Deployment Context wraps the whole system:
Node, Cloudflare, CI, generated entries, bindings, config

This language matters because naming drift creates bad training. If “session” sometimes means transcript, sometimes run, and sometimes HTTP request, the reader cannot build a mental model. In Flue, those are different domain objects. A session owns conversation state. A run owns execution identity. A provider owns model transport. A sandbox owns runtime capabilities. The framework holds because those words do not collapse into one another.
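One way to keep those words from collapsing in code is to give each identity its own type. The sketch below is illustrative only: `SessionId`, `RunId`, `SessionRecord`, and `RunRecord` are hypothetical names invented for this example, not types exported by Flue.

```typescript
// Hypothetical sketch: none of these types come from @flue/runtime.
// Branded strings stop a session ID from being passed where a run ID is expected.
type Brand<T, B extends string> = T & { readonly __brand: B };

type SessionId = Brand<string, 'SessionId'>;
type RunId = Brand<string, 'RunId'>;

// A session owns conversation state.
interface SessionRecord {
  id: SessionId;
  entryCount: number;
}

// A run owns execution identity and points at the session it used.
interface RunRecord {
  id: RunId;
  agentName: string;
  sessionId: SessionId;
  status: 'running' | 'completed' | 'failed';
}

const asSessionId = (raw: string): SessionId => raw as SessionId;
const asRunId = (raw: string): RunId => raw as RunId;

function describeRun(run: RunRecord): string {
  return `run ${run.id} of agent ${run.agentName} used session ${run.sessionId} (${run.status})`;
}

const exampleRun: RunRecord = {
  id: asRunId('run_1'),
  agentName: 'triage',
  sessionId: asSessionId('sess_1'),
  status: 'completed',
};

console.log(describeRun(exampleRun));
// Passing exampleRun.sessionId as the run's `id` would be rejected by the type checker.
```

The runtime values are still plain strings; the separation lives entirely in the type system, which is where naming drift is cheapest to catch.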

The Bet: Own The Harness, Rent The Loop

Open packages/runtime/src/session.ts and the layering is explicit:

import { Agent } from '@mariozechner/pi-agent-core';
import type { Model, UserMessage, AssistantMessage } from '@mariozechner/pi-ai';

Later, Session constructs the lower-level Agent:

this.harness = new Agent({
  initialState: {
    systemPrompt,
    model: this.config.model,
    tools,
    messages: previousMessages,
    thinkingLevel: this.config.thinkingLevel ?? 'medium',
  },
  getApiKey: provider => this.getProviderApiKey(provider),
  onPayload: (payload, model) => this.applyProviderPayloadOverrides(payload, model),
  toolExecution: 'parallel',
  sessionId: options.affinityKey,
});

That code is the seam. Flue feeds pi-agent-core the current system prompt, model, tool set, message history, provider key resolver, and payload override hook. Then pi-agent-core runs the model loop.

What does Flue own around that loop?

| Layer | Owned by Flue | Rented from pi-ai / pi-agent-core |
| --- | --- | --- |
| Model catalog | `resolveModel`, provider registration, Cloudflare binding attachment | Provider/model shape, stream payload semantics |
| Conversation state | `SessionHistory`, stores, active path, compaction entries | Message/content types consumed by providers |
| Tool surface | built-in `read/write/edit/bash/grep/glob/task`, custom tools, connector tools | Agent tool execution lifecycle |
| Runtime boundary | `SessionEnv`, `SandboxApi`, Node/Cloudflare/local sandbox adapters | The model deciding when to call a tool |
| Deployment surface | CLI build plugins, `flue()`, run store, run registry, OpenAPI | Nothing; this is Flue’s framework layer |

Animated provider seam diagram showing Flue session memory and tool contracts on one side, pi-agent-core and pi-ai provider transport on the other, and provider payload options crossing the seam. — Provider settings cross the seam; ownership does not

This is the same build-vs-buy answer I argued for in the harness-engineering series: own the interfaces that encode your system’s semantics; rent the replaceable backends. Models change. Provider APIs change. But sessions, tools, sandboxes, compaction, and run history are the control plane. That is where product behavior accumulates.

The best example is the OpenAI Responses store flag. Flue can expose a provider setting like storeResponses without pretending that hosted provider memory is the same thing as Flue session memory. The payload hook is deliberately at the pi-ai seam:

if (settings?.storeResponses === true) {
  return { ...(payload as Record<string, unknown>), store: true };
}

That is small code, but it teaches the boundary: provider retention is a transport setting; Flue’s session tree remains the harness record.
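To make the excerpt runnable in isolation, here is a minimal sketch of a payload hook built around it. The `ProviderSettings` interface, the function name `applyPayloadOverrides`, and the pass-through behavior when the flag is absent are my assumptions; only the `storeResponses` check and the `store: true` spread come from the excerpt above.

```typescript
// Hypothetical sketch of a payload override hook. Only the storeResponses check
// mirrors the excerpt; the types and the fall-through branch are assumptions.
interface ProviderSettings {
  storeResponses?: boolean;
}

function applyPayloadOverrides(payload: unknown, settings?: ProviderSettings): unknown {
  if (settings?.storeResponses === true) {
    // Copy rather than mutate, so the caller's payload object stays untouched.
    return { ...(payload as Record<string, unknown>), store: true };
  }
  // Assumed behavior: without the flag, the payload passes through unchanged.
  return payload;
}

const basePayload = { model: 'example-model', input: 'hello' };
const overridden = applyPayloadOverrides(basePayload, { storeResponses: true }) as Record<string, unknown>;
const untouched = applyPayloadOverrides(basePayload);

console.log(overridden['store'], untouched === basePayload);
```

Nothing in this function reads or writes session history. The flag changes what the provider is asked to retain, and that is all it changes.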

Flue vs. A Claude-Code-Style Harness

Flue invites the comparison itself: “like Claude Code, but 100% headless and programmable.” The useful way to read that sentence is not “Flue is Claude Code.” It is: Flue extracted the harness pattern and made different product choices.

CLAUDE-CODE-STYLE HARNESS VS FLUE

Shared harness primitives
┌───────────────────┬────────────────────────────────────────────┐
│ Markdown context  │ AGENTS.md, skills, roles, instructions     │
│ Filesystem        │ read/write/edit/shell as agent affordance  │
│ Delegation        │ task/subagent shape, child context         │
│ Tool loop         │ model chooses actions, runtime executes    │
└───────────────────┴────────────────────────────────────────────┘

Product split
┌─────────────────────────────┬──────────────────────────────────┐
│ Claude-Code-style harness   │ Flue                             │
├─────────────────────────────┼──────────────────────────────────┤
│ human operator in the loop  │ app/server/CI invokes it         │
│ terminal-first workflow     │ TypeScript-first runtime         │
│ local worktree default      │ Node, Cloudflare, CI, connectors │
│ conversation is the product │ agent endpoint is the product    │
│ inspect by watching         │ inspect through run APIs/logs    │
└─────────────────────────────┴──────────────────────────────────┘

The difference matters. Interactive coding agents can rely on a human operator as part of the runtime. Flue cannot. A support agent on Cloudflare, a CI triage agent, or a remote coding agent invoked over HTTP needs a different surface:

A stable request route: /agents/:name/:id
A stable run identity: /runs/:runId
A run registry so the caller does not need to know the owning agent and instance for later lookup
A public OpenAPI spec and a mountable admin API
Sandboxes that can be virtual, local, Cloudflare-backed, or remote

Comparison diagram showing an interactive terminal harness with a human operator versus a Flue headless runtime with run IDs, registry, events, streams, and OpenAPI routes. — Headless means inspection has to become an API

That is why PR #130 matters. It is not just “an admin API.” It is Flue continuing the product split: headless agents need a runtime inspection plane because there is no TUI operator staring at the transcript.
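The registry idea is small enough to sketch. This is an in-memory illustration under my own assumptions: `SketchRunRegistry`, `RunOwner`, and the duplicate-registration rule are hypothetical, and the real registry in Flue may store and resolve runs differently.

```typescript
// Hypothetical sketch of a run registry; names and storage are assumptions.
interface RunOwner {
  agentName: string; // the :name segment of /agents/:name/:id
  instanceId: string; // the :id segment of /agents/:name/:id
}

class SketchRunRegistry {
  private readonly owners = new Map<string, RunOwner>();

  // Called when an invocation starts and a run ID is minted.
  register(runId: string, owner: RunOwner): void {
    if (this.owners.has(runId)) {
      throw new Error(`run ${runId} is already registered`);
    }
    this.owners.set(runId, owner);
  }

  // Called by a /runs/:runId style lookup: the caller only knows the run ID.
  lookup(runId: string): RunOwner | undefined {
    return this.owners.get(runId);
  }
}

const sketchRegistry = new SketchRunRegistry();
sketchRegistry.register('run_001', { agentName: 'triage', instanceId: 'ticket-42' });

const runOwner = sketchRegistry.lookup('run_001');
console.log(runOwner ? `/agents/${runOwner.agentName}/${runOwner.instanceId}` : 'unknown run');
```

The point of the indirection is that a later caller holds one opaque ID and nothing else, which is the position a dashboard, a webhook consumer, or a retrying CI job is usually in.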

Three Decisions That Make It Stick

The Flue codebase is young, but the durable design is already visible. Three choices are load-bearing.

1. A session is a tree, not a transcript

In SessionHistory, the current state is the active path to a leaf:

getLeafId(): string | null
getActivePath(): SessionEntry[]
getActivePathSince(afterLeafId: string | null): SessionEntry[]

That sounds like an implementation detail until you ask what replay, compaction, and child tasks can safely do. A transcript is append-only. A tree gives you branches, active paths, compaction entries, task session metadata, and deletion of subtrees.

The user-facing lesson is simple: Flue does not merely store messages. It stores a navigable execution history. That is the difference between a chat wrapper and a harness.
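A toy version of the idea makes the difference concrete. The three method names follow the signatures quoted above; the entry shape, `append`, `setLeaf`, and the fallback in `getActivePathSince` are assumptions of this sketch, not a description of Flue's `SessionHistory`.

```typescript
// Hypothetical sketch of a session tree. Only the three getter names come from
// the signatures above; the rest is assumed for illustration.
interface TreeEntry {
  id: string;
  parentId: string | null;
  text: string;
}

class SketchSessionHistory {
  private readonly entries = new Map<string, TreeEntry>();
  private leafId: string | null = null;

  // Append a new entry under the current leaf and make it the new leaf.
  append(id: string, text: string): void {
    this.entries.set(id, { id, parentId: this.leafId, text });
    this.leafId = id;
  }

  // Move the leaf pointer to an earlier entry; the next append starts a branch.
  setLeaf(id: string | null): void {
    if (id !== null && !this.entries.has(id)) throw new Error(`unknown entry ${id}`);
    this.leafId = id;
  }

  getLeafId(): string | null {
    return this.leafId;
  }

  // Walk parent pointers from the leaf to the root, then reverse to root-first order.
  getActivePath(): TreeEntry[] {
    const path: TreeEntry[] = [];
    let cursor = this.leafId;
    while (cursor !== null) {
      const entry = this.entries.get(cursor);
      if (!entry) throw new Error(`dangling parent ${cursor}`);
      path.push(entry);
      cursor = entry.parentId;
    }
    return path.reverse();
  }

  // Entries on the active path strictly after the given entry.
  // Assumption: null or an ID that is off the path returns the whole path.
  getActivePathSince(afterLeafId: string | null): TreeEntry[] {
    const path = this.getActivePath();
    if (afterLeafId === null) return path;
    const index = path.findIndex(entry => entry.id === afterLeafId);
    return index === -1 ? path : path.slice(index + 1);
  }
}

const sketchTree = new SketchSessionHistory();
sketchTree.append('u1', 'fix the failing test');
sketchTree.append('a1', 'first attempt');
sketchTree.setLeaf('u1'); // rewind to the user message
sketchTree.append('a2', 'second attempt'); // a1 stays stored, but off the active path

console.log(sketchTree.getActivePath().map(entry => entry.id));
```

After the rewind, `a1` still exists in storage but no longer reaches the model. A flat transcript cannot express that distinction without deleting data.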

Animated Excalidraw-style diagram showing a Flue session tree where an overflowed assistant leaf is removed, a compaction summary is appended, and a retry continues from the compacted active path. — Session tree plus compaction recovery

When the deep-dive series exists, this section should point to chapter 1: session.ts, the active path, and replay safety.

2. Compaction is an algorithm, not a prompt trick

packages/runtime/src/compaction.ts splits the work into named stages:

deriveCompactionDefaults(...)
calculateContextTokens(...)
shouldCompact(...)
prepareCompaction(...)
compact(...)

The session calls it in two modes:

threshold compaction: runs when context usage approaches contextWindow - reserve
overflow recovery: when the provider reports a context overflow, Flue removes the failed assistant leaf, compacts, and retries

Animated flow showing Flue compaction as threshold measurement, cut point preparation, summary entry creation, overflow recovery, and retry. — Compaction is a runtime state transition

That second path is the important one. It means compaction is not just a cost optimization. It is failure recovery. The session tree makes that recovery possible because Flue can remove the failed leaf and re-derive the active context.
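Both modes fit in a few lines once the active path is treated as data. This is a simplified sketch under stated assumptions: the names `shouldCompactSketch` and `recoverFromOverflow`, the `failed` marker, the strict `>` comparison, and the keep-the-last-N cut rule are all mine, and the real cut-point selection in compaction.ts is likely more careful.

```typescript
// Hypothetical sketch: names, shapes, and the comparison rule are assumptions,
// not the contents of packages/runtime/src/compaction.ts.
interface SketchCompactionSettings {
  contextWindow: number; // model context size in tokens
  reserveTokens: number; // headroom kept free for the next response
}

interface PathEntry {
  role: 'user' | 'assistant' | 'summary';
  text: string;
  failed?: boolean; // marks an assistant leaf whose provider call overflowed
}

// Threshold mode: compact once usage crosses contextWindow - reserve.
function shouldCompactSketch(contextTokens: number, settings: SketchCompactionSettings): boolean {
  return contextTokens > settings.contextWindow - settings.reserveTokens;
}

// Overflow mode: drop the failed leaf, replace older entries with one summary,
// and return the path the retry should run against.
function recoverFromOverflow(
  activePath: PathEntry[],
  keepRecent: number,
  summarize: (older: PathEntry[]) => string,
): PathEntry[] {
  const last = activePath[activePath.length - 1];
  const withoutFailedLeaf = last?.failed ? activePath.slice(0, -1) : activePath;
  const cutIndex = Math.max(0, withoutFailedLeaf.length - keepRecent);
  if (cutIndex === 0) return withoutFailedLeaf; // nothing old enough to summarize
  const older = withoutFailedLeaf.slice(0, cutIndex);
  const recent = withoutFailedLeaf.slice(cutIndex);
  return [{ role: 'summary', text: summarize(older) }, ...recent];
}

const overflowedPath: PathEntry[] = [
  { role: 'user', text: 'clone the repo' },
  { role: 'assistant', text: 'cloned' },
  { role: 'user', text: 'run the tests' },
  { role: 'assistant', text: '', failed: true },
];

const retryPath = recoverFromOverflow(overflowedPath, 1, older => `summary of ${older.length} entries`);
console.log(retryPath.map(entry => `${entry.role}: ${entry.text}`));
```

The retry path here holds one summary entry plus the most recent user message. The failed leaf never reaches the model again, which is the property the tree makes cheap.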

When the deep-dive series exists, this section should point to chapter 3: compaction.ts, cut points, summarization cost, and overflow retry.

3. Deployment is part of the framework, not an example folder

Flue’s runtime is not only prompt(). It has build plugins, generated entry points, default Hono apps, Cloudflare Durable Objects, run stores, run registries, and OpenAPI documents. On current main, packages/runtime/src/runtime/flue-app.ts mounts:

GET  /openapi.json
POST /agents/:name/:id
GET  /runs/:runId
GET  /runs/:runId/events
GET  /runs/:runId/stream

Animated diagram showing one Flue agent source building into Node, Cloudflare, CI, and remote sandbox targets while preserving the same harness vocabulary. — One agent source, multiple runtime shapes

This is what “framework” means. The framework is allowed to care about where the agent runs, how you inspect it later, and what stable protocol callers use. That is not incidental. It is the product.
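To see how a caller-visible path resolves to parameters, here is a tiny matcher over the five routes listed above. It is a stand-in written for this post, not Hono and not the code in flue-app.ts; `sketchRoutes`, `matchPattern`, and `resolveRoute` are hypothetical names.

```typescript
// Hypothetical sketch: a minimal path matcher over the routes listed above.
// It is not Hono's router and not Flue's implementation.
const sketchRoutes = [
  { method: 'GET', pattern: '/openapi.json' },
  { method: 'POST', pattern: '/agents/:name/:id' },
  { method: 'GET', pattern: '/runs/:runId' },
  { method: 'GET', pattern: '/runs/:runId/events' },
  { method: 'GET', pattern: '/runs/:runId/stream' },
];

// Compare segment by segment; ":param" segments capture, literal segments must match.
function matchPattern(pattern: string, path: string): Record<string, string> | null {
  const patternParts = pattern.split('/').filter(Boolean);
  const pathParts = path.split('/').filter(Boolean);
  if (patternParts.length !== pathParts.length) return null;
  const params: Record<string, string> = {};
  for (let i = 0; i < patternParts.length; i++) {
    const expected = patternParts[i] ?? '';
    const actual = pathParts[i] ?? '';
    if (expected.startsWith(':')) {
      params[expected.slice(1)] = decodeURIComponent(actual);
    } else if (expected !== actual) {
      return null;
    }
  }
  return params;
}

function resolveRoute(
  method: string,
  path: string,
): { pattern: string; params: Record<string, string> } | null {
  for (const route of sketchRoutes) {
    if (route.method !== method) continue;
    const params = matchPattern(route.pattern, path);
    if (params) return { pattern: route.pattern, params };
  }
  return null;
}

console.log(resolveRoute('POST', '/agents/triage/ticket-42'));
console.log(resolveRoute('GET', '/runs/run_001/events'));
```

Invocation is addressed by agent name and instance, while inspection is addressed by run ID alone. That asymmetry is what the run registry exists to bridge.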

When the deep-dive series exists, this section should point to chapter 5: the run store, run registry, public API, admin API, and reconnectable run streams.

The Twelve-Factor Reading

The original Twelve-Factor App was written for software-as-a-service applications, not agent harnesses. Still, the lens is useful if we translate it carefully. Flue is not just modeling an agent loop. It is modeling a deployable, inspectable, configurable runtime.

| Factor | Flue training narration | Construct to study |
| --- | --- | --- |
| Codebase | One agent source should build into multiple runtime shapes. | Agent files, runtime package, CLI package, generated entries |
| Dependencies | Rented layers stay explicit: pi-ai and pi-agent-core are dependencies, not hidden copies. | `@flue/runtime`, `@flue/cli`, package boundaries |
| Config | Provider keys, bindings, target choices, and model settings belong at runtime boundaries. | provider registration, `getApiKey`, Cloudflare bindings, app composition |
| Backing services | Models, stores, sandboxes, registries, and remote connectors are attached resources. | provider seam, stores, `SandboxApi`, run registry |
| Build, release, run | Building an agent is not the same thing as invoking an agent. | CLI build plugins, generated Node/Cloudflare entries, runtime app |
| Processes | A running handler should not be the only memory of the work. | session stores, run stores, run IDs, active paths |
| Port binding | Headless agents need a service surface, not a terminal transcript. | Hono app, `/agents/:name/:id`, `/runs/:runId`, OpenAPI |
| Concurrency | Parallel tool execution and multiple runs must preserve scoped state. | `toolExecution: 'parallel'`, sessions, run identity |
| Disposability | Long-running agents need timeout, retry, and recovery semantics. | bash timeout propagation, stream handling, overflow compaction |
| Dev/prod parity | Local, CI, Node, and Cloudflare should teach the same harness concepts. | build targets, adapters, local and remote sandbox modes |
| Logs | A headless run needs event streams and replayable history. | run events, stream routes, run store, run registry |
| Admin processes | Inspection and maintenance should be first-class runtime operations. | admin API, one-off runs, registry lookup, generated SDK surface |

Grouped twelve-factor diagram mapping source, configuration, runtime, and operations pressure to Flue constructs such as provider keys, bindings, run IDs, stores, logs, and admin APIs. — Twelve-factor pressure explains the framework surface

This is the richer narration: Flue’s constructs are not accidental. They are the agent-harness version of cloud-native pressure. Config has to move out of the code path. Services have to be attachable. Runs need durable identity. Logs have to become event streams. Admin inspection cannot be an afterthought. The more headless the agent becomes, the more twelve-factor the harness has to feel.

The Patches Were Evidence, Not The Spine

The previous version of this draft centered my three merged patches. That was honest, but structurally wrong. The patches are not the story. The story is the boundary. The patches are evidence that the boundary is real.

| PR | What it fixed | Boundary it exposed |
| --- | --- | --- |
| #25 | The built-in `bash` tool advertised `timeout` but dropped it before `SessionEnv.exec`. | A tool schema is a promise to the model. If the runtime ignores it, the model cannot know the promise was broken. |
| #71 | Long `flue run` sessions hit Node/undici’s 300s idle timeout path. | “Long-running agent” is not one feature. It is every timeout between caller, server, stream, and tool. |
| #102 | `flue.config.ts` added project config and model registration; later design moved provider registration toward runtime app code. | The lifecycle phase where config runs is part of the contract. Build-time and runtime secrets are different worlds. |
| #121 | Split `@flue/runtime` from `@flue/cli`, leaving `@flue/sdk` as a migration/client surface. | Package boundaries should match ownership boundaries: runtime code separate from build/dev tooling. |
| #130 | Added run registry and public/admin OpenAPI specs. | Headless agents need an inspection API because there is no human operator watching a TUI. |

Animated diagram showing PRs 25, 71, 102, 121, and 130 as stress tests against Flue harness boundaries. — Patches reveal the boundaries under stress

That table is the source-level version of “The Agent Loop Is a Lie.” The tidy loop diagram hides the boundary work. Flue’s bugs and PRs are exactly where the tidy diagram stops being useful.
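The PR #25 row describes a bug class that is easy to reproduce in miniature. The sketch below is not the patch itself: `SketchEnv`, `BashToolArgs`, both `runBash...` functions, and the choice of seconds for the advertised `timeout` are hypothetical stand-ins for the real tool and `SessionEnv.exec`.

```typescript
// Hypothetical sketch of the bug class; these interfaces are not Flue's real
// SessionEnv, and the seconds-to-milliseconds units are an assumption.
interface ExecOptions {
  timeoutMs?: number;
}

interface SketchEnv {
  exec(command: string, options?: ExecOptions): string;
}

// What the tool schema advertises to the model.
interface BashToolArgs {
  command: string;
  timeout?: number; // seconds (assumed)
}

// Broken shape: the schema accepts `timeout`, but it never reaches exec.
function runBashDroppingTimeout(env: SketchEnv, args: BashToolArgs): string {
  return env.exec(args.command);
}

// Fixed shape: every advertised field is forwarded to the runtime.
function runBashForwardingTimeout(env: SketchEnv, args: BashToolArgs): string {
  const options: ExecOptions = args.timeout === undefined ? {} : { timeoutMs: args.timeout * 1000 };
  return env.exec(args.command, options);
}

// A recording environment shows what the runtime actually received.
const seenTimeouts: Array<number | undefined> = [];
const recordingEnv: SketchEnv = {
  exec(command, options) {
    seenTimeouts.push(options?.timeoutMs);
    return `ran: ${command}`;
  },
};

runBashDroppingTimeout(recordingEnv, { command: 'sleep 60', timeout: 5 });
runBashForwardingTimeout(recordingEnv, { command: 'sleep 60', timeout: 5 });
console.log(seenTimeouts);
```

Both calls return the same string, so the model sees no difference. Only the recording on the runtime side reveals that the first call ran without a limit, which is why this kind of bug survives casual testing.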

How To Read Flue Yourself

If you want to learn Flue deeply, do not start with the quickstart and stop there. The quickstart teaches the API. The source teaches the runtime.

Use this order:

1. Read README.md for the product contract. It tells you the intended mental model: Claude-Code-like, headless, TypeScript, runtime-agnostic.
2. Read packages/runtime/src/harness.ts. Find what init() returns. Whatever owns sessions, fs, and the sandbox is the harness.
3. Read packages/runtime/src/session.ts. This is the core. Watch how SessionHistory becomes pi-agent-core messages, how roles/models/tools are scoped per call, and how child tasks are created.
4. Read packages/runtime/src/agent.ts with sandbox.ts beside it. Tool schemas live in one file; their runtime promises cross into the sandbox in the other.
5. Read compaction.ts only after the session tree makes sense. Otherwise compaction looks like summarization. It is actually tree surgery plus failure recovery.
6. Read runtime/flue-app.ts and the CLI build plugins last. That is where “library” becomes “framework”: HTTP routes, run identity, Cloudflare/Node differences, OpenAPI, and generated entries.

READING ORDER

README
  │  product promise
  ▼
harness.ts
  │  what init() returns
  ▼
session.ts
  │  tree, roles, model scoping, task sessions
  ▼
agent.ts  +  sandbox.ts
  │  tool contracts crossing into runtime reality
  ▼
compaction.ts
  │  threshold + overflow recovery
  ▼
flue-app.ts + build plugins
  │  deployment protocol, run identity, OpenAPI
  ▼
PR history
     where the abstractions were stress-tested

The Deep-Dive Series Split

This post should stay the hub. It answers a senior-engineer question: why is Flue shaped this way, and what makes the shape durable?

The deep-dive series should answer a different question: could you safely modify this subsystem?

That means this post should not explain every line of session.ts, compaction.ts, sandbox.ts, or flue-app.ts. It should point to the chapters that do. The draft series now lives in src/content/deepDives/flue-framework/.

The split:

| Topic | Hub post job | Deep-dive chapter job |
| --- | --- | --- |
| pi-ai layering | Name the bet: own harness, rent loop | Trace the call from `Session.prompt()` into `new Agent(...)` and provider resolution |
| Session tree | Explain why a tree is sticky | Walk `SessionHistory` methods, stores, task session metadata, and deletion |
| Compaction | Explain why it is architecture | Walk threshold, overflow, cut point selection, summary entries, usage accounting |
| Sandbox/tools | Explain schema-as-promise | Walk each built-in tool and how `SessionEnv`/`SandboxApi` enforce or reject it |
| Runtime API | Explain why headless needs inspection | Walk run store, run registry, `/runs/:runId`, OpenAPI, admin routes |
| Build targets | Explain framework, not library | Walk Node vs Cloudflare generated entries and deployment constraints |

Hub-and-spoke diagram showing the Flue hub post in the center and deep-dive chapters for provider layering, session tree, compaction, sandbox tools, runtime API, and build targets. — The hub/spoke learning architecture

That hub/spoke pattern is the same shape as the harness-engineering series and the production-agents series. The hub gives judgment. The series gives mechanics.

Key Takeaways

Flue is interesting because of the seam. It rents pi-ai’s model loop and owns the harness layer around it.
The domain language is the curriculum. Harness, session, active path, tool contract, provider seam, run, registry, sandbox, and build target should mean the same thing in every Flue training.
The harness layer is the product. Sessions, tools, sandboxes, compaction, run identity, OpenAPI, and deployment targets are where production behavior accumulates.
Twelve-factor pressure explains the constructs. Config, backing services, disposability, logs, admin processes, and build/release/run are why a headless agent framework needs more than a model loop.
A session tree beats a transcript. Replay, compaction, task sessions, and deletion all become more legible when state is a tree with an active path.
Headless agents need inspection APIs. PR #130 is a framework move: run IDs, registries, public/admin OpenAPI, and SDK scaffolding replace the missing human operator.
Patches are evidence. PRs #25, #71, #102, #121, and #130 show the same lesson from different sides: features hold only when the runtime boundary holds.

Flue is young. That is good for learning. The code is still small enough to read, the boundaries are visible, and the PR history shows the design being pressure-tested in public. If your job is to build serious agents, that is the rare moment to study a framework: early enough to see the decisions, real enough that the decisions have consequences.