I/D/E · Essay

Flue Under the Hood: Why This Agent Harness Holds

Summary

A source-level tour of Flue through domain-driven design and twelve-factor architecture: the language, boundaries, and runtime constructs behind the TypeScript agent harness framework.

Senior engineers have seen enough frameworks die to mistrust the launch sentence. The interesting question is not “what does this framework do?” The interesting question is: what bet does it make that will still make sense after the API churn?

Flue calls itself “The Agent Harness Framework.” Its README says it is like Claude Code, but headless and programmable: no TUI, no GUI, just TypeScript. That sentence is a good hook, but it is not the reason Flue is interesting.

The reason is the boundary it chooses:

Flue owns the harness layer and rents the model loop.

That is the design decision that makes the framework worth studying. Flue does not try to become another model SDK. It lets @mariozechner/pi-ai and @mariozechner/pi-agent-core carry provider metadata, streaming semantics, tool-call execution, model catalogs, and the lower-level agent loop. Flue builds above that: sessions, skills, roles, sandboxes, compaction, run records, deployment targets, and the public HTTP surface.

That split is what turns “Claude Code, but headless” from a slogan into an architecture.

Flue owns the harness layer and rents the model loop

The model loop is the rented backend. The sticky framework surface is the harness control plane around it.

The loop is only one part of the framework

The model loop decides the next step. The harness owns the durable runtime behavior around that loop.

The Snapshot

I checked the repo through the GitHub API on 2026-05-16. The current main head I used is dbaa9eff, the merge commit for PR #130, “Add run registry, admin API, and OpenAPI specs.” The runtime package is @flue/runtime at version 0.5.3; @flue/sdk remains as a separate client/migration surface. The README still warns that Flue is experimental and APIs may change.

So treat every file path here as a versioned observation, not a permanent contract. The learning goal is the method: read the runtime boundary, then reason about features from that boundary.

The files that matter for the mental model:

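FLUE RUNTIME MAP

.flue/agents/*.ts
  handler(ctx) calls ctx.init(...)
        |
        v
+----------------------------------------------------------+
| Harness                                                  |
| packages/runtime/src/harness.ts                          |
| owns sessions, fs, env, open child sessions              |
+----------------------------------------------------------+
        | session()
        v
+----------------------------------------------------------+
| Session                                                  |
| packages/runtime/src/session.ts                          |
| owns SessionHistory, prompt(), skill(), task(), shell()  |
+----------------------------------------------------------+
        | wraps              | uses
        v                    v
+---------------+   +-------------------------------+
| pi-agent-core |   | compaction.ts                 |
| Agent         |   | threshold + overflow recovery |
+---------------+   +-------------------------------+
        |
        v
+----------------------------------------------------------+
| agent.ts + sandbox.ts                                    |
| built-in tools, task tool, SessionEnv, SandboxApi        |
+----------------------------------------------------------+

deploy surface:

+----------------------------------------------------------+
| cli build plugins + runtime/flue-app.ts                  |
| Node, Cloudflare, run registry, /runs/:runId, OpenAPI    |
+----------------------------------------------------------+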

That is the article in one diagram. Your agent file is thin. The harness is where the product lives.

The Ubiquitous Language

The best way to teach Flue is not as “a pile of TypeScript files.” It is a domain model for headless agent execution. Domain-driven design gives us the right move: define the language first, then read the code through that language.

Here is the vocabulary I would use across every Flue training, blog post, diagram, and deep dive:

| Term | Meaning in the Flue domain | Why it matters |
| --- | --- | --- |
| Agent file | User-authored TypeScript handler under .flue/agents. | It declares the job, but should not own runtime mechanics. |
| Invocation | A request to run an agent with an identity, input payload, and runtime context. | It is the external command entering the domain. |
| Harness | The aggregate root for a run-capable agent environment: sessions, filesystem, env, sandbox, and child sessions. | It is where Flue owns behavior instead of delegating to the model SDK. |
| Session | A scoped execution conversation with tools, roles, model settings, and persisted history. | It is the unit of agent work. |
| Session history | The tree of entries behind a session, with an active path and leaf. | It makes replay, compaction, task sessions, and deletion explainable. |
| Active path | The path from root to current leaf that becomes the model-visible conversation. | It separates stored history from runnable context. |
| Tool contract | A schema plus runtime behavior that the model can rely on. | It is not documentation; it is a promise the harness must keep. |
| Session environment | The capability boundary exposed to tools: exec, file operations, sandbox behavior. | It is where agent intent meets runtime reality. |
| Provider seam | The boundary where Flue resolves models, provider settings, API keys, and payload overrides before pi-ai executes. | It lets Flue rent the model loop without surrendering harness semantics. |
| Compaction | A state transition that replaces old active-path context with a summary entry. | It is context management, replay safety, and failure recovery. |
| Run | A concrete execution record produced by an invocation. | It gives headless work an inspectable identity. |
| Run registry | The lookup surface that maps run IDs back to agent instances and stores. | It replaces the missing human watching a terminal. |
| Build target | The generated runtime shape for Node, Cloudflare, CI, or another host. | It makes deployment part of the framework contract. |

The language is the domain model

If a training uses these words consistently, the source code becomes much easier to read.

FLUE CONTEXT MAP

Agent Definition Context
  .flue/agents/*.ts
      | declares intent
      v
Harness Context
  init(), sessions, fs, env, child sessions
      | opens work scope
      v
Session Context
  SessionHistory, active path, roles, compaction
      | asks for model/tool steps
      |
      +--> Provider Context
      |      pi-ai, pi-agent-core, provider payloads
      |
      +--> Execution Context
      |      tools, SessionEnv, SandboxApi
      |
      +--> Observation Context
             run store, run registry, events, OpenAPI

Deployment Context wraps the whole system:
  Node, Cloudflare, CI, generated entries, bindings, config

This language matters because naming drift creates bad training. If “session” sometimes means transcript, sometimes run, and sometimes HTTP request, the reader cannot build a mental model. In Flue, those are different domain objects. A session owns conversation state. A run owns execution identity. A provider owns model transport. A sandbox owns runtime capabilities. The framework holds because those words do not collapse into one another.

The Bet: Own The Harness, Rent The Loop

Open packages/runtime/src/session.ts and the layering is explicit:

import { Agent } from '@mariozechner/pi-agent-core';
import type { Model, UserMessage, AssistantMessage } from '@mariozechner/pi-ai';

Later, Session constructs the lower-level Agent:

this.harness = new Agent({
  initialState: {
    systemPrompt,
    model: this.config.model,
    tools,
    messages: previousMessages,
    thinkingLevel: this.config.thinkingLevel ?? 'medium',
  },
  getApiKey: provider => this.getProviderApiKey(provider),
  onPayload: (payload, model) => this.applyProviderPayloadOverrides(payload, model),
  toolExecution: 'parallel',
  sessionId: options.affinityKey,
});

That code is the seam. Flue feeds pi-agent-core the current system prompt, model, tool set, message history, provider key resolver, and payload override hook. Then pi-agent-core runs the model loop.

What does Flue own around that loop?

| Layer | Owned by Flue | Rented from pi-ai / pi-agent-core |
| --- | --- | --- |
| Model catalog | resolveModel, provider registration, Cloudflare binding attachment | Provider/model shape, stream payload semantics |
| Conversation state | SessionHistory, stores, active path, compaction entries | Message/content types consumed by providers |
| Tool surface | built-in read/write/edit/bash/grep/glob/task, custom tools, connector tools | Agent tool execution lifecycle |
| Runtime boundary | SessionEnv, SandboxApi, Node/Cloudflare/local sandbox adapters | The model deciding when to call a tool |
| Deployment surface | CLI build plugins, flue(), run store, run registry, OpenAPI | Nothing; this is Flue’s framework layer |

Provider settings cross the seam; ownership does not

Flue can set transport options such as provider retention while keeping session memory as the harness-owned record.

This is the same build-vs-buy answer I argued for in the harness-engineering series: own the interfaces that encode your system’s semantics; rent the replaceable backends. Models change. Provider APIs change. But sessions, tools, sandboxes, compaction, and run history are the control plane. That is where product behavior accumulates.

The best example is the OpenAI Responses store flag. Flue can expose a provider setting like storeResponses without pretending that hosted provider memory is the same thing as Flue session memory. The payload hook is deliberately at the pi-ai seam:

if (settings?.storeResponses === true) {
  return { ...(payload as Record<string, unknown>), store: true };
}

That is small code, but it teaches the boundary: provider retention is a transport setting; Flue’s session tree remains the harness record.
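To make the excerpt runnable on its own, here is a minimal sketch of the same idea. The `ProviderSettings` interface, the `applyPayloadOverrides` function name, and the sample payload are assumptions for illustration; only the `storeResponses` flag and the `store: true` override come from the excerpt above.

```typescript
// Hypothetical settings shape; only storeResponses is taken from the article.
interface ProviderSettings {
  storeResponses?: boolean;
}

// Returns the payload to send to the provider.
// Nothing here reads or writes the harness-owned session record.
function applyPayloadOverrides(
  payload: Record<string, unknown>,
  settings?: ProviderSettings,
): Record<string, unknown> {
  if (settings?.storeResponses === true) {
    // Copy instead of mutating, so the caller's payload stays unchanged.
    return { ...payload, store: true };
  }
  return payload;
}

// Illustrative payload; the field names are placeholders.
const original: Record<string, unknown> = { model: 'example-model', input: 'hello' };
const stored = applyPayloadOverrides(original, { storeResponses: true });
const untouched = applyPayloadOverrides(original);
```

The override is a pure function of the outgoing payload, which is one way to keep a transport setting from leaking into session state.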

Flue vs. A Claude-Code-Style Harness

Flue invites the comparison itself: “like Claude Code, but 100% headless and programmable.” The useful way to read that sentence is not “Flue is Claude Code.” It is: Flue extracted the harness pattern and made different product choices.

CLAUDE-CODE-STYLE HARNESS VS FLUE

Shared harness primitives

  Markdown context   AGENTS.md, skills, roles, instructions
  Filesystem         read/write/edit/shell as agent affordance
  Delegation         task/subagent shape, child context
  Tool loop          model chooses actions, runtime executes

Product split

  Claude-Code-style harness    Flue
  --------------------------   --------------------------------
  human operator in the loop   app/server/CI invokes it
  terminal-first workflow      TypeScript-first runtime
  local worktree default       Node, Cloudflare, CI, connectors
  conversation is the product  agent endpoint is the product
  inspect by watching          inspect through run APIs/logs

The difference matters. Interactive coding agents can rely on a human operator as part of the runtime. Flue cannot. A support agent on Cloudflare, a CI triage agent, or a remote coding agent invoked over HTTP needs a different surface:

  • A stable request route: /agents/:name/:id
  • A stable run identity: /runs/:runId
  • A run registry so the caller does not need to know the owning agent and instance for later lookup
  • A public OpenAPI spec and a mountable admin API
  • Sandboxes that can be virtual, local, Cloudflare-backed, or remote

Headless means inspection has to become an API

Interactive tools can rely on a person watching the transcript. Flue has to make run lookup, events, and streams first-class.

That is why PR #130 matters. It is not just “an admin API.” It is Flue continuing the product split: headless agents need a runtime inspection plane because there is no TUI operator staring at the transcript.
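The lookup requirement can be sketched in a few lines. This is not Flue's run registry: `RunLocation`, `RunRecord`, `RunRegistry`, and the sample IDs are hypothetical names, and the in-memory maps stand in for whatever stores the real runtime uses. The sketch only shows the property the article describes, that a run ID alone is enough to find the record later.

```typescript
// Hypothetical shapes for illustration only.
interface RunLocation {
  agentName: string;
  instanceId: string;
}

interface RunRecord {
  runId: string;
  status: 'running' | 'completed' | 'failed';
  events: string[];
}

class RunRegistry {
  // runId -> which agent instance owns the run
  private locations = new Map<string, RunLocation>();
  // "agentName/instanceId" -> that instance's own run store
  private stores = new Map<string, Map<string, RunRecord>>();

  private static key(loc: RunLocation): string {
    return `${loc.agentName}/${loc.instanceId}`;
  }

  startRun(runId: string, loc: RunLocation): RunRecord {
    const record: RunRecord = { runId, status: 'running', events: [] };
    const key = RunRegistry.key(loc);
    const store = this.stores.get(key) ?? new Map<string, RunRecord>();
    store.set(runId, record);
    this.stores.set(key, store);
    this.locations.set(runId, loc);
    return record;
  }

  // What a GET /runs/:runId handler would need: the caller supplies only the run ID.
  getRun(runId: string): RunRecord | undefined {
    const loc = this.locations.get(runId);
    if (!loc) return undefined;
    return this.stores.get(RunRegistry.key(loc))?.get(runId);
  }
}

const registry = new RunRegistry();
const run = registry.startRun('run_123', { agentName: 'triage', instanceId: 'ci-42' });
run.events.push('tool:bash started');
run.status = 'completed';

const found = registry.getRun('run_123');   // resolved without naming the agent
const missing = registry.getRun('run_999'); // unknown IDs resolve to undefined
```

The indirection is the point: the caller that started the run and the caller that inspects it later do not have to share knowledge of the owning agent or instance.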

Three Decisions That Make It Stick

The Flue codebase is young, but the durable design is already visible. Three choices are load-bearing.

1. A session is a tree, not a transcript

In SessionHistory, the current state is the active path to a leaf:

getLeafId(): string | null
getActivePath(): SessionEntry[]
getActivePathSince(afterLeafId: string | null): SessionEntry[]

That sounds like implementation detail until you ask what replay, compaction, and child tasks can safely do. A transcript is append-only. A tree gives you branches, active paths, compaction entries, task session metadata, and deletion of subtrees.

The user-facing lesson is simple: Flue does not merely store messages. It stores a navigable execution history. That is the difference between a chat wrapper and a harness.
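A small sketch shows why the tree shape matters. The method names `getLeafId` and `getActivePath` mirror the signatures quoted above, but everything else is an assumption: the `Entry` shape, the `TreeHistory` class, `append`, `setLeaf`, and `size` are hypothetical and far simpler than Flue's SessionHistory.

```typescript
// Hypothetical, simplified entry shape.
interface Entry {
  id: string;
  parentId: string | null;
  text: string;
}

class TreeHistory {
  private entries = new Map<string, Entry>();
  private leafId: string | null = null;

  // Append a child of the current leaf and make it the new leaf.
  append(id: string, text: string): void {
    this.entries.set(id, { id, parentId: this.leafId, text });
    this.leafId = id;
  }

  // Move the leaf to an earlier entry; later appends start a new branch.
  // Entries on the abandoned branch stay stored but leave the active path.
  setLeaf(id: string | null): void {
    if (id !== null && !this.entries.has(id)) {
      throw new Error(`unknown entry: ${id}`);
    }
    this.leafId = id;
  }

  getLeafId(): string | null {
    return this.leafId;
  }

  // Walk parent links from the leaf to the root, then reverse into root-first order.
  getActivePath(): Entry[] {
    const path: Entry[] = [];
    let cursor = this.leafId;
    while (cursor !== null) {
      const entry = this.entries.get(cursor);
      if (!entry) break;
      path.push(entry);
      cursor = entry.parentId;
    }
    return path.reverse();
  }

  size(): number {
    return this.entries.size;
  }
}

const history = new TreeHistory();
history.append('a', 'user: fix the build');
history.append('b', 'assistant: running tests');
history.append('c', 'assistant: (failed step)');
history.setLeaf('b'); // step back from the failed leaf
history.append('d', 'assistant: retry with a smaller change');

const activeIds = history.getActivePath().map((e) => e.id);
```

After the retry, the stored history holds four entries while the active path holds three. Stored history and model-visible context are different things, which is the separation the tree buys.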

Session tree plus compaction recovery

A flat transcript can only grow. A tree lets Flue remove the unsafe tail, append a summary node, and retry from a safe active path.

When the deep-dive series exists, this section should point to chapter 1: session.ts, the active path, and replay safety.

2. Compaction is an algorithm, not a prompt trick

packages/runtime/src/compaction.ts splits the work into named stages:

deriveCompactionDefaults(...)
calculateContextTokens(...)
shouldCompact(...)
prepareCompaction(...)
compact(...)

The session calls it in two modes:

  • threshold compaction: triggered when context approaches contextWindow - reserve
  • overflow recovery: when the provider reports a context overflow, Flue removes the failed assistant leaf, compacts, and retries

Compaction is a runtime state transition

The important construct is not summarization alone. It is measuring, cutting, appending a summary entry, and recovering from overflow.

That second path is the important one. It means compaction is not just a cost optimization. It is failure recovery. The session tree makes that recovery possible because Flue can remove the failed leaf and re-derive the active context.
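Both modes fit in a short sketch. None of this is Flue's compaction.ts: the settings shape, the `ModelResult` union, `promptWithRecovery`, the fake model, and every number are assumptions, and the sketch is synchronous to stay small where a real loop would be async. It reuses the name `shouldCompact` from the list above, but the body is my own guess at the threshold rule the article states.

```typescript
// Hypothetical settings; field names are illustrative.
interface CompactionSettings {
  contextWindow: number; // total tokens the model accepts
  reserveTokens: number; // headroom kept free for the next response
}

// Threshold mode: compact once usage crosses contextWindow - reserve.
function shouldCompact(contextTokens: number, s: CompactionSettings): boolean {
  return contextTokens > s.contextWindow - s.reserveTokens;
}

type ModelResult =
  | { ok: true; reply: string }
  | { ok: false; reason: 'context_overflow' };

// Overflow mode: on a provider overflow, compact the older entries and retry once.
function promptWithRecovery(
  path: string[],
  callModel: (path: string[]) => ModelResult,
  summarize: (old: string[]) => string,
  keepRecent: number,
): { path: string[]; reply: string } {
  const first = callModel(path);
  if (first.ok) return { path: [...path, first.reply], reply: first.reply };

  // The failed attempt is never appended, which stands in for removing the failed leaf.
  const cut = Math.max(0, path.length - keepRecent);
  const compacted = [summarize(path.slice(0, cut)), ...path.slice(cut)];
  const second = callModel(compacted);
  if (!second.ok) throw new Error('still over the context window after compaction');
  return { path: [...compacted, second.reply], reply: second.reply };
}

// A fake model that overflows when it sees more than three entries.
const fakeModel = (p: string[]): ModelResult =>
  p.length > 3 ? { ok: false, reason: 'context_overflow' } : { ok: true, reply: 'done' };

const recovered = promptWithRecovery(
  ['u1', 'a1', 'u2', 'a2', 'u3'],
  fakeModel,
  (old) => `summary of ${old.length} entries`,
  2,
);

const overThreshold = shouldCompact(190000, { contextWindow: 200000, reserveTokens: 16384 });
```

In the usage example the first call overflows, the three oldest entries collapse into one summary entry, and the retry succeeds on the shorter path.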

When the deep-dive series exists, this section should point to chapter 3: compaction.ts, cut points, summarization cost, and overflow retry.

3. Deployment is part of the framework, not an example folder

Flue’s runtime is not only prompt(). It has build plugins, generated entry points, default Hono apps, Cloudflare Durable Objects, run stores, run registries, and OpenAPI documents. On current main, packages/runtime/src/runtime/flue-app.ts mounts:

GET  /openapi.json
POST /agents/:name/:id
GET  /runs/:runId
GET  /runs/:runId/events
GET  /runs/:runId/stream

One agent source, multiple runtime shapes

Deployment is part of the framework because the host changes what adapters, stores, and bindings have to exist.

This is what “framework” means. The framework is allowed to care about where the agent runs, how you inspect it later, and what stable protocol callers use. That is not incidental. It is the product.

When the deep-dive series exists, this section should point to chapter 5: the run store, run registry, public API, admin API, and reconnectable run streams.

The Twelve-Factor Reading

The original Twelve-Factor App was written for software-as-a-service applications, not agent harnesses. Still, the lens is useful if we translate it carefully. Flue is not just modeling an agent loop. It is modeling a deployable, inspectable, configurable runtime.

| Factor | Flue training narration | Construct to study |
| --- | --- | --- |
| Codebase | One agent source should build into multiple runtime shapes. | Agent files, runtime package, CLI package, generated entries |
| Dependencies | Rented layers stay explicit: pi-ai and pi-agent-core are dependencies, not hidden copies. | @flue/runtime, @flue/cli, package boundaries |
| Config | Provider keys, bindings, target choices, and model settings belong at runtime boundaries. | provider registration, getApiKey, Cloudflare bindings, app composition |
| Backing services | Models, stores, sandboxes, registries, and remote connectors are attached resources. | provider seam, stores, SandboxApi, run registry |
| Build, release, run | Building an agent is not the same thing as invoking an agent. | CLI build plugins, generated Node/Cloudflare entries, runtime app |
| Processes | A running handler should not be the only memory of the work. | session stores, run stores, run IDs, active paths |
| Port binding | Headless agents need a service surface, not a terminal transcript. | Hono app, /agents/:name/:id, /runs/:runId, OpenAPI |
| Concurrency | Parallel tool execution and multiple runs must preserve scoped state. | toolExecution: 'parallel', sessions, run identity |
| Disposability | Long-running agents need timeout, retry, and recovery semantics. | bash timeout propagation, stream handling, overflow compaction |
| Dev/prod parity | Local, CI, Node, and Cloudflare should teach the same harness concepts. | build targets, adapters, local and remote sandbox modes |
| Logs | A headless run needs event streams and replayable history. | run events, stream routes, run store, run registry |
| Admin processes | Inspection and maintenance should be first-class runtime operations. | admin API, one-off runs, registry lookup, generated SDK surface |

Twelve-factor pressure explains the framework surface

The more headless the agent becomes, the more the harness has to behave like a cloud application.

This is the richer narration: Flue’s constructs are not accidental. They are the agent-harness version of cloud-native pressure. Config has to move out of the code path. Services have to be attachable. Runs need durable identity. Logs have to become event streams. Admin inspection cannot be an afterthought. The more headless the agent becomes, the more twelve-factor the harness has to feel.

The Patches Were Evidence, Not The Spine

The previous version of this draft centered my three merged patches. That was honest, but structurally wrong. The patches are not the story. The story is the boundary. The patches are evidence that the boundary is real.

| PR | What it fixed | Boundary it exposed |
| --- | --- | --- |
| #25 | The built-in bash tool advertised timeout but dropped it before SessionEnv.exec. | A tool schema is a promise to the model. If the runtime ignores it, the model cannot know the promise was broken. |
| #71 | Long flue run sessions hit Node/undici’s 300s idle timeout path. | “Long-running agent” is not one feature. It is every timeout between caller, server, stream, and tool. |
| #102 | flue.config.ts added project config and model registration; later design moved provider registration toward runtime app code. | The lifecycle phase where config runs is part of the contract. Build-time and runtime secrets are different worlds. |
| #121 | Split @flue/runtime from @flue/cli, leaving @flue/sdk as a migration/client surface. | Package boundaries should match ownership boundaries: runtime code separate from build/dev tooling. |
| #130 | Added run registry and public/admin OpenAPI specs. | Headless agents need an inspection API because there is no human operator watching a TUI. |

Patches reveal the boundaries under stress

The PRs are useful because each one shows where framework behavior has to hold across a runtime boundary.

That table is the source-level version of “The Agent Loop Is a Lie.” The tidy loop diagram hides the boundary work. Flue’s bugs and PRs are exactly where the tidy diagram stops being useful.
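The #25 row is the easiest one to reproduce in miniature. The sketch below is not Flue's bash tool or its SessionEnv: `ExecEnv`, `BashArgs`, both `runBash` variants, and the choice of seconds as the timeout unit are assumptions made for illustration. It shows how a schema can advertise a parameter that never reaches the environment, and what forwarding it looks like.

```typescript
// Hypothetical shapes; not Flue's real SessionEnv or tool definitions.
interface ExecOptions {
  timeoutMs?: number | undefined;
}

interface ExecResult {
  command: string;
  timeoutMs: number | undefined;
}

interface ExecEnv {
  exec(command: string, options?: ExecOptions): ExecResult;
}

// The schema the model sees: it advertises a timeout parameter.
const bashToolSchema = {
  name: 'bash',
  parameters: {
    command: { type: 'string', required: true },
    timeout: { type: 'number', required: false, description: 'seconds (assumed unit)' },
  },
} as const;

interface BashArgs {
  command: string;
  timeout?: number;
}

// Broken variant: the schema accepts timeout, then the handler drops it at the boundary.
function runBashDropsTimeout(env: ExecEnv, args: BashArgs): ExecResult {
  return env.exec(args.command);
}

// Fixed variant: every advertised parameter is forwarded to the environment.
function runBash(env: ExecEnv, args: BashArgs): ExecResult {
  return env.exec(args.command, {
    timeoutMs: args.timeout === undefined ? undefined : args.timeout * 1000,
  });
}

// A recording environment makes the broken promise observable.
const recordingEnv: ExecEnv = {
  exec: (command, options) => ({ command, timeoutMs: options?.timeoutMs }),
};

const dropped = runBashDropsTimeout(recordingEnv, { command: 'npm test', timeout: 30 });
const kept = runBash(recordingEnv, { command: 'npm test', timeout: 30 });
```

Both variants typecheck and both run the command, so the broken one fails silently. That is why this class of bug tends to be found at the environment boundary and not in the tool definition.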

How To Read Flue Yourself

If you want to learn Flue deeply, do not start with the quickstart and stop there. The quickstart teaches the API. The source teaches the runtime.

Use this order:

  1. Read README.md for the product contract. It tells you the intended mental model: Claude-Code-like, headless, TypeScript, runtime-agnostic.
  2. Read packages/runtime/src/harness.ts. Find what init() returns. Whatever owns sessions, fs, and the sandbox is the harness.
  3. Read packages/runtime/src/session.ts. This is the core. Watch how SessionHistory becomes pi-agent-core messages, how roles/models/tools are scoped per call, and how child tasks are created.
  4. Read packages/runtime/src/agent.ts with sandbox.ts beside it. Tool schemas live in one file; their runtime promises cross into the sandbox in the other.
  5. Read compaction.ts only after the session tree makes sense. Otherwise compaction looks like summarization. It is actually tree surgery plus failure recovery.
  6. Read runtime/flue-app.ts and the CLI build plugins last. That is where “library” becomes “framework”: HTTP routes, run identity, Cloudflare/Node differences, OpenAPI, and generated entries.

READING ORDER

README
  |  product promise
  v
harness.ts
  |  what init() returns
  v
session.ts
  |  tree, roles, model scoping, task sessions
  v
agent.ts + sandbox.ts
  |  tool contracts crossing into runtime reality
  v
compaction.ts
  |  threshold + overflow recovery
  v
flue-app.ts + build plugins
  |  deployment protocol, run identity, OpenAPI
  v
PR history
     where the abstractions were stress-tested

The Deep-Dive Series Split

This post should stay the hub. It answers a senior-engineer question: why is Flue shaped this way, and what makes the shape durable?

The deep-dive series should answer a different question: could you safely modify this subsystem?

That means this post should not explain every line of session.ts, compaction.ts, sandbox.ts, or flue-app.ts. It should point to the chapters that do. The draft series now lives in src/content/deepDives/flue-framework/.

The split:

| Topic | Hub post job | Deep-dive chapter job |
| --- | --- | --- |
| pi-ai layering | Name the bet: own harness, rent loop | Trace the call from Session.prompt() into new Agent(...) and provider resolution |
| Session tree | Explain why a tree is sticky | Walk SessionHistory methods, stores, task session metadata, and deletion |
| Compaction | Explain why it is architecture | Walk threshold, overflow, cut point selection, summary entries, usage accounting |
| Sandbox/tools | Explain schema-as-promise | Walk each built-in tool and how SessionEnv/SandboxApi enforce or reject it |
| Runtime API | Explain why headless needs inspection | Walk run store, run registry, /runs/:runId, OpenAPI, admin routes |
| Build targets | Explain framework, not library | Walk Node vs Cloudflare generated entries and deployment constraints |

The hub/spoke learning architecture

The hub gives judgment. Each deep dive should earn one diagram, one source trace, and one failure mode.

That hub/spoke pattern is the same shape as the harness-engineering series and the production-agents series. The hub gives judgment. The series gives mechanics.

Key Takeaways

  • Flue is interesting because of the seam. It rents pi-ai’s model loop and owns the harness layer around it.
  • The domain language is the curriculum. Harness, session, active path, tool contract, provider seam, run, registry, sandbox, and build target should mean the same thing in every Flue training.
  • The harness layer is the product. Sessions, tools, sandboxes, compaction, run identity, OpenAPI, and deployment targets are where production behavior accumulates.
  • Twelve-factor pressure explains the constructs. Config, backing services, disposability, logs, admin processes, and build/release/run are why a headless agent framework needs more than a model loop.
  • A session tree beats a transcript. Replay, compaction, task sessions, and deletion all become more legible when state is a tree with an active path.
  • Headless agents need inspection APIs. PR #130 is a framework move: run IDs, registries, public/admin OpenAPI, and SDK scaffolding replace the missing human operator.
  • Patches are evidence. PRs #25, #71, #102, #121, and #130 show the same lesson from different sides: features hold only when the runtime boundary holds.

Flue is young. That is good for learning. The code is still small enough to read, the boundaries are visible, and the PR history shows the design being pressure-tested in public. If your job is to build serious agents, that is the rare moment to study a framework: early enough to see the decisions, real enough that the decisions have consequences.

Sources