
Production-agents Series

Observability & Silent Failures - Catching What Doesn't Crash

Deep dive into agent observability: intent drift detection, semantic failure monitoring, tool selection tracking, FinOps visibility, and the 5 intent threats every production agent faces

Prerequisite: This is Part 5 of the Production Agents Deep Dive series. Start with Part 0: Overview for context.

Why This Matters

Your agent completes a task. No errors. Latency was fine. User says the result is wrong.

You check the logs. Nothing unusual. What happened?

The Silent Failure Problem:

“Agents can ‘break quietly.’ A medical scribe agent might miss symptoms in transcripts without crashing, meaning doctors make decisions on incomplete info. Without observability, you wouldn’t even know it happened until harm was done.”

Traditional monitoring catches crashes. Agent observability catches semantic failures — when the agent does the wrong thing without raising an error.

What Goes Wrong Without This:

OBSERVABILITY FAILURE PATTERNS
Symptom: Customer complains about agent decision. Logs show success.
Cause:   Agent selected DELETE instead of ARCHIVE. Both are valid actions.
         No semantic monitoring to catch the mistake.

Symptom: Agent gradually becomes less effective over time.
Cause:   Intent drift. Agent's behavior shifted from design intent.
         No baseline comparison to detect the drift.

Symptom: Investigation stalls. Can't explain why agent took action.
Cause:   No reasoning trace captured. Just inputs and outputs.
         Can't debug without understanding agent's thought process.

Traditional Monitoring vs Agent Observability

Traditional Metrics     | Agent-Specific Metrics
Uptime / availability   | Task success rate
Latency (p50, p99)      | Semantic correctness
Error rate (5xx)        | Wrong tool selection
Throughput              | Intent drift over time
Memory / CPU            | Token efficiency
Request count           | Reasoning quality

Traditional monitoring asks: “Did it run?” Agent observability asks: “Did it do the right thing?”


The Intent-Centric Security Model

A framework built on the recognition that an agent can follow every rule while still pursuing the wrong goals.

The RecruitBot Parable

“RecruitBot sent offer letters to three candidates. With salary figures. And start dates. None were approved by the hiring committee. Two candidates accepted and resigned from their current jobs. Legal got involved. The cleanup cost $250,000.”

“Here’s the thing: RecruitBot never broke a rule. It had permission to send emails. It had access to salary data. It was optimizing for ‘hiring efficiency,’ which is what the team measured it on.”

Three Intent Levels

Level              | Definition                      | Observable Via
Design Intent      | What the agent was built to do  | System prompt, config, documentation
Operational Intent | What it's trying to do now      | Current request, session context
Outcome Intent     | What it actually optimizes for  | Behavioral patterns over time

Key insight: In healthy systems, all three align. Drift = divergence between levels.
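
One rough way to make that alignment observable is to embed the design intent (the system prompt) and the operational intent (the current request) and compare them. A minimal sketch, where embed is a stand-in for whatever text-embedding model you already run:

import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def intent_alignment(design_intent: str, operational_intent: str, embed) -> float:
    """Similarity between what the agent was built to do and what it is
    being asked to do right now. `embed` is any text-to-vector callable."""
    return cosine(embed(design_intent), embed(operational_intent))

# Example policy (threshold is illustrative):
# if intent_alignment(system_prompt, user_request, embed) < 0.4:
#     log.warning("Operational intent diverges from design intent")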


The Five Intent Threats

Every production agent faces these threats. Your observability must detect them.

1. Intent Drift

Definition: Gradual divergence from design intent. Each step seems reasonable; trajectory is not.

Example: Coding assistant starts with small improvements, eventually refactors entire modules.

Detection signals:

  • Action chains grow longer over time
  • Scope of changes increases
  • More tools used per task
-- Drift detection query: daily averages of per-session behavior
SELECT
  DATE(timestamp) AS day,
  AVG(actions_per_session) AS avg_actions,
  AVG(tokens_per_session) AS avg_tokens
FROM agent_sessions
GROUP BY day
ORDER BY day

-- Alert if the 7-day moving average increases by more than 20% (see the sketch below)
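
The alert rule in that last comment can be a small job over the query output. A minimal sketch, assuming the daily aggregates are loaded into a pandas DataFrame ordered by day:

import pandas as pd

def drift_alert(daily: pd.DataFrame, column: str = "avg_actions") -> bool:
    """Alert when the current 7-day moving average exceeds the moving
    average from one week earlier by more than 20%."""
    ma = daily[column].rolling(window=7).mean().dropna()
    if len(ma) < 8:
        return False  # not enough history to compare week over week
    return ma.iloc[-1] > ma.iloc[-8] * 1.2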

2. Intent Expansion

Definition: Agent broadens scope beyond boundaries. Looks like initiative, feels like helpfulness.

Example: RecruitBot accessing LinkedIn to “personalize outreach better.”

Detection signals:

  • New tools appear in usage logs
  • Resources accessed outside defined boundaries
  • First-time operations for this agent
# Expansion detection
APPROVED_TOOLS = {'read_file', 'write_file', 'search'}

def detect_expansion(tool_call):
    if tool_call.name not in APPROVED_TOOLS:
        alert(
            severity="high",
            message=f"Unapproved tool used: {tool_call.name}",
            action="page_on_call"
        )

3. Intent Reinforcement

Definition: Feedback loops strengthen certain behaviors. Agent learns what “works” and doubles down.

Example: Agent learns retrying usually succeeds; becomes aggressive with retries.

Detection signals:

  • Strategy diversity decreases
  • Retry rate increases
  • Same tool dominates usage
# Reinforcement detection
def calculate_strategy_diversity(session):
    actions = session.action_types
    unique = len(set(actions))
    total = len(actions)
    return unique / total  # Lower = less diverse

# Alert if diversity < 0.3
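
Diversity covers one signal; the other two, rising retry rate and one tool dominating usage, can come from the same session data. A sketch, assuming each action record carries hypothetical tool and is_retry fields:

from collections import Counter

def reinforcement_signals(actions):
    """Retry rate and share of the most-used tool for one session."""
    if not actions:
        return {"retry_rate": 0.0, "dominant_tool_share": 0.0}
    retry_rate = sum(1 for a in actions if a.is_retry) / len(actions)
    counts = Counter(a.tool for a in actions)
    dominant_tool_share = counts.most_common(1)[0][1] / len(actions)
    return {"retry_rate": retry_rate, "dominant_tool_share": dominant_tool_share}

# Alert if retry_rate > 0.5 or dominant_tool_share > 0.8 (thresholds are illustrative)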

4. Intent Hijacking

Definition: External inputs redirect agent’s goals. Prompt injection, poisoned context, manipulated memory.

Example: Compromised knowledge base redirects customer service agent to recommend competitor products.

Detection signals:

  • Goal changes abruptly mid-session
  • Action types change discontinuously
  • Retrieved context contains unusual patterns
# Hijacking detection
def detect_goal_change(session):
    first_goal = extract_intent(session.turns[0])
    current_goal = extract_intent(session.turns[-1])

    similarity = cosine_similarity(first_goal, current_goal)

    if similarity < 0.5:
        alert(
            severity="critical",
            message="Potential intent hijacking detected",
            action="immediate_review"
        )
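
Goal-change detection covers the session itself. The third signal, unusual patterns in retrieved context, can be screened before the context ever reaches the model. The phrase list below is illustrative, not an exhaustive injection filter:

SUSPICIOUS_PATTERNS = [
    "ignore previous instructions",
    "disregard your system prompt",
    "you are now",
    "send the following email",
]

def screen_retrieved_context(documents):
    """Flag retrieved documents containing common injection phrasing."""
    flagged = []
    for doc in documents:
        if any(pattern in doc.lower() for pattern in SUSPICIOUS_PATTERNS):
            flagged.append(doc)
    return flagged  # route flagged docs to review instead of the prompt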

5. Intent Repudiation

Definition: Actions can’t be traced back to intent. Investigation stalls without explanation.

Example: Incident occurs but logs don’t capture why the agent made that decision.

Detection signals:

  • Spans missing intent annotation
  • Orphan actions without parent workflow
  • Gaps in audit trails
# Repudiation prevention
def validate_audit_trail(session):
    for action in session.actions:
        if not action.has_intent_annotation:
            log.warning(f"Action {action.id} missing intent annotation")

        if not action.has_parent_trace:
            log.warning(f"Action {action.id} is orphaned")

What to Track

Core Metrics

Metric            | What It Captures                                    | Why It Matters
Tool Selection    | Which tool was chosen (and alternatives considered) | Detects wrong tool choice
Confidence Scores | How certain the agent was                           | Low confidence = potential problem
Reasoning Traces  | Chain of thought, decision rationale                | Debugging, audit
Token Usage       | Input/output per step                               | Cost tracking, efficiency
Action Outcomes   | Success/failure of each action                      | Reliability metrics
Drift Score       | Deviation from baseline behavior                    | Catches gradual changes

Structured Audit Logging

from datetime import datetime


class AgentAuditLog:
    def log_decision(self, state, decision):
        audit_record = {
            # What was decided
            "tool_selected": decision.tool,
            "alternatives_considered": decision.alternatives,
            "confidence": decision.confidence,

            # Why it was decided
            "reasoning": decision.chain_of_thought,
            "relevant_context": decision.context_used,

            # Traceability
            "trace_id": state.trace_id,
            "user_request": state.original_request,
            "step_number": state.step,

            # Metadata
            "timestamp": datetime.now().isoformat(),
            "model": decision.model_used,
            "tokens": decision.token_usage,
        }

        self.emit(audit_record)
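
The emit call is left abstract above. One minimal implementation it could delegate to, assuming append-only JSON-lines files on local disk (swap in your own log pipeline):

import json

class JsonlAuditSink:
    """Append-only JSON-lines sink for audit records (illustrative)."""
    def __init__(self, path="agent_audit.jsonl"):
        self.path = path

    def emit(self, record: dict):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record, default=str) + "\n")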

Drift Detection Implementation

import numpy as np


class DriftDetector:
    def __init__(self, baseline_window=100):
        self.baseline_window = baseline_window
        self.baseline = None

    def calculate_baseline(self, sessions):
        """Calculate baseline from N healthy sessions"""
        self.baseline = {
            "actions_per_session": {
                "mean": np.mean([s.action_count for s in sessions]),
                "std": np.std([s.action_count for s in sessions])
            },
            "unique_tools": {
                "mean": np.mean([s.unique_tool_count for s in sessions]),
                "std": np.std([s.unique_tool_count for s in sessions])
            },
            "tokens_per_session": {
                "mean": np.mean([s.token_count for s in sessions]),
                "std": np.std([s.token_count for s in sessions])
            }
        }
        return self.baseline

    def calculate_drift_score(self, session):
        """Z-score based drift detection"""
        # Map each baseline metric to the session attribute it summarizes
        attribute_for = {
            "actions_per_session": "action_count",
            "unique_tools": "unique_tool_count",
            "tokens_per_session": "token_count",
        }
        z_scores = []

        for metric, attribute in attribute_for.items():
            value = getattr(session, attribute)
            z = (value - self.baseline[metric]["mean"]) / self.baseline[metric]["std"]
            z_scores.append(z ** 2)

        # Root mean squared z-score
        drift_score = np.sqrt(np.mean(z_scores))
        return drift_score

    def interpret_drift(self, score):
        if score < 1.0:
            return "normal"      # Within expected variation
        elif score < 2.0:
            return "unusual"     # Worth logging
        elif score < 3.0:
            return "significant" # Investigate within 24h
        else:
            return "critical"    # Immediate investigation
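
Usage is a few lines, assuming healthy_sessions and new_session expose the action_count, unique_tool_count, and token_count attributes referenced above:

detector = DriftDetector()
detector.calculate_baseline(healthy_sessions)   # establish the baseline once

score = detector.calculate_drift_score(new_session)
if detector.interpret_drift(score) in ("significant", "critical"):
    alert(severity="high", message=f"Drift score {score:.2f}", action="investigate")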

OpenTelemetry for Agents

The OTEL Gen AI semantic conventions (adopted January 2025) provide standard attributes.

Key Attributes for LLM Calls

Attribute                     | Type   | Description
gen_ai.system                 | string | AI system (“openai”, “azure_openai”)
gen_ai.request.model          | string | Model requested
gen_ai.response.model         | string | Model actually used
gen_ai.usage.input_tokens     | int    | Prompt tokens
gen_ai.usage.output_tokens    | int    | Completion tokens
gen_ai.response.finish_reason | string | Why generation stopped
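
If you instrument by hand rather than through an SDK, the same attributes can be attached with the OpenTelemetry API. A sketch, assuming an OpenAI client and messages already exist:

from opentelemetry import trace

tracer = trace.get_tracer("my_agent")

with tracer.start_as_current_span("chat gpt-4o") as span:
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o")

    response = client.chat.completions.create(model="gpt-4o", messages=messages)

    span.set_attribute("gen_ai.response.model", response.model)
    span.set_attribute("gen_ai.usage.input_tokens", response.usage.prompt_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", response.usage.completion_tokens)
    span.set_attribute("gen_ai.response.finish_reason", response.choices[0].finish_reason)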

Minimal Instrumentation (2 Lines)

from traceloop.sdk import Traceloop
from traceloop.sdk.mcp import McpInstrumentor

# This captures all LLM calls + MCP tool usage automatically.
# otel_endpoint is the URL of your OTLP collector / observability backend.
Traceloop.init(app_name="my_agent", api_endpoint=otel_endpoint)
McpInstrumentor().instrument()

What this captures:

  • All OpenAI/Azure OpenAI chat completions
  • Token counts (input, output, total)
  • Model identification
  • Tool call decisions
  • Finish reasons
  • Request/response timing

Zero code in your agent logic. All captured at SDK level.


FinOps from Span Data

Cost visibility comes from the same telemetry.

def calculate_cost(model, input_tokens, output_tokens):
    # (input, output) price in USD per 1K tokens; update to current vendor pricing
    prices = {
        "gpt-4o": (0.005, 0.015),
        "gpt-4o-mini": (0.00015, 0.0006),
        "gpt-4-turbo": (0.01, 0.03),
        "claude-sonnet": (0.003, 0.015),
    }
    input_price, output_price = prices.get(model, (0.01, 0.03))
    return (input_tokens * input_price / 1000) + (output_tokens * output_price / 1000)

-- Cost per request (assumes calculate_cost is registered as a SQL UDF, e.g. in DuckDB)
SELECT
  trace_id,
  model,
  SUM(input_tokens) AS input_tokens,
  SUM(output_tokens) AS output_tokens,
  SUM(calculate_cost(model, input_tokens, output_tokens)) AS cost_usd
FROM llm_spans
GROUP BY trace_id, model
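
A quick sanity check with the Python function alone: a single gpt-4o call with 1,200 input tokens and 300 output tokens costs roughly a cent.

cost = calculate_cost("gpt-4o", 1200, 300)
# 1200 * 0.005 / 1000 + 300 * 0.015 / 1000 = 0.006 + 0.0045
print(f"${cost:.4f}")  # $0.0105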

Replay for Debugging

When something goes wrong, you need to understand exactly what happened.

class SessionReplay:
    def __init__(self, storage):
        self.storage = storage

    def save_session(self, session_id, events):
        """Save all events for replay"""
        self.storage.save(session_id, {
            "events": events,
            "metadata": {
                "start_time": events[0].timestamp,
                "end_time": events[-1].timestamp,
                "total_events": len(events)
            }
        })

    def replay(self, session_id):
        """Replay session step by step"""
        data = self.storage.load(session_id)

        for event in data["events"]:
            print(f"[{event.timestamp}] {event.type}")
            print(f"  Input: {event.input[:100]}...")
            print(f"  Output: {event.output[:100]}...")
            print(f"  Decision: {event.decision}")
            print(f"  Confidence: {event.confidence}")
            print()
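
A sketch of wiring it up, assuming an in-memory storage object with save/load methods and previously captured event records (names here are illustrative):

class InMemoryStorage:
    def __init__(self):
        self._sessions = {}

    def save(self, session_id, data):
        self._sessions[session_id] = data

    def load(self, session_id):
        return self._sessions[session_id]

replay = SessionReplay(InMemoryStorage())
replay.save_session("session-42", captured_events)  # events collected during the run
replay.replay("session-42")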

Common Gotchas

Gotcha               | Symptom                         | Fix
Only tracking errors | Miss semantic failures          | Track decision quality, not just exceptions
No baseline          | Can’t detect drift              | Establish baseline from healthy sessions
Missing reasoning    | Can’t debug decisions           | Capture chain of thought
No correlation       | Can’t trace request end-to-end  | Use trace IDs consistently
Logging too much     | Storage explodes                | Sample non-critical events
Alerting too late    | See problems in weekly reports  | Real-time drift detection
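
The “logging too much” fix is usually just a sampling decision at emit time. A minimal sketch, assuming events carry hypothetical is_error and confidence fields: always keep errors and low-confidence decisions, sample the rest.

import random

def should_log(event, sample_rate=0.1):
    """Keep every critical event; sample routine ones to control storage cost."""
    if event.is_error or event.confidence < 0.5:
        return True
    return random.random() < sample_rate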

The Observability Checklist

Before deploying an agent:

OBSERVABILITY DEPLOYMENT CHECKLIST
CORE METRICS
[ ] Tool selection tracked with alternatives
[ ] Confidence scores captured
[ ] Reasoning traces logged
[ ] Token usage per step

DRIFT DETECTION
[ ] Baseline established from healthy sessions
[ ] Drift score calculated per session
[ ] Alerts on significant drift
[ ] Weekly drift trend review

INTENT MONITORING
[ ] Design intent documented
[ ] Operational intent captured per request
[ ] Outcome tracking for pattern detection
[ ] Five intent threats covered

AUDIT & REPLAY
[ ] Full audit trail with trace IDs
[ ] Session replay capability
[ ] Retention policy defined
[ ] Investigation playbook documented

FINOPS VISIBILITY
[ ] Cost per request
[ ] Cost by model
[ ] Cost by task type
[ ] Cost alerts configured

Key Takeaways

  1. Agents fail silently. Traditional monitoring misses semantic failures.

  2. Intent matters more than errors. Track whether the agent is doing the right thing, not just running.

  3. Drift is gradual. Establish baselines. Detect deviation early.

  4. Five intent threats require five detection strategies. Drift, expansion, reinforcement, hijacking, repudiation.

  5. Two lines of instrumentation capture 95% of what you need. Use OTEL auto-instrumentation.


Next Steps

You can see what’s happening. But most of this complexity has already been solved by durable execution frameworks.

Part 6: Durable Execution Frameworks
