Observability When Your Team is 30 AI Agents


I introduced Sol recently as a framework for orchestrating concurrent AI coding agents. The architecture post covers the what and why. This one goes deeper on a specific problem that consumed more design time than any other: how do you observe a system where the "workers" are non-deterministic AI models that might be thinking, might be stuck, or might have silently crashed?
Traditional infrastructure monitoring assumes your workers are processes that either respond to health checks or don't. AI agents are different. They're interactive sessions that go quiet for minutes at a time, produce output in bursts, and fail in ways that look a lot like normal operation.
The core problem: stuck vs. thinking
When a human developer goes quiet for 20 minutes, you assume they're thinking. When a web server goes quiet for 20 seconds, you assume it's dead. An AI coding agent falls somewhere in between, and that ambiguity is where most naive monitoring approaches break down.
Sol's answer is output hashing. Every 3 minutes, the Sentinel (per-world health monitor) captures the last 80 lines of each agent's terminal output and computes a SHA-256 hash. If the hash hasn't changed between two consecutive patrols, the agent might be stalled.
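The check itself is tiny. A Python sketch (function names are illustrative; Sol's actual implementation isn't shown here):

```python
import hashlib

def output_fingerprint(terminal_output: str, tail_lines: int = 80) -> str:
    """SHA-256 over the last N lines of an agent's terminal output."""
    tail = "\n".join(terminal_output.splitlines()[-tail_lines:])
    return hashlib.sha256(tail.encode("utf-8")).hexdigest()

def is_possibly_stalled(previous_hash: str, current_output: str) -> bool:
    """Two identical consecutive fingerprints flag a *possible* stall."""
    return output_fingerprint(current_output) == previous_hash
```

Hashing only the tail matters: old scrollback scrolling out of view doesn't count as progress, and new output anywhere in the last 80 lines does.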
But "might be" isn't good enough to act on. A stalled hash could mean:
- The agent is genuinely stuck
- The agent is waiting for a long-running test suite
- The agent is reading a large file and hasn't produced output yet
- The terminal scrollback hasn't updated
So instead of immediately restarting the agent (expensive, potentially destructive), Sol fires a targeted AI assessment. A separate, cheap model call looks at the terminal output and returns a structured verdict: is this agent progressing, stuck, waiting, or idle? Only high-confidence "stuck" assessments trigger intervention.
This two-stage approach (cheap hash check, expensive AI assessment only when needed) keeps costs manageable. Without it, monitoring 30 agents at 3-minute intervals (20 checks per agent per hour) would burn through 600 model calls per hour just on health checks.
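Put together, the two-stage gate looks roughly like this (a Python sketch; the `assess` callable and the 0.8 confidence cutoff are illustrative stand-ins for the model call and Sol's real threshold):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    state: str         # "progressing" | "stuck" | "waiting" | "idle"
    confidence: float  # 0.0 - 1.0

def patrol_check(prev_hash: str, curr_hash: str,
                 assess: Callable[[], Verdict]) -> str:
    # Stage 1: cheap hash comparison -- no model call if output changed.
    if curr_hash != prev_hash:
        return "progressing"
    # Stage 2: expensive AI assessment, only for a repeated hash.
    verdict = assess()
    # Only a high-confidence "stuck" verdict triggers intervention.
    if verdict.state == "stuck" and verdict.confidence >= 0.8:
        return "intervene"
    return verdict.state
```

The structure guarantees the expensive path is reached only when the cheap path has already found something suspicious.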
Cascade failure detection
The scariest failure mode in a multi-agent system isn't a single crash. It's when an upstream problem (API rate limit, git credential expiry, network partition) kills agents in rapid succession.
Sol tracks session death timestamps. If 3 or more agents die within a 30-second window, the Prefect (sphere-level supervisor) enters degraded mode. It stops respawning agents, logs an event, and waits for operator intervention. The reasoning: if something is killing agents that fast, automatically respawning them will just generate more failures, burn through rate limits faster, and make the underlying problem harder to diagnose.
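The detection itself is a sliding window over death timestamps. A sketch, using the 3-deaths-in-30-seconds numbers from above:

```python
from collections import deque

class CascadeDetector:
    """Flag a cascade if `threshold` deaths land within `window` seconds."""

    def __init__(self, threshold: int = 3, window: float = 30.0):
        self.threshold = threshold
        self.window = window
        self.deaths: deque[float] = deque()

    def record_death(self, t: float) -> bool:
        """Record a session death at time t; return True on a cascade."""
        self.deaths.append(t)
        # Evict deaths that have aged out of the window.
        while self.deaths and t - self.deaths[0] > self.window:
            self.deaths.popleft()
        return len(self.deaths) >= self.threshold
```

A `True` return is where the Prefect would flip into degraded mode and stop respawning.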
This is an application of Sol's DEGRADE principle: subsystems going down should mean reduced capacity, not cascading collapse. A dead Sentinel means stalls go undetected, not that agents stop working. A dead Forge means merges queue up, not that work halts. Each component's failure mode is designed to leave the system in a recoverable state.
Heartbeats at every layer
Sol runs multiple long-lived processes per world: Sentinel, Forge, Broker, Ledger, Chronicle. Each writes a heartbeat file (JSON) on its own cadence:
- Forge: every 10 seconds (it's a tight merge loop)
- Sentinel: every 3 minutes (patrol interval)
- Broker: every 5 minutes (provider health probes)
- Consul: every 5 minutes (recovery patrol)
The Prefect checks these heartbeats and restarts any process that's gone stale. But there's a subtlety: a process can be alive (PID exists, tmux session running) but not progressing (heartbeat timestamp frozen). The heartbeat includes a cycle count, so the Prefect can distinguish between "process is alive and working" and "process is alive but hung."
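A sketch of that check in Python, using the `patrol_count` and `timestamp` fields from the heartbeat file (the 600-second staleness cutoff is an assumed tuning, not Sol's):

```python
from datetime import datetime, timezone

STALE_AFTER = 600  # seconds; assumed cutoff, roughly 2x the longest cadence

def heartbeat_status(hb: dict, last_seen_count: int, now: float) -> str:
    """Classify a component from its heartbeat file. A fresh timestamp alone
    isn't enough -- the cycle count has to advance between checks too."""
    ts = datetime.fromisoformat(hb["timestamp"].replace("Z", "+00:00")).timestamp()
    if now - ts > STALE_AFTER:
        return "dead"    # heartbeat file stopped updating entirely
    if hb["patrol_count"] <= last_seen_count:
        return "hung"    # alive but the cycle counter is frozen
    return "healthy"
```

The Prefect only needs to remember one integer per component (the last cycle count it saw) to make the alive-vs-hung distinction.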
{
"pid": 48291,
"patrol_count": 142,
"agents_checked": 12,
"stalled_found": 0,
"reaped": 0,
"health": "healthy",
"consecutive_failures": 0,
"timestamp": "2026-04-02T14:30:00Z"
}
Cost attribution
When you're running 30 agents concurrently, token costs add up fast. You need to know which agents are burning tokens and on what work.
Sol runs a Ledger service that acts as an OTLP (OpenTelemetry) receiver on port 4318. Agent runtimes (Claude Code, Codex) emit log events through the standard OTLP contract. The Ledger extracts token metrics (input, output, cache reads, cache creation) and writes them to per-world SQLite databases.
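OTLP/HTTP encodes logs as nested JSON (resourceLogs → scopeLogs → logRecords). A sketch of the Ledger's extraction step; the attribute keys here (`input_tokens`, `output_tokens`) are hypothetical, since the real keys depend on the agent runtime's telemetry schema:

```python
def extract_token_events(payload: dict) -> list[dict]:
    """Walk an OTLP/HTTP JSON logs payload and pull token-count attributes."""
    events = []
    for rl in payload.get("resourceLogs", []):
        for sl in rl.get("scopeLogs", []):
            for rec in sl.get("logRecords", []):
                attrs = {a["key"]: a.get("value", {})
                         for a in rec.get("attributes", [])}
                if "input_tokens" in attrs:
                    # int64 values arrive as strings in OTLP's JSON encoding.
                    events.append({k: int(attrs[k].get("intValue", 0))
                                   for k in ("input_tokens", "output_tokens")
                                   if k in attrs})
    return events
```

The attribution fields (session, agent, writ, model) would ride along as additional attributes on the same log records.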
Every token event is attributed to a specific session, agent, writ (work assignment), and model. You can answer questions like:
- How much did this feature cost across all agents that worked on it?
- Which agent model (Sonnet vs. Opus) is more cost-effective for this type of task?
- Are any agents burning excessive tokens on a stuck task?
The last question ties back to observability. An agent that's consumed 500K tokens on a task that similar agents completed in 50K tokens is probably stuck in a loop, even if its output hash is changing. Cost anomaly detection is health monitoring by another name.
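That anomaly check doesn't need anything fancy. A sketch, with an illustrative 5x-median threshold (not Sol's actual tuning):

```python
from statistics import median

def token_anomaly(agent_tokens: int, peer_tokens: list[int],
                  factor: float = 5.0) -> bool:
    """Flag an agent whose token spend is far above peers on similar tasks."""
    if not peer_tokens:
        return False  # nothing to compare against yet
    return agent_tokens > factor * median(peer_tokens)
```

Comparing against the median of peers rather than a static budget is what makes this work across task types of wildly different sizes.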
The GLASS principle: everything is inspectable
One design decision that paid off more than expected: every piece of Sol's state is inspectable with standard unix tools.
- Event feed: cat $SOL_HOME/.events.jsonl | jq
- Agent states: sqlite3 $SOL_HOME/sphere.db "SELECT name, state FROM agents"
- Token costs: sqlite3 $SOL_HOME/world/world.db "SELECT * FROM token_usage"
- Heartbeats: cat $SOL_HOME/world/forge/heartbeat.json | jq
- Agent output: tmux capture-pane -t sol-agent-1 -p
- Work assignments: ls $SOL_HOME/world/outposts/agent-1/.tether/
No custom dashboards required. No proprietary query language. When something goes wrong at 2 AM, you can debug it with sqlite3, jq, cat, and tmux. This matters more than it sounds. The fanciest observability stack is useless if the person debugging the problem can't query it under pressure.
What I'd do differently
Three things I'd change if starting over:
Structured logging from day one. Sol's early components used slog with inconsistent field names. Retroactively standardizing these was tedious. Define your log schema before writing any components, not after.
Centralized alerting. Each component currently handles its own notifications. A unified alert router (escalation rules, dedup, grouping) would be cleaner. Sol partially addresses this with the Consul's escalation system, but it was added later and the seams show.
Metrics, not just events. The JSONL event feed captures what happened, but not trends over time. Adding time-series metrics (agent throughput, merge latency, cost per hour) would enable better capacity planning. This is something I'm considering adding to the roadmap.
The meta-lesson
Building observability for AI agents taught me that the traditional monitoring pyramid (metrics, logs, traces) still applies, but the interpretation layer is fundamentally different. You can't set a static threshold for "agent response time" because there is no such thing. You can't alert on error rates because the agent's errors are expressed as natural language, not HTTP status codes.
What works instead: behavioral monitoring (is the agent's output changing?), cost anomaly detection (is it burning tokens faster than expected?), and cascade detection (are multiple agents failing simultaneously?). These are the signals that matter when your team is 30 AI agents.