AI Agent Observability: What to Track in Production

AI agent observability is the discipline of recording every tool call, every LLM invocation, and every dollar an autonomous agent spends — capturing the non-deterministic decisions, cumulative cost, and prompt-injection attempts that traditional APM never had to handle. Without it, a production fleet operates on faith, and one stuck agent can burn hours of compute before anyone notices the bill.

What is AI agent observability?

AI agent observability is the discipline of capturing structured telemetry from autonomous AI agents — execution traces, token spend, prompt versions, tool-call audits, and security events — so engineers can debug, govern, and optimize agents running in production.

TL;DR

  • Agent observability is harder than app observability because the agent's control flow is decided by an LLM at runtime, not by handwritten code.
  • Four signals matter: end-to-end traces, per-agent cost, prompt and model versioning, and security-event capture (prompt-injection attempts, ACL denies, budget cutoffs).
  • Most agent frameworks ship with no built-in observability — teams bolt on LangSmith, Langfuse, or Arize Phoenix and discover the gaps after their first production incident.
  • OpenLegion's mesh dashboard records every tool call, LLM request, cost line, and security event by default — no instrumentation code, no third-party agent integration.
  • Cost observability surfaces the budget you didn't know you were spending: without per-agent caps, one stuck agent can burn hundreds of dollars in API calls overnight.

Why AI Agent Observability Is Different

Datadog, Honeycomb, New Relic — every traditional APM tool was built on two assumptions: code paths are deterministic, and request handlers are human-written. Autonomous agents break both, in four specific ways:

  • The control flow is generated, not coded. An agent decides at runtime whether to call a tool, retry, hand off to another agent, or give up.
  • Cost is unbounded by default. Each LLM call can chain to more calls. Without per-agent budget caps, a runaway loop is a runaway invoice.
  • The error surface is dual: standard failures (timeout, 5xx) plus LLM-specific failures (hallucinated tool name, malformed JSON, refusal, prompt-injection success).
  • Auditability is a compliance requirement, not a nice-to-have. Regulated teams need to prove what an agent did, when, with which prompt, on whose data.

The practical consequence: a standard APM dashboard tells you the agent run took 12 seconds. It doesn't tell you the agent made 47 LLM calls to get there because it hallucinated a database column name on attempt #3 and entered a retry loop.

The Four Signals You Actually Need

1. End-to-end execution traces

Every agent run modeled as a tree: parent task → tool calls → LLM round-trips → child agent handoffs. Each span carries latency, status, and inputs/outputs. OpenTelemetry's GenAI semantic conventions are the emerging standard here; tools that implement them — Langfuse, Arize Phoenix, Helicone — interoperate.
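
A minimal sketch of what this looks like with the OpenTelemetry Python SDK, assuming the (still experimental) GenAI semantic conventions for attribute names; the agent and tool names are hypothetical placeholders, not an OpenLegion API:

```python
# Sketch: one agent run recorded as a trace tree with the OpenTelemetry SDK.
# gen_ai.* attribute names follow the experimental GenAI semantic conventions;
# "research-agent" and "search_web" are hypothetical placeholders.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-runtime")

# Parent span: the agent task. Child spans: each LLM round-trip and tool call.
with tracer.start_as_current_span("agent.run") as run:
    run.set_attribute("gen_ai.operation.name", "invoke_agent")
    run.set_attribute("gen_ai.agent.name", "research-agent")

    with tracer.start_as_current_span("llm.call") as llm:
        llm.set_attribute("gen_ai.request.model", "gpt-4o")
        llm.set_attribute("gen_ai.usage.input_tokens", 812)
        llm.set_attribute("gen_ai.usage.output_tokens", 145)

    with tracer.start_as_current_span("tool.call") as tool:
        tool.set_attribute("gen_ai.tool.name", "search_web")
```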

2. Cost per agent, per task, per provider

Token counts, dollar conversions per provider, and rollups by agent, project, and team. Cost is the one signal that should hard-stop execution when a cap is hit, not just appear on a chart after the fact.
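
To make the hard-stop point concrete, here is a minimal, hypothetical ledger that converts token counts to dollars and cuts the run at a per-agent cap. The price table, `CostLedger`, and `BudgetExceeded` are illustrative assumptions, not any vendor's API:

```python
# Hypothetical per-agent cost ledger: converts token usage to dollars and
# hard-stops execution at a cap. Prices are illustrative, not current rates.
PRICES = {"gpt-4o": (2.50, 10.00)}  # (input, output) USD per 1M tokens

class BudgetExceeded(RuntimeError):
    pass

class CostLedger:
    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def record(self, model: str, tokens_in: int, tokens_out: int) -> None:
        p_in, p_out = PRICES[model]
        self.spent_usd += (tokens_in * p_in + tokens_out * p_out) / 1_000_000
        if self.spent_usd > self.cap_usd:
            # Budget cutoff: log the event, then stop the run outright.
            raise BudgetExceeded(
                f"agent spent ${self.spent_usd:.2f} of ${self.cap_usd:.2f} cap"
            )

ledger = CostLedger(cap_usd=5.00)
ledger.record("gpt-4o", tokens_in=812, tokens_out=145)  # after every LLM response
```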

3. Prompt and model versioning

When the agent regressed, was it the prompt change, the model upgrade, or upstream data drift? Without versioned prompts pinned to runs, you cannot tell. Prompt registries (LangSmith Hub, Langfuse Prompts, PromptLayer) all solve this; the runtime has to record which version each run actually used.
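
A sketch of the runtime's half of that contract, with hypothetical names: hash the exact system prompt and pin it, along with model and parameters, to the run record so a regression can be traced to a specific prompt version:

```python
# Hypothetical sketch: pin the exact prompt version to each run so regressions
# can be attributed to a prompt change vs. a model upgrade vs. data drift.
import hashlib
import json
from datetime import datetime, timezone

def run_record(system_prompt: str, prompt_version: str, model: str, params: dict) -> dict:
    return {
        "prompt_sha256": hashlib.sha256(system_prompt.encode()).hexdigest(),
        "prompt_version": prompt_version,  # e.g. a registry tag like "support-v12"
        "model": model,
        "params": params,
        "started_at": datetime.now(timezone.utc).isoformat(),
    }

record = run_record("You are a support agent...", "support-v12", "gpt-4o",
                    {"temperature": 0.2})
print(json.dumps(record, indent=2))
```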

4. Security events

Prompt-injection attempts, ACL denies, SSRF blocks, budget cutoffs, unicode sanitization hits. These are the events compliance reviewers ask about — and the events that signal an in-progress attack on your agent fleet.
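
As an illustration, a structured security event needs at least the fields below to be useful in an audit; the `SecurityEvent` shape and event-type names are hypothetical, loosely mirroring the categories above:

```python
# Hypothetical structured security event: what happened, which agent,
# and the evidence, timestamped -- the shape a compliance reviewer expects.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Literal

EventType = Literal[
    "prompt_injection", "acl_deny", "ssrf_block", "budget_cutoff", "sanitizer_hit"
]

@dataclass
class SecurityEvent:
    event_type: EventType
    agent_id: str
    detail: str
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

event = SecurityEvent("acl_deny", "agent-7", "attempted write outside /workspace")
```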

What OpenLegion Tracks by Default

| Signal | What's captured | Where to see it |
| --- | --- | --- |
| Trace | Every tool call, LLM request, agent handoff with timing | Mesh dashboard → Agent Runs |
| Cost | Tokens in/out, dollar cost per provider per agent | Dashboard → Cost panel |
| Prompts | System prompt hash, version, model, parameters per run | Per-run detail view |
| Security | ACL denies, budget cutoffs, SSRF blocks, sanitizer hits | Dashboard → Security log |
| Health | Container resource use, mesh latency, browser pool state | Dashboard → Fleet panel |

The dashboard is part of the open-source runtime — not a managed service you have to subscribe to. Self-hosted deployments keep all telemetry on your infrastructure.

Open-Source vs Managed Observability Stacks

If you are running a different agent framework, the leading bolt-on tools are LangSmith (LangChain ecosystem, managed), Langfuse (open-source, self-hostable), Arize Phoenix (open-source, evaluation-focused), and Helicone (proxy-based, simple integration). Each requires instrumentation code in your agent: wrapping LLM clients, adding callback handlers, configuring trace exporters. The integration burden scales with your fleet size.
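
To make that burden concrete, here is the kind of hand-written wrapper bolt-on instrumentation typically requires, sketched generically rather than against any one vendor's SDK; `report_span` and `llm_client` are hypothetical stand-ins:

```python
# Generic sketch of bolt-on instrumentation: every LLM client in the fleet
# gets wrapped by hand so calls are reported to the observability backend.
import time
from typing import Any, Callable

def traced(report_span: Callable[[dict], None]):
    def wrap(llm_call: Callable[..., Any]) -> Callable[..., Any]:
        def inner(*args, **kwargs):
            start = time.monotonic()
            status = "error"  # assume failure until the call returns
            try:
                result = llm_call(*args, **kwargs)
                status = "ok"
                return result
            finally:
                report_span({
                    "name": llm_call.__name__,
                    "latency_s": time.monotonic() - start,
                    "status": status,
                })
        return inner
    return wrap

# Every client in every agent needs this treatment -- the burden that
# scales with fleet size:
# llm_client.complete = traced(report_span)(llm_client.complete)
```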

OpenLegion's mesh sits in the call path of every agent operation by design — credential vault, ACL gate, cost tracker, and trace recorder are all colocated in the trusted zone. There is no instrumentation step. The trade-off: you adopt the OpenLegion runtime, not just an observability layer.

See our AI agent frameworks comparison for the full landscape, or the vs LangGraph page for a head-to-head on observability specifically.

OpenLegion's Take

Agent observability is the new APM — and the AI ecosystem is repeating every mistake APM took a decade to fix. Telemetry fragments across vendor-specific SDKs. Pricing scales with event volume so the busiest fleets pay the most to watch themselves. "Advanced" features like alerting and retention sit behind enterprise tiers. OpenLegion takes the opposite stance: the dashboard, traces, cost ledger, and security event log ship with the AI agent platform, not as an upsell. Every run records the full trace by default, you self-host the data, you own the retention, and you can export to OpenTelemetry if you want it forwarded to Datadog or Honeycomb anyway.
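
OpenLegion's own export configuration isn't detailed here; as a general sketch, forwarding traces to an existing APM with the standard OpenTelemetry Python SDK and OTLP exporter looks roughly like this, where the endpoint is a placeholder for your collector, Datadog Agent, or Honeycomb's OTLP endpoint:

```python
# General sketch: forwarding traces to an existing APM over OTLP
# (both Datadog and Honeycomb accept OTLP). Endpoint is a placeholder.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter(endpoint="https://otlp.example.com:4317")  # placeholder
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```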

Production agents need production observability — built in, not bolted on.

Frequently Asked Questions

What is AI agent observability?

AI agent observability is the structured recording of an autonomous agent's runtime behavior — tool calls, LLM invocations, prompt versions, costs, and security events — so engineers can debug failures, optimize cost, and audit decisions. It is distinct from traditional APM because the agent's control flow is decided by an LLM, not by handwritten code.

How is AI agent observability different from LLM observability?

LLM observability tracks individual model calls — prompt, response, latency, token cost. AI agent observability tracks the full execution graph an agent traverses to complete a task, which typically involves many LLM calls plus tool calls, handoffs to other agents, retries, and state mutations. LLM observability is a subset of agent observability.

Do I need a separate observability tool if I'm already on Datadog?

Datadog and similar APM tools handle latency, errors, and resource use well, but they do not natively understand LLM token costs, prompt versioning, or agent-trace semantics. Most teams pair an agent-native observability tool (Langfuse, Arize Phoenix, LangSmith) with their existing APM, or adopt a runtime like OpenLegion that ships telemetry built in and can export OpenTelemetry to whatever APM they already operate.

What should I track for AI agent cost observability?

Track token counts (input and output) per provider per agent per run, dollar cost calculated against current provider pricing, per-agent daily and monthly rollups, and budget-cutoff events when an agent is stopped for exceeding its allotment. Without per-agent budget caps, even excellent observability only tells you about a runaway after the bill arrives.

What security events should AI agent observability capture?

At minimum: prompt-injection detection, ACL denies (an agent attempted an operation outside its permission boundary), SSRF blocks, unicode and path-traversal sanitization hits, budget cutoffs, and credential-vault access logs. These are the events compliance reviewers ask about, and the events that signal an active attack against your agent fleet.

How does OpenLegion's observability compare to LangSmith?

LangSmith is a managed observability service for the LangChain ecosystem — strong tracing, evaluation, and prompt-management features. OpenLegion's dashboard ships with the runtime itself, is self-hosted by default, and records the same signals (traces, cost, prompts, security events) without requiring instrumentation in your agent code. LangSmith integrates across any framework that adopts it; OpenLegion observability works automatically inside the OpenLegion runtime.