MCP Needs an Observability Spec Before the Ecosystem Splinters

The Model Context Protocol standardized how AI agents discover and invoke tools. But it says nothing about how agents report what they did. A growing number of builders think that silence is becoming a liability — and they're sketching what a fix could look like.

Jan Schmitz | 12 min read

TL;DR: MCP gave AI agents a universal way to discover and call tools. It did not give them a universal way to report what happened afterward. No standard trace format. No shared evaluation interface. No convention for reporting token usage or cost. The result: Every observability tool in the ecosystem invented its own schema, and none of them talk to each other. A growing proposal from the community, anchored by lessons from the Iris observability server, lays out four concrete additions to the MCP spec that could close the gap before fragmentation becomes permanent.


Here’s a number that should keep anyone running AI agents in production up at night: $14,000 per month.

That was the bill for a single research agent at one organization. The agent returned correct answers. Latency sat under two seconds. Error rates hovered at 0.1%. By every metric the monitoring stack could surface, the system looked perfectly healthy. The problem was the metric nobody tracked (token cost per query) because there was no standard way to track it.

This is the predictable outcome of a protocol that poured enormous effort into standardizing what agents can do and zero effort into standardizing what agents did do. A proposal now circulating in the MCP community argues the window to fix this is measured in months, not years.

The protocol that forgot to look in the mirror

The Model Context Protocol has had a hell of a run since Anthropic open-sourced it. Over 6,400 servers on the official registry. Monthly SDK downloads past 97 million. OpenAI, Google, Microsoft, and Amazon all adopted it. MCP became the shared language between AI agents and the outside world faster than anyone expected.

The spec, as of March 2026, defines four core primitives: Tools (functions agents can invoke), Resources (data agents can read via URI), Prompts (templated instructions agents discover), and Transport (the communication layer, covering stdio, HTTP with SSE, and Streamable HTTP). These cover what agents are capable of and how they communicate.

What they don’t cover is accountability.

There’s no trace primitive. No eval primitive. No standardized metadata field for cost, token consumption, or model identity on tool call responses. The protocol is articulate about capabilities and mute about outcomes.

Ian Parent, creator of the MCP-native observability server Iris, put it bluntly in a recent proposal: “The protocol that standardized tool invocation has nothing to say about tool observation.” His post (part experience report, part draft specification) has kicked off the kind of discussion that tends to precede actual standards work.

What fragmentation actually costs you

“Fragmentation” sounds abstract until you try to do something simple. Like comparing agent traces across two teams.

Say team A uses Langfuse for MCP tracing. Their traces carry OpenTelemetry-style span conventions: traceId, spanId, parentSpanId, attribute maps. Team B built a lightweight internal tool. Their traces are flat JSON: input, output, latency_ms, a tool_calls array. Team C ships everything to a commercial platform with a proprietary event stream format tuned for their cloud backend.

All three teams work on the same product. All three generate useful observability data. None of it is compatible.

You can’t export from one and import to another. You can’t build a unified dashboard without writing three separate adapters. If you switch providers, your historical data either stays locked in the old format or you burn engineering time on a migration nobody wanted to build. And when leadership asks what the agents are costing across all teams, good luck producing an answer.

This isn’t hypothetical. Langfuse acknowledged the challenge directly by building trace-linking through MCP’s _meta field, but that’s one tool’s workaround for one slice of the interoperability puzzle. The underlying problem (no shared schema at the protocol level) remains wide open.

The parallel to early microservices observability is uncomfortable. Before OpenTelemetry converged the ecosystem, every APM vendor had its own trace format, its own agent, its own context propagation scheme. Teams spent more time wiring together observability infrastructure than actually analyzing what it produced. OTel succeeded in part because it arrived before full fragmentation set in. It laid a shared foundation while the ecosystem was still young enough to adopt one.

MCP’s observability ecosystem is sitting at that same inflection point right now.

Four proposals that could close the gap

Parent’s sketch isn’t a finished specification. He’s upfront about that. It’s a starting point for community discussion, grounded in the specific problems he ran into building Iris. But the four additions he proposes are concrete enough to evaluate, and each one targets a real friction point.

1. A standard trace schema

The spec should define a minimal trace object that any MCP-compatible observability tool can produce and consume. The proposed structure includes a top-level mcp_trace object with a version identifier, trace ID, agent name, timestamps, input/output strings, a spans array with parent-child relationships, token usage counters, cost data, and an extensible metadata field.

The span tree model borrows from OpenTelemetry’s playbook. Each span carries its own span_id, parent_span_id, tool_name, tool_server, timing data, and status. This handles the common case where agents call tools that call other tools, creating hierarchical execution chains that a flat trace format can’t represent.

The metadata fields at both the trace and span level are deliberate escape hatches. The core fields form the interoperability contract. Everything else (custom attributes, vendor-specific enrichment, experimental data) lives in metadata without breaking compatibility.

Iris learned the hard way that flat traces fall apart once tools start calling other tools. Early versions used a single-object-per-execution model, which held up fine until it didn’t. The move to span trees was a clear improvement, but it also surfaced a key tension for spec design: The standard needs to support both simple (flat) and complex (hierarchical) traces without forcing the complex case on everyone from day one.
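To make the shape of the proposal concrete, here is a minimal sketch of what such a trace object might look like, assembled as a Python dict. The field names follow the proposal's description (trace ID, agent name, spans with parent-child links, token usage, cost, extensible metadata); the exact nesting and values are illustrative, not a final schema.

```python
# Hypothetical mcp_trace object using the fields the proposal describes.
# Structure and values are illustrative; the spec discussion is ongoing.
trace = {
    "mcp_trace": {
        "version": "0.1",
        "trace_id": "tr-001",
        "agent": "research-agent",
        "started_at": "2026-03-01T12:00:00Z",
        "ended_at": "2026-03-01T12:00:02Z",
        "input": "Summarize the latest filings",
        "output": "Three filings were submitted last quarter.",
        "spans": [
            {
                "span_id": "sp-1",
                "parent_span_id": None,    # root span
                "tool_name": "search",
                "tool_server": "docs-server",
                "duration_ms": 840,
                "status": "ok",
            },
            {
                "span_id": "sp-2",
                "parent_span_id": "sp-1",  # a tool called by another tool
                "tool_name": "fetch",
                "tool_server": "docs-server",
                "duration_ms": 310,
                "status": "ok",
            },
        ],
        "token_usage": {"prompt_tokens": 1200, "completion_tokens": 450},
        "cost_usd": 0.07,
        "metadata": {},  # extensible escape hatch for vendor-specific data
    }
}

# A flat trace is just the degenerate case: one root span, no children.
root_spans = [s for s in trace["mcp_trace"]["spans"]
              if s["parent_span_id"] is None]
assert len(root_spans) == 1
```

Note how the simple/complex tension resolves itself here: a flat trace is simply a span tree of depth one, so tools that never nest calls can emit the same structure without extra machinery.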

2. A standard evaluation interface

Evaluation is where the fragmentation problem gets ugly. Every eval tool defines its own rule format, its own scoring schema, its own method for connecting scores to traces.

The proposal defines a standard mcp_eval tool interface. Not a standard evaluation method (which would be impossible to agree on), but a standard contract. Input: A trace ID, the output being evaluated, and an array of rules (each with an ID, category, and threshold). Output: An array of scores (each with a rule ID, a 0-to-1 normalized score, a boolean pass/fail, and optional detail text), plus an aggregate score.

That 0-to-1 normalized scale is a specific, hard-won design decision. Iris went through unbounded scores, percentages, and letter grades before landing on it. Normalized scores are the only format that composes cleanly across rules and categories. A safety score of 0.85 from tool A and a safety score of 0.72 from tool B might use completely different methods under the hood (heuristic regex matching versus LLM-as-judge versus a custom classifier), but with a shared schema, you can aggregate them, trend them, and set alerts without writing adapters.

The standardized categories (completeness, relevance, safety, cost, custom) give the ecosystem a common vocabulary without constraining implementations. If your eval tool detects PII with regex and mine uses a fine-tuned model, the output format is identical. The consumer of that eval data doesn’t need to know or care about the internals.
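A sketch of that contract in Python may help: the function signature and output shape follow the proposal's description (rules in, normalized scores and an aggregate out), while the scoring logic inside is a deliberately trivial stand-in for whatever method a real implementation uses.

```python
# Hypothetical mcp_eval contract: the input/output shape is the standard,
# the scoring method is not. The placeholder scoring below is illustrative.
def mcp_eval(trace_id, output, rules):
    scores = []
    for rule in rules:
        # Real implementations score however they like: regex heuristics,
        # LLM-as-judge, custom classifiers. Only the shape is shared.
        score = 0.0 if rule["category"] == "safety" and "ssn" in output.lower() else 1.0
        scores.append({
            "rule_id": rule["id"],
            "score": score,                       # normalized 0-to-1
            "passed": score >= rule["threshold"],
            "detail": None,                       # optional explanation text
        })
    aggregate = sum(s["score"] for s in scores) / len(scores)
    return {"trace_id": trace_id, "scores": scores, "aggregate": aggregate}

result = mcp_eval(
    "tr-001",
    "Three filings were submitted last quarter.",
    [
        {"id": "r-complete", "category": "completeness", "threshold": 0.7},
        {"id": "r-safety", "category": "safety", "threshold": 0.9},
    ],
)
assert 0.0 <= result["aggregate"] <= 1.0
```

Because every implementer returns this same shape, a dashboard can aggregate scores from tools with wildly different internals without a single adapter.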

3. A cost metadata field on tool responses

This is the smallest proposed change and the one with the biggest practical bite.

Right now, when an MCP tool wraps an LLM call internally (and plenty of production tools do), the token usage and cost are invisible to the calling agent and to any observability layer sitting on top. The tool returns its output. The operational cost disappears into a black box.

The proposal adds an optional _mcp_meta field to tool call responses carrying token_usage (prompt and completion tokens), cost_usd, model identifier, and latency_ms. Simple. Optional. But it changes cost visibility entirely.

Here’s why this matters at scale. Take a setup with 10 agents, each handling 100 traces per day at an average cost of $0.07 per trace, a $70-per-day baseline. If just three of those agents develop a cost regression (bad prompt engineering, oversized context windows, unoptimized retrieval), jumping to $0.35 per trace, your monthly spend goes from $2,100 to $4,620. Over a quarter, that’s $7,560 in excess spend. You won’t notice until the invoice shows up, because nothing in the protocol surfaces what individual tools cost.

Parent reports that cost aggregation was “one of the most requested capabilities” from teams using Iris. The question every engineering manager asks (what is this agent actually costing me?) can’t be answered without cooperation from tool servers. The _mcp_meta field turns cost reporting from a favor into a convention.
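Here is what that convention could look like from the consumer's side, a sketch assuming the field names the proposal describes (token_usage, cost_usd, model, latency_ms); the response values are invented for illustration.

```python
# Hypothetical tool call responses carrying the proposed optional
# _mcp_meta field. Field names follow the proposal; values are made up.
responses = [
    {
        "content": [{"type": "text", "text": "first result"}],
        "_mcp_meta": {
            "token_usage": {"prompt_tokens": 900, "completion_tokens": 300},
            "cost_usd": 0.04,
            "model": "claude-sonnet-4",
            "latency_ms": 620,
        },
    },
    {
        "content": [{"type": "text", "text": "second result"}],
        "_mcp_meta": {
            "token_usage": {"prompt_tokens": 2100, "completion_tokens": 800},
            "cost_usd": 0.11,
            "model": "claude-sonnet-4",
            "latency_ms": 1340,
        },
    },
]

# The field is optional, so consumers aggregate defensively: tools that
# don't report cost simply contribute zero rather than breaking the sum.
total_cost = sum(r.get("_mcp_meta", {}).get("cost_usd", 0.0)
                 for r in responses)
assert round(total_cost, 2) == 0.15
```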

4. Standard resource URIs for observability data

MCP resources use URIs, and this is one of the protocol’s most underused features. Iris already takes advantage of it. Agents can read iris://dashboard/summary to get a structured overview of recent traces and scores. But the URI scheme and data format are Iris-specific.

The proposal reserves a mcp-trace:// URI scheme for observability resources:

  • mcp-trace://traces/latest for recent trace data
  • mcp-trace://traces/{trace_id} for a specific trace
  • mcp-trace://dashboard/summary for an aggregated overview
  • mcp-trace://evals/{trace_id} for evaluation results for a trace
  • mcp-trace://costs/aggregate?window=24h for cost rollup over a time window

Standardized URIs mean that an agent, dashboard, or downstream tool can read mcp-trace://traces/latest from any observability server and get back a structurally identical response. Swap out your observability provider, your dashboards keep working. Self-monitoring agents (ones that read their own trace history to spot error patterns or adjust behavior) work against any backend that implements the scheme.
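A small router sketch shows how mechanical this becomes once the scheme is fixed. The URIs come from the proposal; the handler names and dispatch logic are illustrative.

```python
from urllib.parse import urlparse, parse_qs

# Illustrative dispatcher for the proposed mcp-trace:// URI scheme.
# Handler names are hypothetical; the URI patterns come from the proposal.
def route(uri):
    parsed = urlparse(uri)
    if parsed.scheme != "mcp-trace":
        raise ValueError(f"not an observability URI: {uri}")
    # urlparse puts the first path segment in netloc for custom schemes.
    segments = [parsed.netloc] + [p for p in parsed.path.split("/") if p]
    if segments == ["traces", "latest"]:
        return ("latest_traces", {})
    if segments[0] == "traces":
        return ("trace_by_id", {"trace_id": segments[1]})
    if segments == ["dashboard", "summary"]:
        return ("dashboard_summary", {})
    if segments[0] == "evals":
        return ("evals_for_trace", {"trace_id": segments[1]})
    if segments[0] == "costs":
        window = parse_qs(parsed.query).get("window", ["24h"])[0]
        return ("cost_aggregate", {"window": window})
    raise ValueError(f"unknown observability URI: {uri}")

assert route("mcp-trace://traces/tr-001") == ("trace_by_id", {"trace_id": "tr-001"})
assert route("mcp-trace://costs/aggregate?window=24h") == ("cost_aggregate", {"window": "24h"})
```

Any observability server implementing this dispatch table is interchangeable with any other, which is exactly the point.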

This is the interoperability layer that makes the other three proposals useful in practice. Without it, you have portable schemas but vendor-locked access patterns.

The elephant in the room: OpenTelemetry

Any conversation about observability standardization eventually bumps into the same question. Why not just use OpenTelemetry?

Fair question. OTel already defines trace formats, span conventions, context propagation, and has a GenAI Special Interest Group actively developing semantic conventions for AI agents and LLM systems. AG2, CrewAI, LangGraph, and others are already building OTel instrumentation. Datadog natively supports OTel GenAI semantic conventions. This is a real, functioning ecosystem.

But the answer isn’t “don’t use OpenTelemetry.” It’s that OTel and an MCP observability spec solve different layers of the same problem.

OpenTelemetry provides general-purpose distributed tracing infrastructure. It knows about spans, metrics, logs, and attribute maps. What it doesn’t know about: MCP-specific concepts like the relationship between an agent’s reasoning steps and its tool invocations, which MCP server a tool call hit, what tokens a specific tool consumed internally, or how an evaluation score maps to a trace ID.

An MCP observability spec would define the semantics: What a trace means in the context of MCP tool calls, what fields matter, what evaluation looks like. OTel would be one (very good) option for the transport and storage layer underneath. These aren’t competing standards. They’re different layers of the stack.
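The layering can be sketched in a few lines: an MCP-semantics span projected onto a flat OTel-style attribute map. The "mcp.*" attribute keys here are hypothetical, not official OTel GenAI semantic conventions; they only illustrate how MCP-specific meaning could ride on general-purpose transport.

```python
# Illustrative only: project a proposal-style MCP span onto a flat
# OTel-style attribute map. The "mcp.*" keys are invented for this
# sketch, not official OpenTelemetry semantic conventions.
def to_otel_attributes(span):
    usage = span.get("token_usage", {})
    return {
        "mcp.tool.name": span["tool_name"],
        "mcp.tool.server": span["tool_server"],
        "mcp.span.status": span["status"],
        "mcp.tokens.prompt": usage.get("prompt_tokens", 0),
        "mcp.tokens.completion": usage.get("completion_tokens", 0),
        "mcp.cost.usd": span.get("cost_usd", 0.0),
    }

attrs = to_otel_attributes({
    "tool_name": "search",
    "tool_server": "docs-server",
    "status": "ok",
    "token_usage": {"prompt_tokens": 900, "completion_tokens": 300},
    "cost_usd": 0.04,
})
assert attrs["mcp.tool.name"] == "search"
```

The MCP spec would own the left-hand side of that mapping (what the fields mean); OTel would own everything downstream of the attribute map (propagation, export, storage).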

The OTel GenAI SIG is building framework-level conventions, but those target broad AI observability: Model calls, vector database operations, agent systems in general. MCP needs something tighter. A schema that captures the protocol’s own primitives (tools, resources, prompts) and the agent behavior patterns that are unique to protocol-mediated tool use.

The gateway gap-fill (and why it’s not enough)

While the spec discussion plays out, the market hasn’t stood still. MCP gateways like Bifrost, MCPX, Maxim, and others are racing to fill the observability gap at the infrastructure layer.

These gateways sit between agents and MCP servers, offering rate limiting, authentication, and increasingly, observability features: Latency tracking, token monitoring, cost dashboards, request tracing. Bifrost’s Code Mode even claims 50%+ reduction in token usage for multi-tool orchestration by letting models generate TypeScript orchestration code instead of loading hundreds of tool schemas.

The gateway approach addresses the immediate pain. If your agents are burning money right now, you don’t have time to wait for a specification process. You need a proxy that can count tokens today.

But gateways are a band-aid on the interoperability problem, not a fix. Each gateway defines its own telemetry format. Data collected by Bifrost can’t be queried from MCPX’s dashboard. A team using Maxim’s analytics can’t pipe that data into a custom analysis pipeline built for a different gateway’s output format. You’ve traded tool-level fragmentation for gateway-level fragmentation. A marginal improvement at best.

Protocol-level standardization is what makes gateway observability composable. If every gateway emits traces in the same format, the gateway market competes on performance, reliability, and features, not on lock-in.

The ToolHive precedent

The observability gap has already caught the attention of the Kubernetes community. ToolHive, a container-native MCP management tool from Stacklok, generates metrics and traces in standard OTel and Prometheus formats, specifically because MCP servers typically ship without /metrics endpoints, structured logging, or tracing hooks.

Their rationale tells the whole story. When AI systems misbehave, teams can’t determine which MCP servers were involved. When response times spike, there’s no visibility into whether the bottleneck sits in the model, the MCP server, or the external system the server calls. Even organizations with thorough Prometheus and OTel stacks find their MCP servers remain invisible. Dark spots in an otherwise well-lit infrastructure.

ToolHive’s approach (wrapping MCP servers in containers that emit standardized telemetry) works, but it’s fundamentally an outside-in solution. It can observe network behavior, timing, and error codes. It can’t see token usage, model identity, or evaluation results, because the MCP server doesn’t expose that information. The _mcp_meta proposal would give tools like ToolHive something to actually observe.

Timing matters more than perfection

The MCP community has a window right now. The ecosystem has gone from zero to 6,400+ registered servers and 97 million monthly SDK downloads in about 18 months, but production deployments are still early enough that a specification can shape implementations rather than chase them.

That window is closing. Every new observability tool that launches defines its own schema. Every MCP gateway that ships creates another proprietary telemetry format. Every team that builds internal tracing tooling makes choices that become harder to reverse with each passing quarter.

The proposal isn’t asking for perfection. It’s asking for three things: Recognition that observability belongs in the protocol spec, not just in third-party tooling. Discussion about which primitives the spec should own versus which it should leave to implementers. And collaboration on a minimal viable observability standard that tool authors can adopt incrementally.

The field names can change. The URI scheme can evolve. The versioning strategy needs community input. But the core bet, that a minimal, extensible observability primitive at the protocol level beats the status quo of zero interoperability, is hard to argue against with a straight face.

What this means if you’re building on MCP today

If you’re deploying MCP agents in production, the observability gap affects you whether or not you follow the spec discussion. Some concrete things worth doing now:

Track cost per trace, not just per month. Don’t wait for the spec. Instrument your agents to log token usage and estimated cost on every execution. When (not if) a prompt regression inflates your per-query cost by 5x, you want to catch it on day one, not 30 days later when finance flags the invoice.

Pick observability tools that align with likely standards. Tools that use span-tree trace models, 0-to-1 normalized evaluation scores, and expose data via URIs are closer to where consensus is heading. You’ll have less migration pain when a standard lands.

Watch the _meta field. Langfuse already uses MCP’s _meta field for trace context propagation between clients and servers. This is the most likely attachment point for standardized observability metadata. If you’re building MCP tools, consider adding operational metadata to your responses via _meta now, even in a proprietary format. You’ll be ahead of the curve when the spec catches up.

Get in the room. The conversation is happening on GitHub Discussions and in the MCP Discord. If you’ve hit interoperability walls, your war stories are exactly the input that turns a sketch into a real specification.

The bet worth making

Protocols live and die on completeness. HTTP without status codes would have been a toy. SQL without EXPLAIN would have left database optimization to guesswork. TCP without acknowledgment packets would have been UDP with delusions of grandeur.

MCP without observability primitives is a protocol that tells agents how to act and gives them no standard way to account for their actions. For weekend projects and demos, that’s fine. For enterprises running agents that touch customer data, make financial decisions, and rack up five-figure monthly API bills, it’s a gap that gets more expensive with every month it stays open.

The proposal on the table isn’t radical. A trace schema. An eval interface. A cost metadata field. A set of resource URIs. Four additions, each optional, each extensible, each solving a problem that every team deploying MCP agents in production has already run into.

The real question isn’t whether MCP needs this. It’s whether the community builds it together now, or each team builds its own version separately and then spends the next three years writing adapters between all of them.

If the OpenTelemetry story taught us anything, it’s that the first path is always cheaper than the second.
