The Website Is Not the Product, Context Is

Anthropic's crawler took 60,000 pages for every visitor it returned in mid-2025. Web publishing's distribution layer is being extracted, not visited. Here's what comes after the website, why MCP endpoints are the next surface, and which publishers are already moving.

Jan Schmitz | 11 min read

TL;DR: Cloudflare’s data shows Anthropic’s crawler pulled around 60,000 pages from publishers for every human visitor it sent back in mid-2025, up from 6,000:1 six months earlier. OpenAI sits at roughly 1,500:1. Even Google has slid from 2:1 a decade ago to 18:1 today. AI systems are extracting the web at industrial scale without paying the traffic tax that funded it, and better SEO won’t fix it. What might is a different distribution layer: Structured content served through MCP endpoints, indexed by llms.txt, paid for by access to context instead of page views. Almost no major publisher is building it, which is where the opportunity sits.


In June 2025, Cloudflare CEO Matthew Prince sat down for an Axios interview at the Cannes Lions festival and put a number on something publishers had been feeling for two years. Ten years ago, Google sent a publisher one human visitor for every two pages it crawled. Six months before he spoke, the ratio was one visitor per six pages. By the time he stood up, it was one per 18.

Then he turned to AI. OpenAI had gone from 250 pages crawled per referral to 1,500. Anthropic had gone from 6,000 to 60,000.

“People aren’t following the footnotes,” Prince told the room. He meant it as a diagnosis, not a metaphor. The old bargain of the open web (you let crawlers in, they send you readers) is being unwound in real time. Publishers who think this is a content-quality problem, or a metadata problem, or a vibes problem about Google updates are watching the wrong dashboard.

What’s happening is bigger than a search algorithm changing its mind. The distribution layer of the web is being rebuilt for a different kind of consumer.

The number that should end the SEO conversation

The Cloudflare data deserves a careful read, because the headline usually gets misquoted.

Anthropic’s ClaudeBot was pulling close to 70,900 pages per referral in the last week of June 2025, and it had spiked far higher than that earlier in the year, according to Cloudflare. By July, after Anthropic shipped its own web-search feature and started sending some traffic back, the figure dropped to around 38,000:1. It kept improving into 2026: By the first half of March it was running near 11,700:1, on figures compiled from Cloudflare Radar.

So things got better. They’re still catastrophic.

A ratio of roughly 11,700:1 means ClaudeBot fetches well over eleven thousand of a publisher’s pages for every human it sends back. Googlebot, for comparison, runs at around 5:1. OpenAI’s GPTBot sits closer to 1,300:1. Perplexity is far lower, around 110:1, though that’s well up from where it started 2025.

Now put that next to another number from the same dataset. The AI chatbot referral channel (ChatGPT, Gemini, Claude and Perplexity combined) accounts for about 0.29% of search referrals. Agents are eating a meaningful share of intent at the top of the funnel while sending almost nothing to the bottom of it.

The Wall Street Journal, the New York Times, MIT Technology Review, every B2B publisher you can name: They all run on the same equation. Content costs money to produce, and attention is what pays for it. When the attention layer disappears, the equation breaks. U.S. organic search traffic fell about 2.5% over the year to November 2025 on Graphite’s read of Similarweb data, and Google referrals to publishers dropped 38% in the U.S. over a similar window, on Chartbeat data in the Reuters Institute’s 2026 trends report. AI Overviews have climbed from about 13% of U.S. searches in early 2025 into the 16-to-25% range by early 2026, depending on whose methodology you trust.

You can’t SEO your way out of this. The reader stopped being the destination.

Three readers, three incompatible products

Here’s the framing that gets the next decade right. Publish anything (research, journalism, documentation, product data, analysis) and you now have three kinds of consumer reading you, each wanting something different.

Humans want narrative. A reason to keep scrolling. Context, voice, brand, the small dopamine hit of a well-built page. They put up with ads because they accepted that trade a long time ago.

LLMs used as tools want dense signal: Structured summaries, tight passages, machine-friendly markup, as little noise per token as you can manage. Long preambles cost them money. Repetitive framing actively degrades the answer they synthesise.

AI agents acting on their own want endpoints, not pages. An agent is running a workflow for some human (book the flight, compare the contract terms, monitor the regulatory filing, find the engineering spec), and it needs a callable interface, schemas it can trust, and the right slice of knowledge at the right granularity.

Most content strategies still optimise hard for the first group. A growing number of teams are tweaking for the second, adding FAQ sections, restructuring H2s, shipping llms.txt, writing more declarative leads. Almost nobody is building for the third.

That third gap is where the next distribution advantage lives, and a handful of companies have already started building the layer beneath the website.

llms.txt is the right idea, halfway

Jeremy Howard proposed llms.txt in September 2024. It’s a markdown index file you drop at your domain root that tells AI systems what your site contains, where the canonical version of each thing lives, and how to read it. Think robots.txt, but for meaning instead of access.
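To make the format concrete, here is a minimal Python sketch that renders an llms.txt body from a content inventory. It follows the layout in the proposal (an H1 title, a blockquote summary, then H2 sections of annotated links, with "Optional" marking material an agent can skip). The site name, URLs and the `Entry` and `render_llms_txt` names are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    title: str
    url: str
    note: str


def render_llms_txt(site: str, summary: str, sections: dict[str, list[Entry]]) -> str:
    """Render an llms.txt body: H1 title, blockquote summary, H2 link lists."""
    lines = [f"# {site}", "", f"> {summary}", ""]
    for heading, entries in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for e in entries:
            lines.append(f"- [{e.title}]({e.url}): {e.note}")
        lines.append("")
    # Collapse trailing blank lines into a single final newline.
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical inventory: every name and URL below is invented for the example.
llms_txt = render_llms_txt(
    site="Example Review",
    summary="Independent analysis of enterprise software pricing.",
    sections={
        "Research": [
            Entry("Pricing index", "https://example.com/research/pricing-index.md",
                  "Quarterly list prices by vendor"),
        ],
        "Optional": [
            Entry("Archive", "https://example.com/archive.md",
                  "Articles older than two years"),
        ],
    },
)
print(llms_txt)
```

Deciding what goes into `sections` is the real work; the rendering itself is trivial.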

The adoption story is messier than the marketing suggests.

About 10% of indexed websites now publish an llms.txt, according to an SE Ranking study of roughly 300,000 domains. That sounds modest until you see who’s on the list: Anthropic, Stripe, Cursor, Cloudflare, Vercel, Mintlify, Supabase, LangGraph, Coinbase, Pinecone. Almost every API-first developer tools company ships one. Stripe’s llms.txt is the one everyone points to, partly because it includes a block of plain-English instructions written straight at LLM agents.

Here’s the part the cheerleading articles tend to bury, though. The major LLM crawlers from OpenAI, Google and Anthropic don’t currently fetch llms.txt in any real volume. IDE-side agents do. Cursor, Continue, Cline, Aider and the Mintlify-style documentation MCPs read it religiously. The web-crawling pipelines training the next foundation model? Not so much.

So in mid-2026, llms.txt is really a developer-experience play, not an SEO win. It optimises the moment a human opens Cursor and points it at your docs. It does almost nothing to your crawl-to-refer ratio.

And even when it works perfectly, it hits a ceiling. The file describes content; it doesn’t serve it. It tells the agent “go here, then here, then here,” and the agent still has to crawl HTML pages built for humans, parse them, strip the navigation, skip the cookie banner, and rebuild the structured meaning the publisher already had sitting in a database somewhere.
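The parse-and-strip step looks roughly like the sketch below, written with Python's standard-library HTML parser. The list of tags to skip and the sample page are assumptions for illustration; real pages are messier and real agents use heavier extraction tooling.

```python
from html.parser import HTMLParser

# Elements a human-facing page carries but an agent has to throw away (assumed list).
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any SKIP_TAGS element."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Invented sample page: navigation, a cookie banner, one article, a footer.
page = """
<html><body>
  <nav><a href="/">Home</a> <a href="/pricing">Pricing</a></nav>
  <aside id="cookie-banner">We use cookies. <button>Accept</button></aside>
  <article><h1>Q3 filing summary</h1><p>Revenue rose 4% year on year.</p></article>
  <footer>Subscribe to our newsletter</footer>
</body></html>
"""
print(extract_text(page))  # Q3 filing summary Revenue rose 4% year on year.
```

Everything this code discards was added by the publisher, and the two sentences it keeps were already a structured record before the page was rendered.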

That’s wasted motion on both sides. The publisher built a website to serve humans. The agent built a parser to undo it. Between those two efforts, something is missing.

The missing layer is MCP

The Model Context Protocol is the closest thing the industry has to that missing piece.

Anthropic open-sourced MCP in November 2024 as plumbing for connecting AI assistants to tools and data sources. Within about a year it had 97 million monthly SDK downloads, more than 10,000 active public servers, and first-class client support across Claude, ChatGPT, Cursor, Gemini, Microsoft Copilot and VS Code. In December 2025, Anthropic donated the protocol to a new Linux Foundation body, the Agentic AI Foundation, co-founded with OpenAI and Block. Going from internal project to standards-body governance that fast is rare; the obvious comparison is Kubernetes.

For publishers, the governance story isn’t the point. What matters is the shape of an MCP server.

It exposes typed tools. An agent connects, asks “what can you do?”, and gets back a list of capabilities: search_articles, get_pricing_history, lookup_filing(company, quarter), retrieve_definition(term). It calls a tool with structured arguments and gets structured data back. No HTML to parse, no layout to navigate, no cookie banner to dismiss. The publisher decides which tools to expose, who can call them, how often, and at what price.
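The exchange can be sketched in a few lines of Python. MCP messages are JSON-RPC 2.0, tool discovery uses the `tools/list` method, and invocation uses `tools/call`. This is a standard-library illustration of the message shapes only, not the official SDK: it skips the initialization handshake, transport, argument validation and error handling for unknown tools, and the `FILINGS` data and `handle` function are invented for the example.

```python
import json

# Hypothetical in-memory archive; a real server would query the publisher's database.
FILINGS = {("ACME", "2025-Q3"): {"revenue_usd_m": 412.0, "source": "10-Q"}}

# Tool descriptors: a name, a description, and a JSON Schema for the arguments.
TOOLS = [
    {
        "name": "lookup_filing",
        "description": "Return key figures from a company's quarterly filing.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "company": {"type": "string"},
                "quarter": {"type": "string"},
            },
            "required": ["company", "quarter"],
        },
    }
]


def lookup_filing(company: str, quarter: str) -> dict:
    return FILINGS.get((company, quarter), {})


HANDLERS = {"lookup_filing": lookup_filing}


def handle(request: dict) -> dict:
    """Answer one JSON-RPC 2.0 request for the two tool methods."""
    method, params = request["method"], request.get("params", {})
    if method == "tools/list":
        result = {"tools": TOOLS}
    elif method == "tools/call":
        data = HANDLERS[params["name"]](**params.get("arguments", {}))
        result = {"content": [{"type": "text", "text": json.dumps(data)}],
                  "isError": False}
    else:
        return {"jsonrpc": "2.0", "id": request.get("id"),
                "error": {"code": -32601, "message": "Method not found"}}
    return {"jsonrpc": "2.0", "id": request.get("id"), "result": result}


# The agent asks what the server can do, then calls a tool with typed arguments.
listing = handle({"jsonrpc": "2.0", "id": 1, "method": "tools/list"})
answer = handle({
    "jsonrpc": "2.0", "id": 2, "method": "tools/call",
    "params": {"name": "lookup_filing",
               "arguments": {"company": "ACME", "quarter": "2025-Q3"}},
})
print([tool["name"] for tool in listing["result"]["tools"]])
print(answer["result"]["content"][0]["text"])
```

The schema in `inputSchema` is what lets an agent call the tool without reading any documentation written for humans.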

If you publish knowledge, that’s the surface that matters now: The answered query, not the page view.

A few examples worth watching. Stripe ships an MCP server that hands account-aware product, pricing and payment context to coding agents, and it’s the clearest case I can point to of MCP being treated as a real distribution channel rather than a demo. TollBit runs a paywall and metering gateway that lets publishers charge AI crawlers and agents for access, and has started exposing that content through MCP; the thing being sold is access to a verified fact rather than a page impression. Bloomberg sits on the Agentic AI Foundation’s platinum tier (with Datadog a tier below, on gold) specifically to shape how authenticated, paid agent access works, and already runs an internal MCP layer over its Terminal data. And the IAB Tech Lab has stood up its Content Monetization Protocols (CoMP) for AI working group to standardise how publishers and AI systems transact.

None of this is hypothetical. The infrastructure is built, the billing works, the clients are shipping. What’s missing is publishers treating any of it as a priority.

What’s actually missing is the conceptual leap

The piece that sent me down this path put it bluntly: The gap isn’t technical, the tools exist. A small team can ship an MCP server in a week. An llms.txt takes an afternoon. The hard part is the change in worldview.

Substack doesn’t have an llms.txt. The New York Times doesn’t have an MCP endpoint. The Atlantic, the Financial Times, Vogue, Wired, the Verge, TechCrunch: None of them publish a callable knowledge interface. They publish websites, and they treat the website as the asset.

The website is the UI. Context is the product.

That reads as glib, but it’s the load-bearing sentence in the whole argument. A publication’s real asset is the structured knowledge in its archive: The verified quotes, the dated facts, the curated taxonomy, the editorial judgement about what counts as authoritative. The website was only ever one rendering of that asset, built for one particular reader, the human with a browser open.

The new readers need a different rendering: Cleaner, schema-typed, authenticated, metered. Publishers who get this will package the same underlying asset three ways. Humans get the magazine, LLMs get the structured index, agents get the endpoint. The ones who don’t will keep optimising headlines for a referral that stopped arriving.

What this means for the industry

A few uncomfortable predictions follow from the data.

SEO budgets get repurposed. Spending six figures on technical SEO when your crawl-to-refer ratio is 11,000:1 is rearranging deck chairs. Put that same headcount on building one MCP endpoint, instrumenting it, and signing a single commercial deal with an AI operator, and you’ll see measurable value inside a quarter.

Licensing becomes the norm. The era of crawler-as-trespasser is ending. Cloudflare’s pay-per-crawl experiments, OpenAI’s publisher deals with News Corp, the Atlantic, Vox Media and the Financial Times, and TollBit’s per-call billing are all converging on the same answer: Paid, authenticated, audit-logged access to context. The unauthenticated public crawl is turning into a legacy mode.

Mid-tier publishers either consolidate or specialise. General-interest publications without a defensible niche have a hard decade coming. Specialist publishers with deep, verifiable archives in law, medicine, finance, science and engineering are sitting on the most valuable agent-callable data anywhere, and most of them haven’t noticed yet.

The next big newsroom hire is a protocol engineer. Editorial and product engineering will need a new role sitting between them: Someone who decides which slices of the archive become which tools, how the schemas are designed, what counts as a billable query, and how editorial values get written into the endpoint contract. That job doesn’t exist on most mastheads today.

Agent-side branding becomes a thing. When an agent answers “what did the FT say about the European banking stress tests?” and cites a metered FT MCP call, the citation inside that answer is the new front page. Publishers will fight over how their brand shows up in agent contexts the same way they fought over Google snippets ten years ago.

A short playbook

If you publish anything, three moves are worth making this quarter.

First, measure your own ratio. Cloudflare publishes aggregate numbers; you can produce your own. Pull your server logs, segment by user-agent, count the pages you serve to ClaudeBot, GPTBot, Googlebot and PerplexityBot, and divide by the referrals each operator sends back. (Google-Extended is a robots.txt control token rather than a user-agent string, so it won’t show up in your logs.) The result will probably be worse than you’d guess. It’ll also be the cleanest metric you have for the next two years of strategy conversations.
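A rough version of that calculation fits in a short Python script. The sketch assumes access logs in Combined Log Format and counts a referral whenever a non-crawler request arrives with a referrer hostname belonging to the same operator. The hostname pairings in `OPERATORS` are my assumption, and the sample user-agent strings are simplified, so check both against your own logs.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumed pairing: crawler user-agent token -> referrer hostnames of the same operator.
OPERATORS = {
    "ClaudeBot": {"claude.ai"},
    "GPTBot": {"chatgpt.com", "chat.openai.com"},
    "PerplexityBot": {"perplexity.ai", "www.perplexity.ai"},
}

# Tail of a Combined Log Format line: "request" status bytes "referer" "user-agent"
LINE = re.compile(r'"[^"]*" \d{3} \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$')


def crawl_to_refer(lines):
    """Return {crawler: (pages_crawled, referrals, ratio or None)}."""
    crawls, referrals = Counter(), Counter()
    for line in lines:
        m = LINE.search(line.strip())
        if not m:
            continue  # skip lines that are not in the expected format
        agent = m["agent"]
        host = urlparse(m["referer"]).hostname  # None for "-" or an empty referrer
        for bot, hosts in OPERATORS.items():
            if bot in agent:
                crawls[bot] += 1
            elif host in hosts:
                referrals[bot] += 1
    return {
        bot: (crawls[bot], referrals[bot],
              crawls[bot] / referrals[bot] if referrals[bot] else None)
        for bot in OPERATORS
    }


def log_line(path, referer, agent):
    """Build one synthetic log line for the demo below."""
    return (f'203.0.113.5 - - [01/Mar/2026:10:00:00 +0000] '
            f'"GET {path} HTTP/1.1" 200 512 "{referer}" "{agent}"')


# Synthetic sample: three ClaudeBot fetches, one GPTBot fetch, one human from claude.ai.
sample = [
    log_line("/a", "-", "Mozilla/5.0 (compatible; ClaudeBot/1.0)"),
    log_line("/b", "-", "Mozilla/5.0 (compatible; ClaudeBot/1.0)"),
    log_line("/c", "-", "Mozilla/5.0 (compatible; ClaudeBot/1.0)"),
    log_line("/a", "-", "Mozilla/5.0 (compatible; GPTBot/1.0)"),
    log_line("/a", "https://claude.ai/chat/abc", "Mozilla/5.0 (Macintosh) Safari/605.1"),
]
ratios = crawl_to_refer(sample)
print(ratios)
```

A ratio of `None` means the crawler fetched pages and no referrals came back during the window.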

Second, ship an llms.txt, but for the right reason. Not because it’ll help SEO. It won’t, not yet. Ship it because it forces you to inventory your content as structured knowledge instead of as URLs. Deciding what belongs in the index, how to describe each section in machine-readable prose, and which paths actually carry your value teaches you more than the file itself ever will.

Third, prototype one MCP server, narrow and useful. Pick a slice of your archive someone would pay to query: Pricing data, regulatory filings, product specs, verified quotes. A working endpoint with a manual auth layer and ten tools will teach you more about this new distribution layer than a year of think pieces. Once it’s live, plug it into Claude, ChatGPT and Cursor and watch what the agents actually do with it.
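For the manual auth layer, something as small as the following Python sketch is enough to start with. It checks a hand-issued API key, enforces a quota, and logs every attempt so there is a record of what counts as a billable query. The key table, the quota and the `Meter` class are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical key table; in practice this would live in a database or secrets store.
API_KEYS = {"key-agent-123": {"customer": "example-ai-operator", "monthly_quota": 3}}


@dataclass
class Meter:
    calls: Counter = field(default_factory=Counter)
    audit_log: list = field(default_factory=list)

    def authorize(self, api_key: str, tool: str) -> bool:
        """Allow the call if the key is known and under quota; log every attempt."""
        account = API_KEYS.get(api_key)
        allowed = account is not None and self.calls[api_key] < account["monthly_quota"]
        if allowed:
            self.calls[api_key] += 1  # only successful calls count as billable
        customer = account["customer"] if account else "unknown"
        self.audit_log.append((customer, tool, allowed))
        return allowed


meter = Meter()
results = [meter.authorize("key-agent-123", "get_pricing_history") for _ in range(4)]
rejected = meter.authorize("not-a-real-key", "get_pricing_history")
print(results, rejected, meter.calls["key-agent-123"])
```

Calling `authorize` before each tool handler gives you a per-customer call count and an audit trail from day one, which is most of what a first commercial conversation needs.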

Run those three plays this quarter and you’ll be further along than 95% of the publishing industry by year-end. The technical lift is small. The conceptual one, accepting that the website is no longer the product, is where most of the work actually is.

Where the next advantage sits

Read the Cloudflare data and the conclusion is hard to avoid. The web’s original economic contract has expired, and nobody has written the replacement yet. The 60,000:1 number is the ugliest sign that the old deal is dead. The 10,000 active MCP servers, the 97 million monthly SDK downloads, the platinum-tier publishers pulling up a chair at the Linux Foundation, the licensing deals being signed quietly through Q2: Those are the first sketches of the new one.

Here’s roughly how it reads. Humans subscribe to whichever surface they like best, agents query the endpoint, and the thing you charge for is access to context. The website doesn’t disappear. It gets demoted. It becomes the fallback UI for the people who still want to scroll a page, while the real distribution layer runs underneath it.

Publishers who get there early will set the price, the schema and the brand conventions for a decade of agent traffic. The ones who keep insisting the website is the product will spend that same decade watching their crawl-to-refer ratio climb and their referrals fall, wondering why fiddling with the homepage hero stopped working.

Nobody major is building that layer yet. That’s the gap, and the gap is where the trade is.
