AI Crawler Compliance, Mid-2026: The Blocked-but-Cited Trap
Key Takeaways
- Two H1 2026 studies invert the case for blocking AI crawlers via robots.txt. Zhao and Berman (Rutgers/Wharton, SSRN, Dec 2025) show that publishers that blocked LLM crawlers experienced a persistent 23.1% decline in monthly visits and a 13.9% decline in human-only browsing. BuzzStream (March 2026, 4M citations) shows that AI citation rates correlate only weakly with whether a publisher allowed crawling [1][2]
- Cloudflare Radar Q1 2026: AI crawlers now make up ~22% of all bot traffic. Of that AI bot traffic, 89.4% is training or mixed-purpose, only 8% is search-related, and just 2.2% serves actual user queries [3]
- TollBit’s Q1 2026 quarterly was not yet published as of mid-2026. The Q4 2025 baseline (30% robots.txt non-compliance) remains the most recent industry-wide figure
- AIPREF production deployment remains effectively zero. The vocab-05 draft expired June 4 2026; vocab-06 (post-Toronto) is the live document. A competing individual draft (draft-romm-aipref-contentsignals) is now circulating, indicating vocabulary debate is not settled [4]
- Reddit v. Perplexity discovery is active. Reddit’s filings allege SerpApi accessed ~2B Google search results containing Reddit content; Oxylabs ~781M; AWMProxy ~482M. Perplexity denies the allegations [5]
Where We Are Six Months In
The March 2026 piece on AI crawler compliance (The AI Crawler Compliance Crisis) tracked the trajectory through Q4 2025: TollBit data showed robots.txt non-compliance climbing from 3.3% in Q4 2024 to 30% in Q4 2025, with a fourfold increase in sites adding GPTBot to disallow lists. The implicit recommendation was: keep blocking, signaling matters, the next decision points are IETF 125 Shenzhen (March) and the Toronto interim (April).
Six months later, two things complicate that recommendation. The headline quarterly numbers are stale: TollBit’s Q1 2026 quarterly hasn’t shipped, and Q2 2026 likely won’t be public until late September. And two H1 studies measure something the March post couldn’t: what blocking actually buys publishers.
Both studies reach the same conclusion by different methods. Blocking AI crawlers reduces traffic. It does not reliably reduce AI citation. The trade publishers thought they were making is not the trade they got.
What the H1 Studies Found
Zhao and Berman (Rutgers Business School / Wharton, SSRN, December 31 2025) ran a staggered difference-in-differences analysis covering October 2022 through June 2025. Publishers that blocked LLM crawlers via robots.txt experienced a persistent 23.1% decline in log-monthly visits (SimilarWeb) and a 13.9% decline in human-only browsing (Comscore). The effect held across publisher categories and persisted over time. This is not a transient adjustment [1].
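For readers unfamiliar with the method, here is a minimal sketch of the simplest two-group, two-period difference-in-differences on log visits. It is not the authors' staggered specification, which handles publishers that start blocking at different dates. The function names and the visit counts in the test data are hypothetical and chosen only for illustration.

```python
import math
from statistics import mean


def did_log_effect(treated_pre, treated_post, control_pre, control_post):
    """Two-by-two difference-in-differences on log monthly visits.

    Each argument is a list of monthly visit counts (hypothetical data).
    Returns the effect in log points: the change for blockers minus the
    change for non-blockers over the same period.
    """
    def mean_log(values):
        return mean(math.log(v) for v in values)

    treated_change = mean_log(treated_post) - mean_log(treated_pre)
    control_change = mean_log(control_post) - mean_log(control_pre)
    return treated_change - control_change


def log_points_to_percent(effect):
    """Convert a log-point effect into a percentage change in visits."""
    return (math.exp(effect) - 1) * 100


# Hypothetical publishers: blockers fall 30%, non-blockers fall 10%.
effect = did_log_effect(
    treated_pre=[1000.0, 2000.0],
    treated_post=[700.0, 1400.0],
    control_pre=[1000.0, 2000.0],
    control_post=[900.0, 1800.0],
)
percent_change = log_points_to_percent(effect)
```

The control group's change is subtracted out, so a market-wide decline (such as the search referral drop discussed below) is not attributed to blocking. One caveat on reading the headline figure: the summary does not say whether 23.1% is a raw coefficient on log visits or an already converted percentage. If it were a raw coefficient, `log_points_to_percent(-0.231)` would put the decline at roughly 20.6%.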
The 23.1% number is large and asymmetric. Blocking is a binary decision; the traffic loss is continuous. A publisher that blocks loses real revenue. A publisher that doesn’t block has no comparable lever to pull.
BuzzStream (March 19, 2026) studied 4 million citations across 3,600 prompts in AI answer engines. They found that citation rates correlated only weakly with whether the publisher allowed crawling. The mechanisms identified: Common Crawl historical archives, robots.txt non-compliance among AI bots themselves, and search-API intermediaries (SerpApi-class) that decouple “did the publisher allow this bot” from “did the AI engine cite this content” [2].
A third data point: the Reuters Institute Digital News Report (January 2026) put Google Search referrals to news publishers down 38% in the US. Web search’s share of overall referrals fell from 51% to 27% between 2023 and Q4 2025. The traffic problem is not only AI scraping. The search-traffic baseline is collapsing simultaneously.
Blocking is a costly action that does not produce the benefit publishers expected.
What Cloudflare Radar Shows
Cloudflare Radar’s Q1 2026 numbers describe the bot ecosystem publishers are actually facing [3]:
- AI crawlers account for ~22% of all bot traffic, the second-largest category behind search engines
- Of that AI traffic, 89.4% is training or mixed-purpose, 8% search-related, 2.2% user-query response
- GPTBot’s share of AI bot traffic declined within Q1 from 12.13% to 11.05%
- Among the 4,055 robots.txt-equipped domains in the Cloudflare slice, 476 (~11.7%) disallow GPTBot
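The shares above are nested (a share of a share), which makes them easy to misread. The short sketch below restates them as fractions of all bot traffic, using only the figures quoted in the list. The variable and function names are illustrative.

```python
# Figures quoted above from Cloudflare Radar Q1 2026.
AI_SHARE_OF_BOT_TRAFFIC = 0.22
AI_TRAFFIC_BY_PURPOSE = {
    "training or mixed-purpose": 0.894,
    "search-related": 0.08,
    "user-query response": 0.022,
}


def share_of_all_bot_traffic(ai_share, purpose_shares):
    """Convert shares of AI bot traffic into shares of all bot traffic."""
    return {purpose: ai_share * share for purpose, share in purpose_shares.items()}


shares = share_of_all_bot_traffic(AI_SHARE_OF_BOT_TRAFFIC, AI_TRAFFIC_BY_PURPOSE)

# GPTBot's share fell from 12.13% to 11.05% of AI bot traffic within Q1.
gptbot_point_decline = 12.13 - 11.05
gptbot_relative_decline = gptbot_point_decline / 12.13

# Domains in the Cloudflare slice that disallow GPTBot.
gptbot_disallow_rate = 476 / 4055

for purpose, share in shares.items():
    print(f"{purpose}: {share:.2%} of all bot traffic")
print(f"GPTBot relative decline: {gptbot_relative_decline:.1%}")
print(f"GPTBot disallow rate: {gptbot_disallow_rate:.1%}")
```

Read this way, training or mixed-purpose AI crawling is just under a fifth of all bot traffic, while crawling that serves an actual user query is about half a percent.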
The GPTBot share decline does not mean OpenAI is scraping less. ChatGPT-User, the RAG/inference-time bot distinct from GPTBot, remained the worst offender on per-page scrape rate through Q4 2025 (5x Meta and 16x Perplexity). The composition of AI bot traffic is shifting from training-time scrapes to inference-time retrieval, and that shift falls outside the bot-name blocklists most operators have built.
Meta-ExternalAgent continues to lead total share. Googlebot reclassification affects the numbers: Cloudflare now treats Googlebot as partially AI-purpose due to AI Overviews and Gemini grounding, which has increased “AI bot” totals without a real volume change.
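One way to see the blocklist gap described above is to test a robots.txt file against several user-agent names. The sketch below uses Python's standard `urllib.robotparser`. The robots.txt content, the example URL, and the agent list are assumptions made up for illustration, and the check only shows what a compliant bot would be told, not what any bot actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks only the training crawler by name.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

# Illustrative agent names; extend with whatever appears in your own logs.
AI_USER_AGENTS = ["GPTBot", "ChatGPT-User", "Meta-ExternalAgent", "PerplexityBot"]


def blocked_agents(robots_txt, agents, url="https://example.com/article"):
    """Return {agent: True if robots.txt disallows that agent for url}."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, url) for agent in agents}


result = blocked_agents(ROBOTS_TXT, AI_USER_AGENTS)
for agent, blocked in result.items():
    print(f"{agent}: {'blocked' if blocked else 'allowed'}")
```

With this file, only GPTBot is disallowed. ChatGPT-User and the other agents fall through to the wildcard rule and are allowed, which is the pattern a name-based blocklist produces when it was written with training crawlers in mind.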

The Discovery Filings That Matter
Reddit v. Perplexity, SerpApi, Oxylabs, and AWMProxy (filed SDNY, October 2025) entered active discovery in early 2026. Reddit’s expert reports allege the following scraping volumes [5]:
- SerpApi allegedly accessed ~2 billion Google search results containing Reddit content
- Oxylabs allegedly accessed ~781 million
- AWMProxy allegedly accessed ~482 million
Perplexity’s February 2026 filing denies the scraping allegations. No ruling yet. The volume figures are in the public docket regardless, and they will appear in coverage of every adjacent compliance question.
This formalizes a hypothesis the BuzzStream data already implies. AI citation does not require AI training scrapers to crawl publisher content directly. It can route through Google search results the publisher has not opted out of, then through a SerpApi-class intermediary, then to an answer engine. Publisher blocking of GPTBot catches none of that chain.
The AIPREF Status Check
The standard publishers were told to track is still in draft. As of mid-2026:
- draft-ietf-aipref-vocab-05 expired June 4 2026; the post-Toronto vocab-06 is the active document. Working consensus on AI training scope was reached at the April 14-16 Toronto interim [6]
- The companion attach draft (draft-ietf-aipref-attach) has been expired since October 2025
- A competing individual draft, draft-romm-aipref-contentsignals, is circulating, indicating the vocabulary debate is not closed
- IETF 126 in Vienna (July 18-24 2026) is the next decision point
Production AIPREF deployment is effectively zero. No publisher count exists. Cloudflare’s “Managed Robots.txt” (October 2025) propagates blocklists at platform scale but uses legacy User-agent / Disallow vocabulary, not Content-Usage headers. Fastly’s early-2026 “The Truth About Blocking AI” offers AI bot management but no Content-Usage tooling. Search Engine Land reported in March 2026 that managed-WordPress hosts are silently blocking AI bots without admin visibility. That is the opposite of standardized signaling.
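To make the contrast with User-agent / Disallow concrete, here is a rough sketch of how a consumer might read a Content-Usage header value. Only the header name comes from the text above. The category tokens (`train-ai`, `search`) and the `y`/`n` values are assumptions based on earlier public AIPREF drafts and may not match vocab-06. The parser is a simplification, not a full HTTP Structured Fields implementation.

```python
from typing import Dict, Optional


def parse_content_usage(header_value: str) -> Dict[str, str]:
    """Parse a Content-Usage style value into {category: preference}.

    Simplified: splits on commas and '=' only. Token names and values
    are assumptions; consult the current draft before relying on them.
    """
    prefs: Dict[str, str] = {}
    for item in header_value.split(","):
        key, sep, value = item.strip().partition("=")
        if not key or not sep:
            continue  # skip empty members and members without a value
        prefs[key.strip().lower()] = value.strip().lower()
    return prefs


def allows(prefs: Dict[str, str], category: str) -> Optional[bool]:
    """True if allowed, False if disallowed, None if no preference stated."""
    value = prefs.get(category)
    if value == "y":
        return True
    if value == "n":
        return False
    return None


# Hypothetical header value: no AI training, search use permitted.
prefs = parse_content_usage("train-ai=n, search=y")
print(allows(prefs, "train-ai"), allows(prefs, "search"), allows(prefs, "other"))
```

The design difference is that the preference attaches to a category of use rather than to a named bot, so it does not depend on the operator knowing every crawler's user agent. Returning `None` for an unstated category keeps "no preference" distinct from "disallowed".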
For background on the standard’s structure and the Toronto outcomes, see AIPREF After Toronto.
What Cloudflare and Stack Overflow Did Build
Cloudflare’s Agents Week 2026 (April) launched two production features [7]:
- Redirects for AI Training: serves a canonical content version to verified AI crawlers. This is not blocking. It is “if you crawl, here is the version we want represented.”
- AI Crawl Control went GA. Pay Per Crawl remains in private beta; Stack Overflow is the most prominent adopter (announced February 2026) [8].
The Cloudflare WAF release on 2026-04-21 extended AI bot heuristics. Whether it materially shifts the compliance picture for Cloudflare-fronted sites in H2 2026 is the open question.
What Publishers Actually Have to Decide
The H1 2026 data redraws the publisher decision tree.
Blocking has a measurable cost. The Zhao/Berman 23.1% traffic decline is the most-cited number now and will appear in every publisher’s internal deck through 2026. A publisher choosing to block AI crawlers is choosing to absorb that cost.
Blocking does not reliably reduce AI citation. The BuzzStream finding plus the Reddit v. Perplexity volume disclosures show why. Crawl access is not the chokepoint citation flows through.
The signaling-only framing has weakened. The March piece argued that even imperfect signaling matters because regulators and courts use it as evidence of intent. That argument still holds. It now has to compete with a 23.1% revenue cost.
The implication is the same one we’ve outlined elsewhere: if cost-imposition can’t close the gap and signaling alone has weakened, defense logic has to extend to the value side of the ledger. For the unit economics behind why pure blocking will not work, see How Much Does It Cost to Scrape the Web at Scale? and Cost Imposition vs Value Degradation.
What’s Still Coming
Three things that will reshape this post if they land before year-end:
- TollBit Q2 2026 State of the Bots (expected late September): the next industry-wide non-compliance baseline. It will determine whether the 30% Q4 2025 figure was an inflection or a peak.
- Reddit v. Perplexity ruling or motion outcomes: discovery is producing public material. Any ruling will be cited as precedent in adjacent cases.
- AIPREF vocab-06 to IESG submission: target was August 31 2026. Whether that hits or slips determines whether publishers have a standardized signal by year-end.
The H1 2026 picture is the one publishers have to plan against. The trade is worse than it looked in March. The defenses that work are not the ones publishers were told to deploy.
Last updated: August 2026
References
- [1] Zhao, R. and Berman, R. (December 31 2025). "Blocking LLM Crawlers and Publisher Traffic." SSRN working paper. Summary coverage: https://ppc.land/blocking-ai-crawlers-backfired-news-publishers-lost-23-of-traffic/ (search SSRN by author for primary)
- [2] BuzzStream (March 19 2026). "Blocking AI Crawlers Doesn't Stop Citations: New Data Shows Why." https://ppc.land/blocking-ai-crawlers-doesnt-stop-citations-new-data-shows-why/
- [3] Cloudflare Radar Q1 2026 bot statistics, accessed May 2026. The Radar dashboard is live; figures cited reflect the Q1 2026 snapshot. https://radar.cloudflare.com/bots
- [4] IETF AIPREF Working Group. https://datatracker.ietf.org/wg/aipref/about/
- [5] Reddit v. Perplexity / SerpApi / Oxylabs / AWMProxy (SDNY, Oct 2025). Coverage of discovery filings: https://searchengineland.com/reddit-sues-perplexity-serpapi-scraping-google-463681
- [6] Keller, P., Thomson, M. "A Vocabulary For Expressing AI Usage Preferences." draft-ietf-aipref-vocab-06. https://datatracker.ietf.org/doc/draft-ietf-aipref-vocab-06
- [7] Cloudflare Blog (Agents Week 2026). "Redirects for AI Training." https://blog.cloudflare.com/ai-redirects/
- [8] Stack Overflow + Cloudflare (Feb 2026). "Pay Per Crawl Adoption." https://stackoverflow.blog/2026/02/19/stack-overflow-cloudflare-pay-per-crawl/
- [9] Reuters Institute (Jan 2026). "Digital News Report 2026." https://reutersinstitute.politics.ox.ac.uk/
- [10] The Register (Feb 4 2026). "AI Bot Traffic Closing in on Human Web Visits." https://www.theregister.com/2026/02/04/ai_bot_traffic_web_browsers/