Blog

The latest on technical enforcement of crawling preferences.

13 min read

AI Training Data Lawsuits: Where 2026 Landed

Year-end tracker of the AI training data docket. Bartz v. Anthropic settled for $1.5B, Andersen v. Stability went to a jury trial, Kadrey v. Meta produced the most-cited fair-use ruling, and Reddit v. Perplexity opened the scraper-intermediary front. Reference post with the comprehensive case table.

AI training data lawsuits 2026, Bartz Anthropic settlement, Andersen Stability AI trial
9 min read

What Defensive Coordination Actually Looks Like in 2026

A year after Poison Fountain launched anonymously, no AI lab has acknowledged it, no publisher has named it, and no one has measured it. Setting it against Anubis, AIPREF, and Cloudflare Pay Per Crawl shows what real defensive coordination requires.

Poison Fountain one year, defensive AI coordination, Anubis Anubis defense
18 min read

Scrapling and Crawlee: How Open-Source Scraping Tools Get Detected

A technical analysis of Scrapling and Crawlee, two popular open-source scraping frameworks, examining their anti-detection features and the behavioral signals that content-layer defenses can exploit.

Scrapling, Crawlee, open source web scraping
11 min read

GPAI After Six Weeks: Training-Data Disclosure Reaches Enforcement

August 2, 2026 turned on the EU AI Office's fining power for general-purpose AI obligations. The training-data summary template requires labs to name 'most relevant domain names', the first time AI providers must publicly disclose where they crawled. Six weeks in, here is what's visible.

EU AI Act GPAI, Article 53 training data summary, AI Office enforcement
15 min read

How AI Scraping Infrastructure Works: Proxies, Evasion, and Scale

Inside the technical infrastructure AI companies use to scrape the web: residential proxy networks, fingerprint emulation, CAPTCHA solving, and why traditional defenses fail.

AI web scraping, residential proxy networks, Bright Data
9 min read

AI Crawler Compliance, Mid-2026: The Blocked-but-Cited Trap

Publishers that blocked AI crawlers via robots.txt lost 23.1% of monthly traffic on average — and got only weakly correlated reductions in AI citation. The H1 2026 data inverts the case for blocking-as-defense.

AI crawler compliance, robots.txt blocked but cited, Zhao Berman publishers
15 min read

Where AI Training Data Actually Comes From in 2026

A canonical reference for the six-layer AI training-data stack: Common Crawl, lab crawlers, curated open datasets, licensed feeds, contractor pipelines, and synthetic data. With the comprehensive licensing-deal table, current numbers, and what the labs do not disclose.

AI training data sources, Common Crawl 2026, AI licensing deals
10 min read

How Much Does It Cost to Scrape the Web at Scale?

Bulk residential proxy pricing, Web Unlocker tiers, and headless browser farms put real per-page scraping costs at $0.001-$0.005, not the widely quoted $0.01. AI training-data licensing deals show why the economics keep working for scrapers.

web scraping cost, residential proxy pricing 2026, Bright Data pricing
23 min read

Data Poisoning FAQ: Technical, Legal, and Policy Answers

Answers to common questions about data poisoning, web crawling, robots.txt, AIPREF, legal status, and enforcement mechanisms for AI training defense.

data poisoning FAQ, robots.txt AI crawlers, AIPREF explained
8 min read

Anubis at One Year: What Production Operators Are Actually Reporting

A year of public Anubis deployments yields concrete operator numbers, a Codeberg cautionary tale, and a project trajectory shift toward layered defenses. What the data says about proof-of-work anti-scraping.

Anubis, proof-of-work anti-scraping, Anubis deployment data
27 min read

Publisher Defenses Against AI Scraping: Cost Imposition vs Poisoning

Comparing defense strategies against AI scraping: proof-of-work systems impose costs, data poisoning degrades value. Who pays and what works for publishers.

AI scraping defense, Anubis proof-of-work, publisher AI defense
27 min read

AI Poisoning Threat Models: Backdoors, RAG, and Supply Chain

Backdoor attacks, model degradation, and RAG poisoning explained. Technical analysis of who can attack, defense costs, and power dynamics in AI training data.

AI poisoning threat models, backdoor attacks AI, RAG poisoning
8 min read

AIPREF After Toronto: What the IETF Decided in April

The IETF AIPREF working group reached consensus on AI training scope at its April 2026 Toronto interim, made progress on AI search wording, and deferred the contested AI input category. Status update on the standard.

AIPREF, IETF AIPREF, AI training preferences
10 min read

Defensive Data Poisoning: Ethics, Risks, and Alternatives

Analyzing ethical tradeoffs of defensive data poisoning: proportionality, collateral damage, and safer alternatives like proof-of-work and AIPREF standards.

defensive poisoning ethics, data poisoning collateral damage, Anubis proof-of-work
7 min read

What Is Data Poisoning in Machine Learning?

Data poisoning manipulates AI training data to alter model behavior. Learn how defensive tools like Nightshade protect content from unauthorized AI training.

data poisoning, AI data poisoning, machine learning poisoning
10 min read

The AI Crawler Compliance Crisis: Who Plays by the Rules?

AI crawler robots.txt compliance dropped from 96.7% to 70% in one year. Analysis of which crawlers comply, what it costs publishers, and what comes next.

AI web crawling, robots.txt compliance, AI scraping
17 min read

Understanding AIPREF: The IETF Standard for AI Content Preferences

AIPREF extends robots.txt with standardized vocabulary for AI training preferences. How the IETF standard works, its syntax, and what it means for publishers.

AIPREF, IETF AIPREF, AI preferences standard
12 min read

Why VENOM Exists: From robots.txt to AI Data Enforcement

When robots.txt fails, enforcement mechanisms emerge. VENOM analyzes data poisoning, proof-of-work, and technical countermeasures for AI training governance.

AI data enforcement, enforcement vs signaling, robots.txt compliance