Blog
The latest on technical enforcement of crawling preferences.
AI Training Data Lawsuits: Where 2026 Landed
Year-end tracker of the AI training data docket. Bartz v. Anthropic settled for $1.5B, Andersen v. Stability went to a jury trial, Kadrey v. Meta produced the most-cited fair-use ruling, and Reddit v. Perplexity opened the scraper-intermediary front. A reference post with a comprehensive case table.
What Defensive Coordination Actually Looks Like in 2026
A year after Poison Fountain launched anonymously, no AI lab has acknowledged it, no publisher has named it, and no one has measured it. Compared against Anubis, AIPREF, and Cloudflare Pay Per Crawl, the contrast shows what real defensive coordination requires.
Scrapling and Crawlee: How Open-Source Scraping Tools Get Detected
A technical analysis of Scrapling and Crawlee, two popular open-source scraping frameworks, examining their anti-detection features and the behavioral signals that content-layer defenses can exploit.
GPAI After Six Weeks: Training-Data Disclosure Reaches Enforcement
On August 2, 2026, the EU AI Office's fining power for general-purpose AI obligations took effect. The training-data summary template requires labs to name 'most relevant domain names', the first time AI providers must publicly disclose where they crawled. Six weeks in, here is what's visible.
How AI Scraping Infrastructure Works: Proxies, Evasion, and Scale
Inside the technical infrastructure AI companies use to scrape the web: residential proxy networks, fingerprint emulation, CAPTCHA solving, and why traditional defenses fail.
AI Crawler Compliance, Mid-2026: The Blocked-but-Cited Trap
Publishers that blocked AI crawlers via robots.txt lost 23.1% of monthly traffic on average — and got only weakly correlated reductions in AI citation. The H1 2026 data inverts the case for blocking-as-defense.
Where AI Training Data Actually Comes From in 2026
A canonical reference for the six-layer AI training-data stack: Common Crawl, lab crawlers, curated open datasets, licensed feeds, contractor pipelines, and synthetic data. Includes a comprehensive licensing-deal table, current numbers, and what the labs do not disclose.
How Much Does It Cost to Scrape the Web at Scale?
Bulk residential proxy pricing, Web Unlocker tiers, and headless browser farms put real per-page scraping costs at $0.001-$0.005, not the widely quoted $0.01. AI training-data licensing deals show why the economics keep working for scrapers.
Data Poisoning FAQ: Technical, Legal, and Policy Answers
Answers to common questions about data poisoning, web crawling, robots.txt, AIPREF, legal status, and enforcement mechanisms for AI training defense.
Anubis at One Year: What Production Operators Are Actually Reporting
A year of public Anubis deployments yields concrete operator numbers, a Codeberg cautionary tale, and a project trajectory shift toward layered defenses. What the data says about proof-of-work anti-scraping.
Publisher Defenses Against AI Scraping: Cost Imposition vs Poisoning
Comparing defense strategies against AI scraping: proof-of-work systems impose costs, data poisoning degrades value. Who pays and what works for publishers.
AI Poisoning Threat Models: Backdoors, RAG, and Supply Chain
Backdoor attacks, model degradation, and RAG poisoning explained. Technical analysis of who can attack, defense costs, and power dynamics in AI training data.
AIPREF After Toronto: What the IETF Decided in April
The IETF AIPREF working group reached consensus on AI training scope at its April 2026 Toronto interim, made progress on AI search wording, and deferred the contested AI input category. Status update on the standard.
Defensive Data Poisoning: Ethics, Risks, and Alternatives
Analyzing ethical tradeoffs of defensive data poisoning: proportionality, collateral damage, and safer alternatives like proof-of-work and AIPREF standards.
What Is Data Poisoning in Machine Learning?
Data poisoning manipulates AI training data to alter model behavior. Learn how defensive tools like Nightshade protect content from unauthorized AI training.
The AI Crawler Compliance Crisis: Who Plays by the Rules?
AI crawler robots.txt compliance dropped from 96.7% to 70% in one year. Analysis of which crawlers comply, what it costs publishers, and what comes next.
Understanding AIPREF: The IETF Standard for AI Content Preferences
AIPREF extends robots.txt with standardized vocabulary for AI training preferences. How the IETF standard works, its syntax, and what it means for publishers.
Why VENOM Exists: From robots.txt to AI Data Enforcement
When robots.txt fails, enforcement mechanisms emerge. VENOM analyzes data poisoning, proof-of-work, and technical countermeasures for AI training governance.