Where AI Training Data Actually Comes From in 2026

Key Takeaways

The 2026 frontier-model training stack has six functional layers: open web crawls, curated open datasets, licensed publisher and platform feeds, code corpora, contractor-generated data, and synthetic data. Most frontier models draw from all six¹
Composition shares for any frontier model after GPT-3 are not public. GPT-3’s disclosed 60% Common Crawl share remains the only hard public datapoint. GPT-4, GPT-5, Claude Opus/Sonnet 4.x, Gemini 2.5/3, Llama 4, and Grok all reference “publicly available data” without numeric breakdowns²
Common Crawl ran ~2.0-2.2B pages and 344-379 TiB per monthly snapshot in early 2026. Total archive: 9.5+ PB and 300B+ pages since 2008. CCBot is the second-most-blocked AI agent on top-10K domains, behind GPTBot³
Bartz v. Anthropic settled for $1.5B in September 2025, roughly $3,000 per book across ~500K works. The court drew a line: training on legally acquired books is fair use, pirated acquisition is not. In Kadrey v. Meta, Judge Chhabria granted Meta summary judgment on fair use as to the 13 named plaintiffs on June 25, 2025, while leaving other authors free to sue⁴⁵
The contractor labor market — Surge AI ($1.2B revenue, $25B valuation), Scale AI ($29B post-Meta), Mercor ($450M+ ARR) — is now plausibly comparable in spend to a meaningful fraction of GPU training compute⁶⁷

The Six-Layer Stack

Frontier-model training data in 2026 is no longer a single pipeline. It is six layers running in parallel, with the center of gravity moving away from raw web crawls and toward licensed feeds, contractor-generated data, and synthetic data.

The layers:

Open web crawls — Common Crawl plus lab-operated crawlers (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, Meta-ExternalAgent, OAI-SearchBot)
Curated open datasets — FineWeb, FineWeb-Edu, RefinedWeb, Dolma, RedPajama-V2, The Stack v2; legacy: The Pile, C4
Licensed publisher and platform feeds — News Corp, Reddit, Shutterstock, AP, FT, Vox, Axel Springer, Stack Overflow, and dozens more
Code corpora — The Stack v2 (BigCode), GitHub-derived crawls, internal lab code archives
Human-curated and contractor-generated data — RLHF preference data, SFT instruction data, expert demonstrations, red-team data from Surge, Scale, Mercor, Outlier, Invisible, Toloka
Synthetic and self-generated data — model-generated rephrasings, distillation traces, reasoning traces, multimodal pairs

The narrative people still tell, “AI is trained on the web,” was accurate for GPT-3. It is misleading for any frontier model in 2026. Public web text is no longer the binding constraint. Quality, reasoning traces, and verifiable answers are.

Figure: Six-layer AI training data stack (open web crawls, curated open datasets, licensed feeds, code corpora, contractor-generated data, synthetic data), with frontier models drawing from all six.

Common Crawl in 2026

Common Crawl remains the foundation of most public web-text datasets. Its current numbers:

Total archive: 9.5+ PB cumulative, 300B+ pages captured since 2008
Monthly cadence: ~2.0-2.2B pages and ~344-379 TiB uncompressed per snapshot. March 2026 ran 1.97B pages / 344.64 TiB; April 2026 ran 2.19B pages / 379.2 TiB
Web graph: the combined Feb-Apr 2026 release has 269M nodes and 9.4B edges at host level, and 124.6M nodes and 4.8B edges at domain level³
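As a rough sanity check on the snapshot figures above, the quoted sizes and page counts imply an average uncompressed footprint per captured page. This is a minimal sketch using only the two snapshots quoted in this section; the function name is illustrative, and the result is a back-of-envelope average, not a Common Crawl statistic.

```python
TIB = 2**40  # bytes in one tebibyte

# (snapshot size in TiB, pages captured), as quoted above.
SNAPSHOTS = {
    "March 2026": (344.64, 1.97e9),
    "April 2026": (379.2, 2.19e9),
}

def avg_kib_per_page(size_tib: float, pages: float) -> float:
    """Average uncompressed size per captured page, in KiB."""
    return size_tib * TIB / pages / 1024

for name, (size_tib, pages) in SNAPSHOTS.items():
    print(f"{name}: ~{avg_kib_per_page(size_tib, pages):.0f} KiB per page")
```

Both snapshots come out a little under 190 KiB per page, which suggests the month-to-month swing in total size mostly tracks page count rather than a change in what is captured per page.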

The blocking trajectory is the more interesting story. Reputable-site AI blocking rose from 23% in September 2023 to roughly 60% by May 2025 per a public arXiv study. By August 2025, Cloudflare reported 2.5M+ websites had opted out of AI crawling. CCBot (Common Crawl’s bot) is the second-most-blocked AI agent on top-10K domains, behind only GPTBot⁸.

Two important caveats. First, blocking CCBot prevents future inclusion but does not remove content from prior snapshots. AI labs retain access to historical dumps regardless of current robots.txt posture. Second, publishers blocking AI crawlers via robots.txt experienced a roughly 23.1% monthly visit decline in early 2026, with no corresponding drop in AI citations. Blocking exits the pipeline going forward but loses the traffic too.
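To make the blocking mechanics concrete, here is a minimal sketch that checks which AI crawlers a given robots.txt turns away, using Python's standard-library `urllib.robotparser`. The sample robots.txt and the `blocked_agents` helper are hypothetical, and the agent list is just the crawler tokens named in this article. Note that this only models what a compliant crawler would do on future visits; as described above, it says nothing about content already in historical snapshots.

```python
from urllib.robotparser import RobotFileParser

# Crawler tokens named in this article; real robots.txt files may list others.
AI_AGENTS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended", "Applebot-Extended"]

# Hypothetical robots.txt for illustration: blocks two AI crawlers, allows the rest.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

def blocked_agents(robots_txt: str, url: str, agents=AI_AGENTS) -> list[str]:
    """Return the agents that robots_txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]

if __name__ == "__main__":
    print(blocked_agents(SAMPLE_ROBOTS_TXT, "https://example.com/article"))
```

With the sample file, only GPTBot and CCBot are reported as blocked; the other agents fall through to the wildcard rule.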

The frontier-model share question (what percentage of any specific 2026 model’s training is Common Crawl) is unanswerable from public data. Mozilla Foundation’s analysis showed Common Crawl was 60% of GPT-3’s weighted training mix. No equivalent disclosure exists for any successor. Best inference: Common Crawl (or CC-derived sets like FineWeb / RefinedWeb) remains a foundational input but its proportion has fallen as licensed feeds and synthetic data have grown. The exact share is unknown.

What Each Lab Discloses

What each lab discloses, drawn from public model cards, court filings, and announced deals:

OpenAI. GPT-3 (2020) was 60% filtered Common Crawl, 22% WebText2, 16% Books1+Books2, and 3% Wikipedia (the published shares are rounded and sum to 101%). GPT-4 and GPT-5 have no composition disclosed. Crawlers: GPTBot, OAI-SearchBot, ChatGPT-User. Per New York Times reporting, OpenAI used Whisper to transcribe 1M+ hours of YouTube video, a practice acknowledged internally as violating YouTube’s terms of service; class actions are ongoing⁹. 18+ publisher deals globally.
Anthropic. Stanford’s Foundation Model Transparency Index (December 2025) confirms five training inputs: publicly available internet data (cutoff March 2025 for Opus 4 / Sonnet 4), third-party non-public data, contractor-labeled data, opted-in user data (policy from September 2025), and internally generated data. Crawlers: ClaudeBot, anthropic-ai. The Bartz settlement (see below) is the defining input on books¹⁰.
Google / DeepMind. Gemini draws from web crawl, licensed corpora, Google product data, YouTube transcripts and frames, and synthetic data. Google confirmed the YouTube use to CNBC in June 2025: a “subset” of videos, used for both Gemini and Veo 3. Google-Extended is the AI-training opt-out, separate from search indexing. The Reddit deal ($60M/yr, February 2024) is in renewal negotiation with Reddit pushing for dynamic pricing. Springer Nature licensed for $23M one-time (July 2024)¹¹.
Meta. Llama (disclosed 2023) trains on Common Crawl via CCNet, plus C4, GitHub, Wikipedia (20 languages), Project Gutenberg, Books3, ArXiv, and Stack Exchange. Court filings in Kadrey v. Meta (unsealed January 2025) showed CEO sign-off on a LibGen download of 7.5M+ books and engineer scripts to strip copyright notices. On June 25, 2025, Judge Chhabria granted Meta summary judgment on fair use as to the 13 named plaintiffs, while leaving other authors free to sue⁵. Meta acquired a 49% stake in Scale AI for $14.8B in June 2025 ($29B implied valuation). A multi-publisher push in late 2025 added People Inc, CNN, Fox News, Fox Sports, USA Today network, Le Monde, and others; the News Corp deal runs up to $50M/yr (March 2026)¹².
xAI. Minimal disclosure. Grok trains on the X corpus (proprietary, real-time), web crawl, and synthetic data per public statements. No publisher deals publicly disclosed as of May 2026.
Apple. Applebot crawl (“hundreds of billions of pages”), licensed publisher data, curated open-source datasets, and synthetic data. Does not use private user data. Applebot-Extended is the AI-training opt-out. Named in Books3 / YouTube Subtitles-derived suits via The Pile.
Mistral. No historical dataset disclosure. Under the EU AI Act, Mistral Large must publish a summary of copyrighted training data. AFP multi-year deal (2025) supplies Le Chat with 2,300 stories per day across six languages.
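The GPT-3 figures in the OpenAI entry above are sampling shares in a weighted training mix, not raw corpus sizes. A minimal sketch of what a weighted mix means in practice follows: each training example is drawn from a source with probability proportional to its weight. The percentages are the ones quoted in this article; they sum to 101, presumably because of rounding, so the sketch normalizes them first. The helper names are illustrative, and nothing here reflects how any lab actually implements its data loader.

```python
import random
from collections import Counter

# GPT-3 sampling shares as quoted in this article (percent). They sum to 101,
# presumably due to rounding, so we normalize before sampling.
GPT3_MIX_PERCENT = {
    "Common Crawl (filtered)": 60,
    "WebText2": 22,
    "Books1+Books2": 16,
    "Wikipedia": 3,
}

def normalize(weights: dict[str, float]) -> dict[str, float]:
    """Rescale weights so they sum to 1."""
    total = sum(weights.values())
    return {name: value / total for name, value in weights.items()}

def sample_sources(weights: dict[str, float], n: int, seed: int = 0) -> Counter:
    """Draw n source labels with probability proportional to their weight."""
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[name] for name in names]
    return Counter(rng.choices(names, weights=probs, k=n))

if __name__ == "__main__":
    mix = normalize(GPT3_MIX_PERCENT)
    for source, count in sample_sources(mix, 10_000).most_common():
        print(f"{source}: {count}")
```

The design point is that a small, high-quality source can be up-weighted well beyond its share of raw bytes, which is why a composition share and a corpus size are different disclosures.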

The Licensing Deals Table

This is the most-requested artifact in coverage of AI training data, because it shows where the money has actually flowed. Public deals as of May 2026:

Counterparty	Lab	Value	Year	Scope
News Corp	OpenAI	$250M+ over 5 yrs	May 2024	WSJ, NY Post, Times of London, archives
News Corp	Meta	up to $50M/yr	Mar 2026	AI products
Reddit	Google	$60M/yr (renegotiating)	Feb 2024 → 2026	Real-time content for Gemini
Reddit	OpenAI	~$70M/yr (est.)	2024	Comparable structure
Axel Springer	OpenAI	“tens of millions EUR/yr”	Dec 2023	Politico, Bild, Business Insider
Financial Times	OpenAI	$5-10M/yr	Apr 2024	Full archive incl. paywall
Le Monde	OpenAI	undisclosed	Mar 2024	Full corpus
AP	OpenAI	undisclosed	Jul 2023	First major news deal
AP	Google	undisclosed	2025	Gemini real-time
Vox Media	OpenAI	undisclosed	May 2024	Vox, Verge, Eater, NY Mag
The Atlantic	OpenAI	undisclosed	May 2024	Articles + product input
Dotdash Meredith	OpenAI	$16M+/yr fixed	May 2024	People, Investopedia, Allrecipes
Reuters	(Meta likely)	undisclosed (multi-tier)	2024	Inferred from filings
Springer Nature	Google	$23M one-time	Jul 2024	Academic
Wiley	undisclosed lab	$23M one-time	2024	Academic
Taylor & Francis	Microsoft	$10M upfront + recurring	2024	Academic
Shutterstock	OpenAI	up to $250M by 2027	2024	Visual; 6-yr
Shutterstock	Meta, Amazon, Google, Apple	$104M (2023) → $138M (2024)	Ongoing	Visual
Getty Images	Perplexity	undisclosed	Oct 2025	Image display
Stack Overflow	Google, OpenAI	undisclosed	2024	Q&A
AFP	Mistral	multi-year	2025	2,300 stories/day, 6 langs
The Guardian	OpenAI	undisclosed	2025	Citations + summaries
Washington Post	OpenAI	undisclosed	2025	Summaries, quotes, links
Schibsted Media	OpenAI	undisclosed	2025	Norwegian news
Axios	OpenAI	3-yr deal + newsroom funding	2025	First newsroom-funding deal
NYT	Amazon	undisclosed	2025	Alexa / Rufus (NYT separately suing OpenAI)
Condé Nast	Amazon	multi-year	2025	Rufus shopping assistant
Hearst	Amazon	multi-year	2025	Rufus shopping assistant
People Inc, CNN, Fox News, USA Today, Le Monde, etc.	Meta	undisclosed	Dec 2025	Multi-publisher push

OpenAI maintains 18+ publisher deals globally. Microsoft’s Publisher Content Marketplace (pilot September 2025) introduces usage-based royalties paid per token. Perplexity Revenue Share (July 2024) pays variable ad-revenue share to cited publishers. Reddit was the most-cited domain by Google AI Overviews and Perplexity from August 2024 to June 2025, which strengthens its negotiating leverage¹³.

Public Dataset Trajectories

Composition of the largest open datasets a non-frontier lab could train on today:

Dataset	Tokens	Year	Notes
FineWeb	18.5T (orig 15T)	Apr 2024	96 CC dumps 2013-2024; ODC-By 1.0
FineWeb-Edu	1.3T	2024	Quality-filtered FineWeb (~92% removed); matches MMLU of models trained on 10× more C4/Dolma tokens
FineWeb-2	multilingual	2025	Multilingual extension
RedPajama-V2	30T raw (~20T post-filter)	2023	84 CC crawls; largest by raw token count
Dolma (AI2)	3T → 5T+	2023-25	Web + academic + code + books; powers OLMo
RefinedWeb (TII)	600B public / 3-6T full	2023	CC + MDR pipeline
The Stack v2 (BigCode)	67.5TB / 3.3-4.3T training tokens	2024	600+ languages, permissive licenses
The Pile (legacy)	~340B	Dec 2020	Books3 DMCA’d Aug 2023
C4 (legacy)	~150B	2019	CC filtered

The largest open dataset in 2026 is RedPajama-V2 by raw tokens (30T) and FineWeb by clean tokens (18.5T). The most consequential finding is that quality-filtered subsets can stand in for far larger raw corpora: models trained on FineWeb-Edu (1.3T tokens) match the MMLU scores of models trained on roughly 10× more C4 or Dolma tokens. Frontier labs have moved decisively toward filter-then-mix pipelines over scale-only ingestion¹⁴.
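A filter-then-mix pipeline can be sketched in a few lines: score every document, drop the ones below a quality threshold, then cap how much each source contributes. The sketch below is an illustration under loud assumptions. `toy_quality_score` is a hypothetical heuristic standing in for the learned quality classifiers that real pipelines use, and the threshold, cap, and sample documents are made up for the example.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    source: str
    text: str

def toy_quality_score(text: str) -> float:
    """Hypothetical stand-in for a learned quality classifier: the fraction of
    whitespace-separated tokens that are purely alphabetic (after trimming
    edge punctuation), or 0.0 for texts under five tokens."""
    words = text.split()
    if len(words) < 5:
        return 0.0
    alphabetic = sum(1 for w in words if w.strip(".,;:!?\"'()").isalpha())
    return alphabetic / len(words)

def filter_then_mix(docs, threshold: float, per_source_cap: int):
    """Keep docs scoring >= threshold, then cap each source's contribution."""
    kept, counts = [], {}
    for doc in docs:
        if toy_quality_score(doc.text) < threshold:
            continue  # filter step: drop low-scoring documents
        if counts.get(doc.source, 0) >= per_source_cap:
            continue  # mix step: this source has reached its quota
        counts[doc.source] = counts.get(doc.source, 0) + 1
        kept.append(doc)
    return kept

# Made-up documents for illustration only.
SAMPLE_DOCS = [
    Doc("web", "Photosynthesis converts light energy into chemical energy in plants."),
    Doc("web", "BUY NOW!!! $$$ 50% off >>> click http://x.y 1234 5678"),
    Doc("web", "Too short."),
    Doc("code", "The function returns the sum of two integers."),
    Doc("web", "Mitochondria produce most of the chemical energy in a cell."),
]

if __name__ == "__main__":
    for doc in filter_then_mix(SAMPLE_DOCS, threshold=0.7, per_source_cap=1):
        print(doc.source, "|", doc.text)
```

Filtering before mixing means the per-source quotas are spent only on documents that passed the quality bar, which is the ordering the name implies.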

Synthetic Data

Synthetic data is the layer growing fastest. The strongest public case study is Microsoft’s Phi-4 (December 2024): a 14B-parameter model whose training mix includes 400B synthetic tokens across 50 categories and which matches Llama-3.1-405B on reasoning benchmarks¹⁵. A separate scaling-law paper (arXiv 2510.01631) found that a mix of 1/3 rephrased synthetic and 2/3 natural web data reaches equal validation loss 5-10× faster at large data budgets.

Gartner predicted in 2021 that by 2024 60% of data used in AI/analytics projects would be synthetic, up from 1% in 2021. This was a forecast rather than a measured figure, it includes non-frontier-pretraining uses, and it deserves caution. Epoch AI’s analysis projects exhaustion of quality public text between 2026 and 2032 with 80% confidence, with the effective stock estimated at ~300T tokens¹⁶.

For frontier pretraining specifically, no lab discloses synthetic share. Best-public-knowledge inference range for synthetic share of frontier pretraining plus post-training compute: 20-50%, growing. This range is an inference, not a disclosure.

The Contractor Layer

The fastest-growing and least-discussed input is human-curated data from contractor pipelines. Current revenue and valuation figures:

Vendor	Revenue / Run-rate	Valuation	Notes
Surge AI	$1.0-1.2B (2024)	$25B (Jul 2025)	Bootstrapped; surpassed Scale; 50K experts
Scale AI	$870M (2024), $2B target (2025)	$29B (Jun 2025)	Meta 49% stake
Mercor	$450-500M ARR (Oct 2025)	$10B (Series C)	30K+ experts; $1.5M/day to contractors
Snorkel AI	$36.8M (2024) → $148M (2025)	$1B (2021)	Programmatic labeling
Toloka AI	undisclosed	n/a	195+ countries; Yandex spinoff
Appen	~$233M (2025, roughly flat YoY)	down 99% from peak	Lost Google Jan 2024; in decline

The RLHF platform market is projected to grow from $2.8B in 2025 to $18.6B by 2034⁷. The structural shift is what matters. Pre-2023 frontier training was bottlenecked by tokens; post-2024 it is bottlenecked by quality tokens, meaning licensed data, expert-generated demonstrations, and synthetic / distilled reasoning. Spend on labeling contractors plus licensing is now plausibly comparable to a meaningful fraction of GPU spend at the marginal model.
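The market projection above is easier to compare with other growth figures once it is expressed as a compound annual growth rate. A minimal sketch, assuming the two endpoints quoted above and simple annual compounding over the nine years between them (the function name is illustrative):

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by two endpoint values."""
    return (end_value / start_value) ** (1 / years) - 1

# Endpoints quoted above, in billions of USD.
rate = implied_cagr(2.8, 18.6, 2034 - 2025)
print(f"Implied CAGR: {rate:.1%}")
```

That works out to roughly 23% per year, sustained for nine years. It is a property of the forecast, not a measured growth rate.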

Figure: Following the training-data money. Scale AI’s implied valuation reached $29 billion after Meta bought 49% for $14.8 billion in June 2025; News Corp is the top disclosed AI-licensing deal at up to $50 million per year; Appen’s 2025 revenue was about $233 million and has been declining since it lost Google.

The Lawsuits That Set the Boundary

Two 2025 rulings define the legal posture for 2026.

Bartz v. Anthropic settled for $1.5B in September 2025; Judge Alsup granted preliminary approval on September 25. The settlement covered ~500,000 books from LibGen and Pirate Library Mirror at roughly $3,000 per book. Anthropic must destroy the original torrented files. Judge Alsup’s reasoning matters more than the dollar figure: training on legally acquired books is fair use; pirated acquisition is not. The settlement does not release future-act claims or output-based claims⁴.
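The per-book figure is simply the settlement total divided by the number of covered works. Because the ~500,000 count is approximate, this short sketch shows how the average moves if the final count differs. The alternative counts (450K and 550K) are hypothetical values chosen for illustration, not reported figures.

```python
def per_work_payout(total_usd: float, works: int) -> float:
    """Average payout per covered work."""
    return total_usd / works

SETTLEMENT_USD = 1_500_000_000  # $1.5B, as reported above

# 500,000 is the approximate count from the settlement coverage;
# the other two counts are hypothetical, for sensitivity only.
for works in (450_000, 500_000, 550_000):
    payout = per_work_payout(SETTLEMENT_USD, works)
    print(f"{works:,} works -> ${payout:,.0f} per work")
```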

In Getty Images v. Stability AI (November 4, 2025), the UK High Court rejected Getty’s secondary copyright infringement claim. It accepted that Getty works were used in training but ruled that the model itself does not “store” the works and so is not an infringing copy. The court upheld trademark infringement for outputs containing Getty marks¹⁷.

Active in 2026: New York Times v. OpenAI / Microsoft; Britannica / Merriam-Webster v. OpenAI (March 2026); David Millette class actions vs. Google and OpenAI on YouTube transcription; Petryazhna v. OpenAI on YouTube. The Anthropic settlement is the precedent everyone is litigating around.

What’s Still Not Public

Six things the public record does not contain:

Composition shares for any post-GPT-3 frontier model. No numeric breakdown for GPT-4 / 5, Claude 4.x, Gemini 2.5 / 3, Llama 4, or Grok.
Licensing values for many deals: AP/OpenAI, Atlantic, Vox, Le Monde, Reuters, Apple’s publisher list, all 2025 Meta deals.
Synthetic data share of frontier pretraining. No lab discloses.
YouTube subset sizes Google uses for Gemini / Veo training.
Internal lab crawler corpora. GPTBot, ClaudeBot, Applebot-Extended dataset sizes never disclosed.
Books and academic sources beyond licensed deals. Strong inference of pirate-library ingestion across multiple labs; only Anthropic and Meta have litigated specifics.

What is blocked or restricted: News Corp content for non-licensed labs; Reddit for non-Google, non-OpenAI labs; paywalled academic journals; most major newsrooms post-2024; Cloudflare-managed sites under default-deny robots.txt; 2.5M+ sites that opted out of AI crawling.

What This Means

“Where AI training data comes from” in 2026 is a question about which deals, which contractors, and which synthetic pipelines — far more than “which crawl.” The center of gravity is no longer the open web. It is licensed publisher feeds, contractor-curated demonstrations, and self-generated reasoning traces. The web is still there in the foundation, but it is a smaller share of what makes a frontier model better than its predecessor.

For publishers, the implication is that the licensing path matters more than the blocking path. Blocking AI crawlers via robots.txt prevents future inclusion but loses the traffic too. Licensing extracts value but binds the lab. The choice is strategic, not technical. For background on the standardized signaling layer that makes both paths cleaner, see “Understanding AIPREF” and “AIPREF After Toronto.” For the unit economics behind why scraping continues alongside licensing, see “How Much Does It Cost to Scrape the Web at Scale?”

Last updated: June 2026

References

1. Common Crawl Foundation. "Overview." https://commoncrawl.org/overview
2. Mozilla Foundation. "Common Crawl and Generative AI." https://www.mozillafoundation.org/en/research/library/generative-ai-training-data/common-crawl/
3. Common Crawl. "Crawl Statistics 2026." https://commoncrawl.github.io/cc-crawl-statistics/
4. NPR (Sept 2025). "Anthropic's $1.5B Authors Settlement." https://www.npr.org/2025/09/05/g-s1-87367/anthropic-authors-settlement-pirated-chatbot-training-material
5. TechCrunch (Jan 2025). "Zuckerberg Approved LibGen for Llama, Filing Claims." https://techcrunch.com/2025/01/09/mark-zuckerberg-gave-metas-llama-team-the-ok-to-train-on-copyrighted-works-filing-claims/
6. Sacra. "Surge AI Revenue and Valuation." https://sacra.com/c/surge-ai/
7. TechCrunch (Oct 2025). "Mercor Quintuples Valuation to $10B in Series C." https://techcrunch.com/2025/10/27/mercor-quintuples-valuation-to-10b-with-350m-series-c/
8. Cloudflare. "AI Bot Blocking Statistics 2025." https://blog.cloudflare.com/
9. New York Times (Apr 2024). "How Tech Giants Cut Corners to Harvest Data for AI." YouTube Whisper transcription coverage. https://www.nytimes.com/
10. Stanford HAI / FMTI (Dec 2025). "Anthropic Transparency Report." https://crfm.stanford.edu/fmti/December-2025/company-reports/Anthropic_FinalReport_FMTI2025.html
11. CNBC (Jun 2025). "Google Used YouTube Videos to Train Gemini and Veo 3." https://www.cnbc.com/2025/06/19/google-youtube-ai-training-veo-3.html
12. Press Gazette. "People Inc Signs AI Licensing Deal with Meta." https://pressgazette.co.uk/north-america/people-inc-signs-ai-licensing-deal-with-meta/
13. Columbia Journalism Review. "Reddit's AI Licensing Position." https://www.cjr.org/analysis/reddit-winning-ai-licensing-deals-openai-google-gemini-answers-rsl.php
14. HuggingFace. "FineWeb Dataset." https://huggingface.co/datasets/HuggingFaceFW/fineweb
15. Microsoft Research (Dec 2024). "Phi-4 Technical Report." https://www.microsoft.com/en-us/research/wp-content/uploads/2024/12/P4TechReport.pdf
16. Epoch AI. "Will We Run Out of Data?" https://epoch.ai/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data
17. Mayer Brown (Nov 2025). "Getty Images v. Stability AI: What the High Court's Decision Means." https://www.mayerbrown.com/en/insights/publications/2025/11/getty-images-v-stability-ai-what-the-high-courts-decision-means-for-rights-holders-and-ai-developers
18. Authors Guild. "What Authors Need to Know About the Anthropic Settlement." https://authorsguild.org/advocacy/artificial-intelligence/what-authors-need-to-know-about-the-anthropic-settlement/
19. Digiday. "Timeline of Major Publisher-AI Deals 2025." https://digiday.com/media/a-timeline-of-the-major-deals-between-publishers-and-ai-tech-companies-in-2025/
20. Together AI. "RedPajama-V2 Announcement." https://www.together.ai/blog/redpajama-data-v2