Semiautonomous
Systems

Rules of engagement for AI crawlers that sites can actually enforce.

The shift

AI crawlers extract value from public content without sending traffic back. Major AI companies scrape billions of pages to train systems that compete directly with the creators of that content. The economics are inverted: creators bear the hosting and bandwidth costs while AI companies capture the value, without compensation or even attribution.

Publishers like The New York Times and CNN have responded by blocking crawlers like OpenAI's GPTBot and Common Crawl's CCBot. But blocking is binary and blunt: there's no way to say "yes to search indexing, no to AI training," so a site that wants search referral traffic can't cleanly refuse training use. The tools available today don't give sites real leverage.

The robots.txt file lets sites declare which crawlers may access which parts of a site, but compliance is voluntary: a crawler can ignore it with no technical consequence. It also has no vocabulary for nuanced preferences like "search vs. training." Standard legal language in terms of service doesn't change crawler behavior either, and violations are hard to detect and harder to prove.
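For instance, the only vocabulary robots.txt offers is per-crawler allow/disallow by path. A site that welcomes search indexing but refuses training has no directive for that distinction:

```
# robots.txt: block OpenAI's training crawler outright...
User-agent: GPTBot
Disallow: /

# ...and welcome a search crawler. There is no directive to say
# "index this for search, but don't train on it."
User-agent: Googlebot
Allow: /
```

And because compliance is voluntary, even these coarse rules only bind crawlers that choose to honor them.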

Existing tools fail in three ways:

  • No expressiveness: There's no way to declare nuanced preferences like "search yes, training no"
  • No enforcement: robots.txt is voluntary, so ignoring it carries no technical consequence
  • No leverage: Legal language doesn't change crawler behavior, and violations are hard to detect and prove

Our thesis

Semiautonomous Systems builds infrastructure that lets sites set, enforce, and measure rules of engagement for AI crawlers. The goal is to rebalance power so creators and publishers have a say in how their content is used.

We believe rules of engagement should be expressible, enforceable, and measurable: sites declare nuanced preferences, technical controls at the network edge (the CDN or proxy layer, before requests reach origin servers) enforce them, and instrumentation records who complied. Honest, authorized access should become easier and cheaper than cheating.
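As a rough sketch of the enforcement idea (the preference table and purpose labels below are illustrative, not a real product API): an edge layer maps each identified crawler to the purposes a site permits, and denies everything else before the request reaches the origin.

```python
# Sketch of an edge-layer policy check. The PREFERENCES table and
# purpose labels are hypothetical examples, not a standardized format.

PREFERENCES = {
    # crawler identity -> set of purposes the site allows
    "Googlebot": {"search"},  # indexing for search results: allowed
    "GPTBot": set(),          # no permitted purposes: blocked outright
    "CCBot": set(),
}

def decide(crawler: str, purpose: str) -> str:
    """Return 'allow' or 'deny' for a request at the edge."""
    allowed = PREFERENCES.get(crawler)
    if allowed is None:
        return "deny"  # unknown crawlers get no default access
    return "allow" if purpose in allowed else "deny"

print(decide("Googlebot", "search"))  # allow
print(decide("GPTBot", "training"))   # deny
```

The point of running this at the edge rather than in the application is that denied requests never consume origin resources, so honest traffic stays fast while violators pay the cost.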

This shifts the economics: instead of "scrape everything and ask forgiveness later," crawlers must respect declared preferences or face technical consequences. Sites gain leverage through technical enforcement, not just legal text.

  • Expressible: Sites can declare nuanced preferences beyond simple allow or disallow.
  • Enforceable: Technical controls at the network edge make violating rules costly and unattractive.
  • Measurable: Instrumentation tracks compliance, detects violations, and provides evidence for accountability.

From "please don't" to technical enforcement.

What we're building

We ship infrastructure that makes rules of engagement real through standards, technical controls, measurement, and detection.

  • Standards: Contributing to the development of open standards that let sites express preferences and give crawlers a clear path to compliance.
  • Technical controls: Enforcement at CDN (Content Delivery Network) or reverse proxy layers that stops violators before they reach your application. This keeps response times fast for real users while enforcing your preferences.
  • Measurement: Analytics showing who's crawling, how often, and whether they're respecting your preferences. This evidence supports accountability and helps you optimize your rules over time.
  • Detection: Multi-signal fingerprinting identifies AI crawlers even when they rotate user agents (the identifier browsers send) or use residential proxies (hiding behind home IP addresses). Behavioral analysis catches stealth patterns that simple detection methods miss.
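The detection idea can be sketched as a multi-signal score (the signals and thresholds below are illustrative, not our actual model): no single signal identifies a crawler, but several weak signals combined are hard to fake at once.

```python
from dataclasses import dataclass

# Illustrative multi-signal scoring: each signal alone is weak,
# but together they separate crawlers from real browser traffic.
@dataclass
class RequestProfile:
    requests_per_minute: float  # sustained request rate from one client
    fetched_robots_txt: bool    # polite crawlers fetch robots.txt first
    loaded_page_assets: bool    # real browsers also load CSS/JS/images
    unique_path_ratio: float    # systematic sweeps rarely revisit URLs

def crawler_score(p: RequestProfile) -> int:
    """Return a 0-100 score; higher means more crawler-like."""
    score = 0
    if p.requests_per_minute > 60:  # faster than human browsing
        score += 30
    if not p.fetched_robots_txt:    # never consulted the declared rules
        score += 20
    if not p.loaded_page_assets:    # fetches HTML only
        score += 30
    if p.unique_path_ratio > 0.9:   # breadth-first site sweep
        score += 20
    return score

# A stealth crawler can rotate user agents and proxy its IP address,
# but its behavior still looks like a crawler:
stealth = RequestProfile(120, False, False, 0.95)
print(crawler_score(stealth))  # 100
```

Behavioral signals like these survive user-agent rotation and residential proxies precisely because they measure what the client does, not what it claims to be.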

Our first product, VENOM, brings these ideas to individual content surfaces.

We're building this infrastructure with design partners: content sites, publishers, and platforms who want to experiment with rules of engagement and anti-scraping strategy. If you're dealing with unauthorized AI crawlers, want to enforce preferences that declare how your content can be used, or need better visibility into who's accessing your content, we'd like to hear from you.

Get in touch