We Can’t Have Nice Things Because of AI Scrapers
The open web is dying, and AI companies are holding the knife. Let me explain what’s happening and why it matters.
The Problem
AI companies need training data. Lots of it. So they crawl the entire web, ignoring robots.txt, overwhelming servers, and extracting everything: text, images, code, personal blogs, forum posts, everything.
In 2025, OpenAI’s crawler (GPTBot) hit my personal blog hard enough that I thought I was under DDoS. Turns out, no—just aggressive scraping for GPT-5 training.
The Numbers
From my server logs (January 2026):
GPTBot: 45,000 requests/day
ClaudeBot: 8,000 requests/day
GoogleBot: 2,000 requests/day (normal crawling)
Bing: 1,500 requests/day
Human visitors: 500 requests/day
AI bots generated over 100x more traffic than real humans. And they don’t click ads, don’t convert, don’t engage. They just take.
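Numbers like the ones above can be pulled straight from an access log. This is a minimal sketch that tallies requests per crawler, assuming the common/combined log format where the user agent is the last quoted field; the `BOT_NAMES` list is illustrative, not exhaustive.

```python
import re
from collections import Counter

# Matches the last quoted field on a combined-format log line (the user agent).
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

# Illustrative substrings identifying the crawlers discussed above.
BOT_NAMES = ["GPTBot", "ClaudeBot", "Googlebot", "bingbot"]

def tally_bots(log_lines):
    """Count requests per known bot; everything else lands in 'other'."""
    counts = Counter()
    for line in log_lines:
        m = UA_PATTERN.search(line)
        ua = m.group(1) if m else ""
        for bot in BOT_NAMES:
            if bot in ua:
                counts[bot] += 1
                break
        else:
            counts["other"] += 1
    return counts
```

Run it over a day's log and divide by the "other" bucket to get your own bot-to-human ratio.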
The Response
Sites are locking down:
- Paywalls everywhere: Can’t scrape what you can’t access
- Aggressive anti-bot measures: CAPTCHAs, rate limits, fingerprinting
- robots.txt arms race: Constantly updating to block new AI bots
- Legal threats: Companies suing AI firms for copyright violations
The result? The open web becomes the closed web.
My robots.txt Evolution
2020:
User-agent: *
Crawl-delay: 1
2026:
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: Bytespider
Disallow: /
# ...and 20 more
It’s exhausting.
The Technical Impact
1. Server Costs
AI scrapers don’t respect rate limits. They spawn hundreds of concurrent connections, maxing out:
- CPU (parsing requests)
- Bandwidth (serving responses)
- Database connections (if you’re dynamic)
Small sites can’t afford this. They go offline or lock down.
2. Cache Poisoning
AI scrapers hit every URL variation:
/article?utm_source=ai
/article?session=bot123
/article?_=1234567890
This pollutes CDN caches with infinite variations of the same content.
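One mitigation is to normalize URLs before using them as cache keys, so bot-generated query-string variations collapse into a single entry. A minimal sketch, assuming a junk-parameter list you maintain yourself (the names below are illustrative):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative list of parameters to strip; extend for your own traffic.
JUNK_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "session", "_"}

def cache_key(url):
    """Drop junk query parameters and sort the rest for a stable cache key."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in JUNK_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(sorted(query)), ""))

# All three bot variations above map to the same key:
print(cache_key("https://example.com/article?utm_source=ai"))
print(cache_key("https://example.com/article?session=bot123"))
print(cache_key("https://example.com/article?_=1234567890"))
```

Most CDNs can do the same thing declaratively by configuring which query parameters participate in the cache key.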
3. Analytics Noise
When, as in the logs above, over 99% of your traffic is bots, analytics become useless. You can’t track real user behavior when it’s drowned out by scrapers.
The Ethical Problem
AI companies argue: “It’s public data, we can use it.” But:
- No consent: Users and creators never agreed to this
- No compensation: They profit from our content without paying
- No attribution: Models don’t cite sources
- No opt-out: Respecting robots.txt is optional for them
Compare that to Google Search, which drives traffic back to you. AI models give users the answer directly. No click-through. No attribution. You get nothing.
What We’re Losing
The open web worked because of an implicit deal:
- Creators share content freely
- Search engines index it
- Users discover content
- Creators get traffic/revenue
AI breaks this:
- Creators share content freely
- AI companies scrape it
- Users get answers from AI
- Creators get nothing
Without the feedback loop, why publish openly?
Solutions?
Optimistic: Licensing
Sites could negotiate with AI companies:
- “Pay $X/month to scrape our content”
- “Include attribution in responses”
- “Respect our rate limits”
Some news orgs are doing this. Most small sites can’t.
Realistic: Arms Race
- More aggressive bot detection
- Honeypot links to identify scrapers
- Rate limiting by ASN
- Legal action (expensive, slow)
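The honeypot idea deserves a sketch: publish a link that robots.txt disallows and that no human would ever follow (hidden via CSS, say). Any client requesting it is, by definition, ignoring robots.txt, so its IP goes on a blocklist. The path name and in-memory set below are illustrative; a real setup would persist the list and feed it to a firewall or rate limiter.

```python
# Illustrative trap path; must also appear as Disallow: /dont-crawl-this
# in robots.txt so compliant crawlers never touch it.
TRAP_PATH = "/dont-crawl-this"
blocked_ips = set()

def handle_request(path, client_ip):
    """Return an HTTP status code for the request."""
    if path == TRAP_PATH:
        blocked_ips.add(client_ip)  # caught ignoring robots.txt
        return 403
    if client_ip in blocked_ips:
        return 429  # previously trapped: throttle everything it asks for
    return 200
```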
Pessimistic: Closed Web
Everything moves behind:
- Login walls
- Paywalls
- CAPTCHAs
- Private communities
The open web becomes a walled garden. We all lose.
My Approach
For now, I’m blocking AI scrapers in robots.txt and using rate limiting. If they ignore it (many do), I’ll add more aggressive measures:
# Classify AI bots by user agent (empty value = not rate-limited)
map $http_user_agent $ai_bot {
    default "";
    ~*(GPTBot|ClaudeBot) $http_user_agent;
}

# Rate limit AI bots to 1 request/second
limit_req_zone $ai_bot zone=ai_bots:10m rate=1r/s;

location / {
    limit_req zone=ai_bots burst=5 nodelay;

    # Escalation: block outright if they keep ignoring robots.txt
    # if ($http_user_agent ~* "GPTBot|ClaudeBot") {
    #     return 429;
    # }
}
But I shouldn’t have to do this. AI companies should respect the web’s norms: robots.txt, rate limits, consent.
The Bigger Picture
This is about more than bots. It’s about who gets to profit from human knowledge. The web was built by millions of people sharing freely. Now, a handful of companies are monetizing all of it without giving back.
We can’t have nice things because of AI scrapers.
And that sucks.