We Can’t Have Nice Things Because of AI Scrapers
The open web is dying, and AI companies are holding the knife. Let me explain what’s happening and why it matters.
The Problem
AI companies need training data. Lots of it. So they crawl the entire web, ignoring robots.txt, overwhelming servers, and extracting everything: text, images, code, personal blogs, forum posts, everything.
In 2025, OpenAI’s crawler (GPTBot) hit my personal blog hard enough that I thought I was under DDoS. Turns out, no—just aggressive scraping for GPT-5 training.
The Numbers
From my server logs (January 2026):
GPTBot: 45,000 requests/day
ClaudeBot: 8,000 requests/day
GoogleBot: 2,000 requests/day (normal crawling)
Bing: 1,500 requests/day
Human visitors: 500 requests/day
AI bots generated over 100x more traffic than real humans. And they don’t click ads, don’t convert, don’t engage. They just take.
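Numbers like the ones above can be pulled straight from an access log. This is a minimal sketch that tallies requests per crawler, assuming the common/combined log format where the user agent is the last quoted field; the `BOT_NAMES` list is illustrative, not exhaustive.

```python
import re
from collections import Counter

# Matches the last quoted field on a combined-format log line (the user agent).
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

# Illustrative substrings identifying the crawlers discussed above.
BOT_NAMES = ["GPTBot", "ClaudeBot", "Googlebot", "bingbot"]

def tally_bots(log_lines):
    """Count requests per known bot; everything else lands in 'other'."""
    counts = Counter()
    for line in log_lines:
        m = UA_PATTERN.search(line)
        ua = m.group(1) if m else ""
        for bot in BOT_NAMES:
            if bot in ua:
                counts[bot] += 1
                break
        else:
            counts["other"] += 1
    return counts
```

Run it over a day's log and divide by the "other" bucket to get your own bot-to-human ratio.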
The Response
Sites are locking down:
- Paywalls everywhere: Can’t scrape what you can’t access
- Aggressive anti-bot measures: CAPTCHAs, rate limits, fingerprinting
- robots.txt arms race: Constantly updating to block new AI bots
- Legal threats: Companies suing AI firms for copyright violations
The result? The open web becomes the closed web.
My robots.txt Evolution
2020:
User-agent: *
Crawl-delay: 1
2026:
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: Bytespider
Disallow: /
# ...and 20 more
It’s exhausting.
The Technical Impact
1. Server Costs
AI scrapers don’t respect rate limits. They spawn hundreds of concurrent connections, maxing out:
- CPU (parsing requests)
- Bandwidth (serving responses)
- Database connections (if you’re dynamic)
Small sites can’t afford this. They go offline or lock down.
2. Cache Poisoning
AI scrapers hit every URL variation:
/article?utm_source=ai
/article?session=bot123
/article?_=1234567890
This pollutes CDN caches with infinite variations of the same content.
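One mitigation is to normalize URLs before using them as cache keys, so bot-generated query-string variations collapse into a single entry. A minimal sketch, assuming a junk-parameter list you maintain yourself (the names below are illustrative):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative list of parameters to strip; extend for your own traffic.
JUNK_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "session", "_"}

def cache_key(url):
    """Drop junk query parameters and sort the rest for a stable cache key."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in JUNK_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(sorted(query)), ""))

# All three bot variations above map to the same key:
print(cache_key("https://example.com/article?utm_source=ai"))
print(cache_key("https://example.com/article?session=bot123"))
print(cache_key("https://example.com/article?_=1234567890"))
```

Most CDNs can do the same thing declaratively by configuring which query parameters participate in the cache key.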
3. Analytics Noise
When, as in the logs above, over 99% of your traffic is bots, analytics become useless. You can’t track real user behavior when it’s drowned out by scrapers.
The Ethical Problem
AI companies argue: “It’s public data, we can use it.” But:
- No consent: Users and creators never agreed to this
- No compensation: They profit from our content without paying
- No attribution: Models don’t cite sources
- No opt-out: Respecting robots.txt is optional for them
Compare that to Google Search, which drives traffic back to you. AI models give users the answer directly. No click-through. No attribution. You get nothing.
What We’re Losing
The open web worked because of an implicit deal:
- Creators share content freely
- Search engines index it
- Users discover content
- Creators get traffic/revenue
AI breaks this:
- Creators share content freely
- AI companies scrape it
- Users get answers from AI
- Creators get nothing
Without the feedback loop, why publish openly?
Solutions?
Optimistic: Licensing
Sites could negotiate with AI companies:
- “Pay $X/month to scrape our content”
- “Include attribution in responses”
- “Respect our rate limits”
Some news orgs are doing this. Most small sites can’t.
Realistic: Arms Race
- More aggressive bot detection
- Honeypot links to identify scrapers
- Rate limiting by ASN
- Legal action (expensive, slow)
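The honeypot idea deserves a sketch: publish a link that robots.txt disallows and that no human would ever follow (hidden via CSS, say). Any client requesting it is, by definition, ignoring robots.txt, so its IP goes on a blocklist. The path name and in-memory set below are illustrative; a real setup would persist the list and feed it to a firewall or rate limiter.

```python
# Illustrative trap path; must also appear as Disallow: /dont-crawl-this
# in robots.txt so compliant crawlers never touch it.
TRAP_PATH = "/dont-crawl-this"
blocked_ips = set()

def handle_request(path, client_ip):
    """Return an HTTP status code for the request."""
    if path == TRAP_PATH:
        blocked_ips.add(client_ip)  # caught ignoring robots.txt
        return 403
    if client_ip in blocked_ips:
        return 429  # previously trapped: throttle everything it asks for
    return 200
```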
Pessimistic: Closed Web
Everything moves behind:
- Login walls
- Paywalls
- CAPTCHAs
- Private communities
The open web becomes a walled garden. We all lose.
My Approach
For now, I’m blocking AI scrapers in robots.txt and using rate limiting. If they ignore it (many do), I’ll add more aggressive measures:
# Classify AI bots by user agent (empty value = not rate-limited)
map $http_user_agent $ai_bot {
    default "";
    ~*(GPTBot|ClaudeBot) $http_user_agent;
}

# Rate limit AI bots to 1 request/second
limit_req_zone $ai_bot zone=ai_bots:10m rate=1r/s;

location / {
    limit_req zone=ai_bots burst=5 nodelay;

    # Escalation: block outright if they keep ignoring robots.txt
    # if ($http_user_agent ~* "GPTBot|ClaudeBot") {
    #     return 429;
    # }
}
But I shouldn’t have to do this. AI companies should respect the web’s norms: robots.txt, rate limits, consent.
The Bigger Picture
This is about more than bots. It’s about who gets to profit from human knowledge. The web was built by millions of people sharing freely. Now, a handful of companies are monetizing all of it without giving back.
We can’t have nice things because of AI scrapers.
And that sucks.