AI Crawlers

22+ AI bots. One policy.

GPTBot, ClaudeBot, PerplexityBot, Bytespider and 18 others identified out of the box. Block, allow, throttle, or monetize — per crawler, per path.

22+ AI Crawlers

Complete database of known AI training and search crawlers.

Reverse DNS Verify

Confirm crawlers are genuine. Catch spoofed Googlebots.

robots.txt + ai.txt

Generate both files for maximum AI crawler coverage.

TDM Headers

EU-compliant Text & Data Mining reservation headers.

Compliance Tracking

Monitor which AI companies respect your opt-out preferences.

Selective Blocking

Block AI training but allow AI search. Per-crawler control.

Traffic Analytics

Breakdown by AI provider so you see exactly who crawls what.

Content Value

Estimate the value of content extracted by AI crawlers.

22+
AI crawlers identified
5
Major training corpora
rDNS
Identity verified
Daily
Signature updates
Identification

Match the signature, verify the source.

User-agent identification is the first step. Reverse DNS confirms the crawler is who it claims to be — without rDNS we'd be blocking based on string matching alone.

Step 01

Identify the crawler

The user-agent signature is matched against our maintained database of 22 known AI crawlers, updated as new ones appear. No false matches on partial UAs.
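As a rough illustration of matching without false hits on partial UAs, the sketch below looks for a crawler token only when it stands alone, so a string like "MyGPTBotter" is not mistaken for GPTBot. The signature list is a hypothetical subset of the catalog and `identify_crawler` is an illustrative name, not the product's API.

```python
import re
from typing import Optional

# Hypothetical subset of the signature database, for illustration only.
KNOWN_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Bytespider", "OAI-SearchBot", "CCBot"]

# A token matches only when it is not embedded in a longer product name:
# the characters on either side must not be letters, digits, or hyphens.
_PATTERNS = {
    name: re.compile(
        r"(?<![A-Za-z0-9-])" + re.escape(name) + r"(?![A-Za-z0-9-])",
        re.IGNORECASE,
    )
    for name in KNOWN_CRAWLERS
}


def identify_crawler(user_agent: str) -> Optional[str]:
    """Return the catalog name claimed by this user-agent, or None."""
    for name, pattern in _PATTERNS.items():
        if pattern.search(user_agent):
            return name
    return None
```

A match here is only a claim. It says what the request says it is, not who actually sent it.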

Step 02

Verify the identity

A reverse-DNS lookup confirms the crawler is who it claims to be. Spoofed user-agents fail rDNS and get scored as a bot, not as the crawler they impersonate.
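One common way to do this check is forward-confirmed reverse DNS: resolve the IP to a hostname, confirm the hostname sits under a domain the operator controls, then resolve that hostname back and confirm it returns the same IP. The sketch below assumes that approach. The function name and the injectable resolver parameters are illustrative, and the default resolvers handle IPv4 only.

```python
import socket
from typing import Callable, Iterable, List


def _reverse(ip: str) -> str:
    # PTR lookup: IP address -> hostname.
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname: str) -> List[str]:
    # A-record lookup: hostname -> list of IPv4 addresses.
    return socket.gethostbyname_ex(hostname)[2]


def verify_crawler_ip(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
) -> bool:
    """Forward-confirmed reverse DNS check for a claimed crawler IP."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False
    # The hostname must be the domain itself or a true subdomain of it,
    # so "evilgooglebot.com" does not pass for "googlebot.com".
    if not any(hostname == s or hostname.endswith("." + s) for s in allowed_suffixes):
        return False
    try:
        return ip in forward(hostname)
    except OSError:
        return False
```

The forward lookup matters because whoever controls an IP block also controls its PTR records. A reverse lookup alone can be made to return any hostname.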

Step 03

Apply your policy

Block, allow, throttle, or charge per crawl. Per-crawler policy. Granular path scoping. Honor opt-out headers automatically.
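Per-crawler, per-path policy can be pictured as an ordered rule list where the first matching rule wins. The sketch below assumes that model. `Rule`, `resolve_action`, and the glob syntax are hypothetical and do not describe the product's actual configuration format.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List


@dataclass(frozen=True)
class Rule:
    crawler: str    # crawler name, or "*" for any crawler
    path_glob: str  # shell-style glob, e.g. "/premium/*"
    action: str     # "block", "allow", "throttle", or "charge"


def resolve_action(rules: List[Rule], crawler: str, path: str, default: str = "allow") -> str:
    """Return the action of the first rule matching this crawler and path."""
    for rule in rules:
        if rule.crawler in ("*", crawler) and fnmatchcase(path, rule.path_glob):
            return rule.action
    return default


# Example policy: specific rules first, catch-alls last.
EXAMPLE_RULES = [
    Rule("GPTBot", "/premium/*", "block"),
    Rule("GPTBot", "/free/*", "allow"),
    Rule("Bytespider", "*", "block"),
    Rule("*", "/api/*", "throttle"),
]
```

First-match ordering keeps the outcome predictable: a specific rule placed above a catch-all always takes precedence.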

Catalog

Every major AI crawler, identified.

Five crawlers are shown per category. The full catalog has 22 known crawlers, with signature updates rolling out daily as new ones surface.

Training corpora (5 signals)
GPTBot (OpenAI)

OpenAI's official training crawler. Used for GPT-4 and successors. Honors robots.txt. rDNS *.openai.com.

ClaudeBot (Anthropic)

Anthropic's training crawler. Honors robots.txt. rDNS *.anthropic.com.

GoogleOther (Google)

Google's non-search crawler used for AI training and product research. Honors robots.txt. rDNS *.googlebot.com.

Bytespider (ByteDance)

ByteDance crawler. Aggressive, often ignores robots.txt. rDNS *.bytespider.com when honest.

FacebookBot (Meta)

Meta's training crawler for Llama models. rDNS *.facebook.com.

Search-and-answer (5 signals)
PerplexityBot (Perplexity)

Perplexity AI's live search crawler. Visits in real time when users ask questions. Often ignores robots.txt opt-out.

YouBot (You.com)

You.com's search crawler. Real-time, summarizes for users.

OAI-SearchBot (OpenAI)

OpenAI's ChatGPT search crawler. Separate from GPTBot — opt out independently.

PhindBot (Phind)

Phind's developer-focused search crawler.

Andibot (Andi)

Andi search assistant crawler.

Aggregators + tools (5 signals)
CCBot (Common Crawl)

Common Crawl. Open dataset used by most LLM training pipelines. Honors robots.txt strictly.

anthropic-ai (Anthropic)

Anthropic's non-training crawler, used for live tools and product research.

Cohere-ai (Cohere)

Cohere's training corpus crawler.

AI2Bot (AI2)

Allen Institute for AI's research crawler.

Amazonbot (Amazon)

Amazon's crawler. Used across Alexa, Bedrock, and internal AI.

Controls

What you can do.

Per-crawler dashboard

Requests, bytes, top paths, peak hours — per crawler, last 30 days. See exactly who's scraping you and how much.
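To make the per-crawler breakdown concrete, here is a minimal aggregation sketch. It assumes each hit has already been attributed to a crawler and reduced to a `(crawler, path, bytes_sent, hour)` tuple. That tuple shape and the `summarize_hits` name are illustrative assumptions.

```python
from collections import Counter, defaultdict


def summarize_hits(hits, top_n=3):
    """Aggregate (crawler, path, bytes_sent, hour) tuples per crawler."""
    totals = defaultdict(lambda: {"requests": 0, "bytes": 0})
    paths = defaultdict(Counter)
    hours = defaultdict(Counter)
    for crawler, path, size, hour in hits:
        totals[crawler]["requests"] += 1
        totals[crawler]["bytes"] += size
        paths[crawler][path] += 1
        hours[crawler][hour] += 1
    return {
        crawler: {
            "requests": t["requests"],
            "bytes": t["bytes"],
            # Most requested paths, busiest first.
            "top_paths": [p for p, _ in paths[crawler].most_common(top_n)],
            # Hour of day (0-23) with the most requests.
            "peak_hour": hours[crawler].most_common(1)[0][0],
        }
        for crawler, t in totals.items()
    }
```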

Per-crawler policy

Independent toggle per crawler. Allow GoogleOther but block GPTBot. Allow ClaudeBot but block Bytespider. Granular control.

Path-scoped blocks

Block ChatGPT from /premium/* but allow it on /free/*. Block training crawlers from articles but allow them on landing pages.
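Path scoping can also be expressed in an advisory form through robots.txt, which well-behaved crawlers read before fetching. The sketch below renders per-crawler `Disallow` groups. The `build_robots_txt` helper is hypothetical, and robots.txt rules are path prefixes, so `/premium/` covers everything beneath it.

```python
from typing import Dict, List


def build_robots_txt(disallow_by_agent: Dict[str, List[str]]) -> str:
    """Render one robots.txt group per user-agent token."""
    groups = []
    for agent, paths in disallow_by_agent.items():
        lines = [f"User-agent: {agent}"]
        lines += [f"Disallow: {path}" for path in paths]
        groups.append("\n".join(lines))
    # Groups are separated by a blank line; the file ends with a newline.
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(build_robots_txt({"GPTBot": ["/premium/"], "CCBot": ["/"]}))
```

robots.txt is a request, not an enforcement mechanism. Crawlers that ignore it still have to be blocked at the edge.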

Opt-out header emission

Automatically emits X-Robots-Tag headers carrying the noai and noimageai directives. Honors robots.txt cleanly. Compliant by default.
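A sketch of what the emitted opt-out headers might look like, assuming `noai` and `noimageai` are sent as values of `X-Robots-Tag` (they are non-standard directives that crawlers may or may not honor) and that the TDM reservation uses the `tdm-reservation` header from the W3C TDM Reservation Protocol. The `opt_out_headers` function and its flags are illustrative.

```python
from typing import Dict


def opt_out_headers(
    block_text: bool = True,
    block_images: bool = True,
    reserve_tdm: bool = True,
) -> Dict[str, str]:
    """Build response headers that signal an AI-training opt-out."""
    headers: Dict[str, str] = {}
    directives = []
    if block_text:
        directives.append("noai")
    if block_images:
        directives.append("noimageai")
    if directives:
        # Multiple directives share one comma-separated header value.
        headers["X-Robots-Tag"] = ", ".join(directives)
    if reserve_tdm:
        # "1" means text-and-data-mining rights are reserved.
        headers["tdm-reservation"] = "1"
    return headers
```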

Anti-spoof verification

rDNS check on every claimed crawler hit. UA spoofers get scored as a bot because of their fake claim, not as the crawler they impersonate.

Pay-per-crawl revenue

Charge crawlers per request. Configure price + currency. Crawlers either pay (Cloudflare-style) or get blocked. Revenue logs included.
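The pay-or-block decision can be sketched around HTTP 402 Payment Required: quote the price when the crawler has not offered enough, and serve the content when it has. Everything specific below is an assumption for illustration, including the `price_gate` name, the `X-Crawl-Price` and `X-Crawl-Charged` header names, and the idea that the crawler declares a maximum price it will pay.

```python
from decimal import Decimal
from typing import Dict, Optional, Tuple


def price_gate(
    price: str,
    currency: str,
    offered_max: Optional[str] = None,
) -> Tuple[int, Dict[str, str]]:
    """Return (HTTP status, headers) for a priced crawl request."""
    # Decimal avoids float rounding surprises on small per-request prices.
    if offered_max is not None and Decimal(offered_max) >= Decimal(price):
        # Hypothetical header confirming what was charged.
        return 200, {"X-Crawl-Charged": f"{currency} {price}"}
    # Hypothetical header quoting the configured price.
    return 402, {"X-Crawl-Price": f"{currency} {price}"}
```

The charge is the configured price rather than the crawler's maximum, so a crawler that offers more than the price is not overbilled.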

53% of internet traffic is automated.
How much of yours?

Most site owners have no idea. Find out in under 2 minutes — free.