CrawlGuard — Enterprise AI Bot Detection

22+: AI crawlers identified (major training corpora)

rDNS: Identity verified

Daily: Signature updates

Identification

Match the signature, verify the source.

User-agent identification is the first step. Reverse DNS then confirms the crawler is who it claims to be. Without rDNS, blocking would rest on string matching alone, and a user-agent string is trivial to spoof.

Step 01

Identify the crawler

The user-agent signature is matched against our maintained database of 22 known AI crawlers, updated as new ones appear. Partial user-agent strings do not produce false matches.
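A minimal sketch of token-level user-agent matching, assuming the signature database can be reduced to a list of product tokens. The list below and the function name identify_crawler are illustrative, not CrawlGuard's actual database or API.

```python
import re
from typing import Optional

# Hypothetical subset of a signature list; a real database would be larger.
KNOWN_CRAWLERS = ["GPTBot", "ClaudeBot", "GoogleOther", "Bytespider", "OAI-SearchBot", "CCBot"]

# Require the token to stand alone: it must not be preceded or followed by a
# letter, digit, or hyphen. "GPTBot/1.1" matches, "MyGPTBotClone" does not.
# Matching is case-sensitive here, which is an assumption of this sketch.
_PATTERNS = {
    name: re.compile(r"(?<![A-Za-z0-9-])" + re.escape(name) + r"(?![A-Za-z0-9-])")
    for name in KNOWN_CRAWLERS
}

def identify_crawler(user_agent: str) -> Optional[str]:
    """Return the first known crawler token found in the user-agent, else None."""
    for name, pattern in _PATTERNS.items():
        if pattern.search(user_agent):
            return name
    return None
```

Anchoring on token boundaries rather than plain substring search is what keeps a longer, unrelated product name from matching a shorter signature.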

Step 02

Verify the identity

A reverse-DNS lookup confirms the crawler is who it claims to be. Spoofed user-agents fail rDNS and get scored as a generic bot, not as the crawler they claim to be.
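The check can be sketched as forward-confirmed reverse DNS: resolve the IP to a hostname, check the hostname suffix, then resolve the hostname back and confirm it includes the original IP. The suffix table and the name verify_crawler are assumptions for illustration; the one suffix shown is the one this page lists for GoogleOther, and real values should come from each operator's documentation.

```python
import socket

# Hypothetical suffix table; only the GoogleOther entry from the catalog is shown.
ALLOWED_SUFFIXES = {"GoogleOther": (".googlebot.com",)}

def reverse_lookup(ip):
    # PTR lookup: IP address to primary hostname.
    return socket.gethostbyaddr(ip)[0]

def forward_lookup(host):
    # Forward lookup: hostname to the set of addresses it resolves to.
    return {info[4][0] for info in socket.getaddrinfo(host, None)}

def verify_crawler(name, ip, reverse=reverse_lookup, forward=forward_lookup):
    """Return True only if rDNS matches an allowed suffix AND resolves back to ip."""
    suffixes = ALLOWED_SUFFIXES.get(name)
    if not suffixes:
        return False
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(suffixes):
            return False
        # The forward step matters: anyone can set a PTR record on their own
        # IP space, but they cannot make the real domain resolve to their IP.
        return ip in forward(host)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False
```

The resolver functions are passed as parameters so the logic can be exercised with stub lookups instead of live DNS.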

Step 03

Apply your policy

Block, allow, throttle, or charge per crawl. Per-crawler policy. Granular path scoping. Honor opt-out headers automatically.

Catalog

Every major AI crawler, identified.

Five crawlers are shown per category. The full catalog has 22 known crawlers, with signature updates rolling out daily as new ones surface.

Training corpora (5 signals)

GPTBot (OpenAI)

OpenAI's official training crawler. Used for GPT-4 and successors. Honors robots.txt. rDNS *.openai.com.

ClaudeBot (Anthropic)

Anthropic's training crawler. Honors robots.txt. rDNS *.anthropic.com.

GoogleOther (Google)

Google's non-search crawler used for AI training and product research. Honors robots.txt. rDNS *.googlebot.com.

Bytespider (ByteDance)

ByteDance crawler. Aggressive, often ignores robots.txt. rDNS *.bytespider.com when honest.

FacebookBot (Meta)

Meta's training crawler for Llama models. rDNS *.facebook.com.

Search-and-answer (5 signals)

PerplexityBot (Perplexity)

Perplexity AI's live search crawler. Visits in real time when users ask questions. Often ignores robots.txt opt-outs.

YouBot (You.com)

You.com's search crawler. Crawls in real time and summarizes pages for users.

OAI-SearchBot (OpenAI)

OpenAI's ChatGPT search crawler. Separate from GPTBot, so you opt out of each independently.

PhindBot (Phind)

Phind's developer-focused search crawler.

Andibot (Andi)

Andi search assistant crawler.

Aggregators + tools (5 signals)

CCBot (Common Crawl)

Common Crawl. Open dataset used by most LLM training pipelines. Honors robots.txt strictly.

anthropic-ai (Anthropic)

Anthropic's non-training crawler, used for live tools and product research.

Cohere-ai (Cohere)

Cohere's training corpus crawler.

AI2Bot (AI2)

Allen Institute for AI's research crawler.

Amazonbot (Amazon)

Amazon's crawler. Used across Alexa, Bedrock, and internal AI.

Controls

What you can do.

Per-crawler dashboard

Requests, bytes, top paths, and peak hours for each crawler over the last 30 days. See exactly who's scraping you and how much.
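As a rough illustration of how such per-crawler figures could be derived from access logs, here is a small aggregation sketch. The record layout (crawler, path, bytes sent, hour of day) and the function name summarize are assumptions, not the product's actual schema.

```python
from collections import Counter, defaultdict

def summarize(records):
    """Aggregate (crawler, path, bytes_sent, hour) tuples into per-crawler stats."""
    stats = defaultdict(
        lambda: {"requests": 0, "bytes": 0, "paths": Counter(), "hours": Counter()}
    )
    for crawler, path, nbytes, hour in records:
        s = stats[crawler]
        s["requests"] += 1
        s["bytes"] += nbytes
        s["paths"][path] += 1
        s["hours"][hour] += 1
    # Reduce the raw counters to the figures a dashboard row would display.
    return {
        name: {
            "requests": s["requests"],
            "bytes": s["bytes"],
            "top_paths": [p for p, _ in s["paths"].most_common(3)],
            "peak_hour": s["hours"].most_common(1)[0][0],
        }
        for name, s in stats.items()
    }
```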

Per-crawler policy

Independent toggle per crawler. Allow GoogleOther but block GPTBot. Allow ClaudeBot but block Bytespider. Granular control.

Path-scoped blocks

Block ChatGPT from /premium/* but allow it on /free/*. Block training crawlers from articles but allow them on landing pages.
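Per-crawler and path-scoped rules can be pictured as an ordered rule list where the first match wins. This is a sketch under assumptions: the rule format, the action names, and the use of GPTBot as the example crawler are illustrative only.

```python
from fnmatch import fnmatchcase

# Hypothetical rule format: (crawler, path glob, action). First match wins.
RULES = [
    ("GPTBot", "/premium/*", "block"),
    ("GPTBot", "/free/*", "allow"),
    ("Bytespider", "*", "block"),
    ("GoogleOther", "*", "allow"),
]
DEFAULT_ACTION = "allow"  # assumed fallback when no rule applies

def decide(crawler, path, rules=RULES, default=DEFAULT_ACTION):
    """Return the action of the first rule matching this crawler and path."""
    for rule_crawler, pattern, action in rules:
        # fnmatchcase is case-sensitive, and "*" also matches "/" characters,
        # so "/premium/*" covers nested paths such as "/premium/2024/report".
        if rule_crawler == crawler and fnmatchcase(path, pattern):
            return action
    return default
```

Ordering specific path rules ahead of catch-all rules keeps a broad "*" entry from shadowing a narrower one.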

Opt-out header emission

Automatically emits X-Robots-Tag headers with the noai and noimageai directives. Honors robots.txt cleanly. Compliant by default.
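A minimal sketch of building those response headers, assuming the directives are sent as a comma-separated X-Robots-Tag value. The function name and flags are hypothetical. Note that noai and noimageai are conventions rather than a formal standard, so crawler support varies.

```python
def opt_out_headers(block_text=True, block_images=True):
    """Build response headers carrying AI opt-out directives."""
    directives = []
    if block_text:
        directives.append("noai")
    if block_images:
        directives.append("noimageai")
    # Emit no header at all when nothing is opted out.
    return {"X-Robots-Tag": ", ".join(directives)} if directives else {}
```

The returned mapping can be merged into whatever header structure the web framework in use expects.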

Anti-spoof verification

An rDNS check runs on every hit that claims to be a known crawler. User-agent spoofers are scored as generic bots because their claim fails verification, not as the crawler they impersonate.

Pay-per-crawl revenue

Charge crawlers per request. Configure the price and currency. Crawlers either pay (Cloudflare-style) or get blocked. Revenue logs included.
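One way to picture the pay-or-block decision is an HTTP 402 Payment Required response for crawlers that have not offered enough. Everything specific below is an assumption for illustration: the price table, the X-Crawl-Price and X-Crawl-Charged header names, and the function charge_or_block are hypothetical and not a documented protocol.

```python
from decimal import Decimal

# Hypothetical per-request prices in USD, keyed by crawler name.
PRICES = {"GPTBot": Decimal("0.002")}

def charge_or_block(crawler, offered, prices=PRICES):
    """Return (status_code, headers) for a verified crawler request.

    offered is the amount the crawler states it will pay, as a string,
    or None if it made no offer.
    """
    price = prices.get(crawler)
    if price is None:
        return 200, {}  # no price configured: serve normally
    if offered is not None and Decimal(offered) >= price:
        return 200, {"X-Crawl-Charged": f"USD {price}"}
    # 402 Payment Required tells the crawler what a request would cost.
    return 402, {"X-Crawl-Price": f"USD {price}"}
```

Decimal is used instead of float so that small per-request prices add up exactly in revenue logs.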

22+ AI bots. One policy.

22+ AI Crawlers

Reverse DNS Verify

robots.txt + ai.txt

TDM Headers

Compliance Tracking

Selective Blocking

Traffic Analytics

Content Value

53% of internet traffic is automated.
How much of yours?
