Match the signature, verify the source.
User-agent identification is the first step. Reverse DNS confirms the crawler is who it claims to be — without rDNS we'd be blocking based on string matching alone.
Identify the crawler
User-agent signatures are matched against our maintained database of 22 known AI crawlers, updated as new ones appear. No false matches on partial UAs.
Verify the identity
A reverse-DNS lookup confirms the crawler is who it claims to be. Spoofed user-agents fail rDNS and get scored as a bot, not as the crawler they claim to be.
Apply your policy
Block, allow, throttle, or charge per crawl. Per-crawler policy. Granular path scoping. Honor opt-out headers automatically.
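The identify-then-verify steps above can be sketched roughly as follows. This is a minimal illustration, not the product's implementation: the crawler names, rDNS suffixes, and the `SIGNATURES` table are hypothetical, and the verification step assumes forward-confirmed reverse DNS (PTR lookup, suffix check, then a forward lookup that must return the original IP).

```python
import re
import socket
from typing import Callable, Optional, Set

# Hypothetical signature table: crawler name -> (UA token, allowed rDNS suffixes).
# These entries are placeholders, not the real 22-crawler database.
SIGNATURES = {
    "ExampleBot": ("ExampleBot", (".crawl.example.com",)),
    "SampleSpider": ("SampleSpider", (".bots.example.net",)),
}


def identify(user_agent: str) -> Optional[str]:
    """Return the crawler whose token appears as a whole word in the UA."""
    for name, (token, _suffixes) in SIGNATURES.items():
        # Lookarounds reject partial matches such as "NotExampleBotty".
        pattern = rf"(?<![A-Za-z0-9]){re.escape(token)}(?![A-Za-z0-9])"
        if re.search(pattern, user_agent):
            return name
    return None


def reverse_lookup(ip: str) -> Optional[str]:
    """PTR lookup; returns None when the IP has no reverse record."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(host: str) -> Set[str]:
    """Resolve a hostname to the set of addresses it points at."""
    try:
        return {info[4][0] for info in socket.getaddrinfo(host, None)}
    except OSError:
        return set()


def verify(
    ip: str,
    name: str,
    reverse: Callable[[str], Optional[str]] = reverse_lookup,
    forward: Callable[[str], Set[str]] = forward_lookup,
) -> bool:
    """Forward-confirmed rDNS: the PTR name must match and resolve back to ip."""
    host = reverse(ip)
    if host is None:
        return False
    host = host.rstrip(".").lower()
    if not host.endswith(SIGNATURES[name][1]):
        return False
    return ip in forward(host)


def classify(user_agent: str, ip: str, reverse=reverse_lookup, forward=forward_lookup) -> str:
    """Label a request as a verified crawler, a spoofing "bot", or "unknown"."""
    name = identify(user_agent)
    if name is None:
        return "unknown"
    return name if verify(ip, name, reverse, forward) else "bot"
```

The resolvers are passed in as parameters so the logic can be exercised without network access. A per-crawler policy lookup would then key off the label that `classify` returns.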
Every major AI crawler, identified.
Five shown per category. The full catalog has 22 known crawlers, with signature updates rolling out daily as new ones surface.
OpenAI's official training crawler. Used for GPT-4 and successors. Honors robots.txt. rDNS *.openai.com.
Anthropic's training crawler. Honors robots.txt. rDNS *.anthropic.com.
Google's non-search crawler used for AI training and product research. Honors robots.txt. rDNS *.googlebot.com.
ByteDance crawler. Aggressive, often ignores robots.txt. rDNS *.bytespider.com when honest.
Meta's training crawler for Llama models. rDNS *.facebook.com.
Perplexity AI's live search crawler. Visits in real time when users ask questions. Often ignores robots.txt opt-out.
You.com's search crawler. Real-time, summarizes for users.
OpenAI's ChatGPT search crawler. Separate from GPTBot — opt out independently.
Phind's developer-focused search crawler.
Andi search assistant crawler.
Common Crawl. Open dataset used by most LLM training pipelines. Honors robots.txt strictly.
Anthropic non-training crawler used for live tools and product research.
Cohere's training corpus crawler.
Allen Institute for AI's research crawler.
Amazon's crawler. Used across Alexa, Bedrock, internal AI.
What you can do.
Per-crawler dashboard
Requests, bytes, top paths, peak hours — per crawler, last 30 days. See exactly who's scraping you and how much.
Per-crawler policy
Independent toggle per crawler. Allow GoogleOther but block GPTBot. Allow ClaudeBot but block Bytespider. Granular control.
Path-scoped blocks
Block ChatGPT on /premium/* but allow it on /free/*. Block training crawlers from articles but allow them on landing pages.
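One way to picture path scoping is an ordered rule list where the first matching rule wins. The sketch below is an assumption about how such rules could be evaluated, not the product's rule engine; `RULES`, `DEFAULT_ACTION`, and `action_for` are hypothetical names, and the glob matching uses Python's standard `fnmatch` module.

```python
from fnmatch import fnmatchcase

# Hypothetical rules: (crawler, path glob, action). First match wins.
RULES = [
    ("ChatGPT", "/premium/*", "block"),
    ("ChatGPT", "/free/*", "allow"),
]
DEFAULT_ACTION = "allow"  # assumed fallback when no rule applies


def action_for(crawler: str, path: str) -> str:
    """Return the action of the first rule matching this crawler and path."""
    for rule_crawler, pattern, action in RULES:
        # fnmatch's "*" also matches "/", so "/premium/*" covers nested paths.
        if rule_crawler == crawler and fnmatchcase(path, pattern):
            return action
    return DEFAULT_ACTION
```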
Opt-out header emission
Automatically emits X-Robots-Tag headers carrying the noai and noimageai directives. Honors robots.txt cleanly. Compliant by default.
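As a rough illustration of header emission, the helper below adds `noai` and `noimageai` to an `X-Robots-Tag` response header while keeping any directives already present. The function name and its flags are hypothetical. Note that `noai` and `noimageai` are informal directives that crawlers honor voluntarily.

```python
from typing import Dict


def with_opt_out(headers: Dict[str, str], no_ai: bool = True, no_image_ai: bool = True) -> Dict[str, str]:
    """Return a copy of the response headers with AI opt-out directives added."""
    directives = []
    if no_ai:
        directives.append("noai")
    if no_image_ai:
        directives.append("noimageai")

    out = dict(headers)  # leave the caller's dict untouched
    if directives:
        value = ", ".join(directives)
        existing = out.get("X-Robots-Tag")
        # Append to existing directives (for example "noindex") instead of replacing them.
        out["X-Robots-Tag"] = f"{existing}, {value}" if existing else value
    return out
```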
Anti-spoof verification
rDNS check on every claimed crawler hit. UA spoofers get scored as a bot because of their fake claim, not as the crawler they impersonate.
Pay-per-crawl revenue
Charge crawlers per request. Configure price + currency. Crawlers either pay (Cloudflare-style) or get blocked. Revenue logs included.
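A pay-or-block exchange could look something like the sketch below, which answers with HTTP 402 Payment Required until the crawler offers at least the configured price. Everything here is an assumption for illustration: the `X-Crawl-*` header names are invented, the price and currency are placeholders, and real billing and settlement are out of scope.

```python
from decimal import Decimal
from typing import Dict, Tuple

PRICE = Decimal("0.01")  # hypothetical per-request price
CURRENCY = "USD"         # hypothetical configured currency


def handle(request_headers: Dict[str, str]) -> Tuple[int, Dict[str, str]]:
    """Return (status, response headers) for a verified crawler request."""
    quote = f"{CURRENCY} {PRICE}"
    offered = request_headers.get("X-Crawl-Max-Price")  # invented header name
    if offered is None:
        # No offer: quote the price and refuse the content.
        return 402, {"X-Crawl-Price": quote}
    try:
        currency, amount_text = offered.split()
        amount = Decimal(amount_text)
    except (ValueError, ArithmeticError):
        # Malformed offer: treat it the same as no offer.
        return 402, {"X-Crawl-Price": quote}
    if currency != CURRENCY or amount < PRICE:
        return 402, {"X-Crawl-Price": quote}
    # Offer covers the price: serve the page and record the charge.
    return 200, {"X-Crawl-Charged": quote}
```

Using `Decimal` rather than floats keeps price comparisons exact, which matters once charges are summed into revenue logs.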