AI Analytics | By Kevin O'Connell | 10 min read | June 26, 2026

AI Crawlers in 2026: Which Bots to Allow and Which to Block

We scanned 153 B2B SaaS sites: only 1.4% block an AI search crawler in robots.txt. The real gatekeeper is the edge, where 45% sit behind Cloudflare's default AI block. The full crawler reference, copy-paste robots.txt templates, and how to fix the Cloudflare block.

Almost no B2B SaaS company blocks AI crawlers in its robots.txt. We scanned 153 of them to be sure, and the result was lopsided: only a tiny fraction turn away the crawlers that ChatGPT, Perplexity, and Claude use to find pages. So if your brand is missing from AI answers, your robots.txt is almost certainly not the reason. The block is real, but it is usually happening one layer up, where most checkers never look.

This guide is the 2026 reference for AI crawler access: every major bot and whether to allow or block it, copy-paste robots.txt templates, and the edge-layer trap (Cloudflare) that quietly overrules a perfect robots.txt. It pairs with our free AI Bot Access Checker.

  • In our June 2026 scan, only 1.4% of 148 readable B2B SaaS robots.txt files blocked an AI search crawler. robots.txt is rarely the problem.
  • The real gatekeeper is the edge. 45% of the sites sit behind Cloudflare, which has defaulted new domains to blocking AI crawlers since July 2025, regardless of robots.txt.
  • Allow the retrieval crawlers that earn citations (OAI-SearchBot, Claude-SearchBot, PerplexityBot). Decide on training crawlers (GPTBot, ClaudeBot, CCBot) based on your content stance. Then confirm your edge agrees.

We scanned 153 B2B SaaS sites. Almost none block AI search.

To find out whether robots.txt is actually where AI visibility goes to die, we pulled the robots.txt of 153 recognizable B2B SaaS companies across CRM, marketing, dev tooling, analytics, security, and support, then classified how each one treats 17 known AI crawlers using the same parser that powers our public checker. 148 returned a readable file.
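
The classification step can be sketched in a few lines. This is not the parser behind our checker; it is a minimal illustration that swaps in Python's standard-library `urllib.robotparser`. The crawler list is a three-bot subset, and the two sample sites are made up.

```python
from urllib.robotparser import RobotFileParser

# Assumption: an illustrative subset of the AI search (retrieval) crawlers.
SEARCH_CRAWLERS = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"]


def blocked_crawlers(robots_txt: str, crawlers: list[str]) -> list[str]:
    """Return the crawlers that may not fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in crawlers if not parser.can_fetch(bot, "/")]


def share_blocking(files: dict[str, str], crawlers: list[str]) -> float:
    """Fraction of readable robots.txt files that block at least one crawler."""
    if not files:
        return 0.0
    blocking = sum(1 for text in files.values() if blocked_crawlers(text, crawlers))
    return blocking / len(files)


# Hypothetical sample: two made-up sites, one of which blocks a search crawler.
SAMPLE = {
    "open.example": "User-agent: *\nAllow: /\n",
    "strict.example": "User-agent: OAI-SearchBot\nDisallow: /\n",
}
```

Calling `share_blocking(SAMPLE, SEARCH_CRAWLERS)` divides the number of blocking files by the number of readable files, the same base the 1.4% figure uses (148, not 153).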

AI Bot Blocking Index
153 B2B SaaS sites scanned, June 2026. 148 returned a readable robots.txt.

  • 1.4% block an AI search crawler in robots.txt (2 of 148 sites)
  • 2.0% block GPTBot, OpenAI's training crawler
  • 45% sit behind Cloudflare, exposed to its default AI block

The pattern is clear. Blocking is rare, and where it happens it is deliberate and narrow. Only three sites blocked GPTBot (figma.com, mutiny.com, loom.com), and of those, mutiny and loom still allow the retrieval crawlers, the textbook "do not train on me, but do cite me" stance. Figma is the single site in the sample that blocks OpenAI's retrieval crawler outright. One site in 148 blocks ChatGPT Search. That is the headline: for this market, robots.txt is not a wall, it is an open door.

Only one of 148 B2B SaaS sites blocks ChatGPT Search in robots.txt. If you are missing from AI answers, the file you keep editing is almost never the cause.

Note the honest limits. This is a curated sample of well-known brands, not a random draw, and "served via Cloudflare" is measured from response headers, not a per-site test of whether each one is actively turning bots away. But the direction is unambiguous, and it reframes the entire question. The interesting story is not the 1.4% who block in robots.txt. It is the 45% sitting behind an edge that can block for them, silently.

Why your robots.txt is not the problem: the two-layer model

AI crawler access is decided at two layers, and they are easy to confuse because only one of them is visible to you as a text file you can edit.

Layer 1 — robots.txt (a request)
A text file that politely asks crawlers what they may fetch. Well-behaved AI crawlers read it and obey. It does not, and cannot, force anything. For the sites we scanned, this layer is almost always wide open to AI.
Layer 2 — the edge / CDN (enforcement)
Cloudflare, Akamai, Fastly, DataDome, and other edge providers sit in front of your origin and decide which requests get through at all. This is the bouncer. Since July 2025, Cloudflare's default for new domains is to turn AI crawlers away here, before they ever reach Layer 1.

Robots.txt is a request. It is the polite sign on the door that says "staff only past this point." Honest crawlers honor it. But it has no power to stop anything, and it is the layer almost everyone fixates on because it is the one you can see and change. The edge is enforcement. It is the bouncer who decides whether the visitor gets in the building at all, and it runs before your robots.txt is ever read. When the two disagree, the edge wins every time.

This is why a site can have a flawless robots.txt and still earn zero AI citations. The origin says "allow OAI-SearchBot." The edge says "block unknown bots." OpenAI's crawler is turned away at the door, your page is never fetched, and nothing in your robots.txt can override it. The fix is never in the file you keep editing. It is in the layer above it.
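
The ordering of the two layers can be written down as a tiny decision function. This is a conceptual sketch only: `AccessResult` and `crawler_access` are hypothetical names, and the two booleans stand in for whatever your edge and your robots.txt actually decide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessResult:
    reachable: bool   # did the request get past the edge?
    fetched: bool     # did a well-behaved crawler end up fetching the page?
    decided_by: str   # "edge" or "robots.txt"


def crawler_access(edge_allows: bool, robots_allows: bool) -> AccessResult:
    # Layer 2 runs first: a request stopped at the edge never reaches the origin.
    if not edge_allows:
        return AccessResult(reachable=False, fetched=False, decided_by="edge")
    # Layer 1 is only a request, but well-behaved crawlers honor it.
    return AccessResult(reachable=True, fetched=robots_allows, decided_by="robots.txt")
```

With `edge_allows=False`, the value of `robots_allows` never enters the result. That is the whole point of the model.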

The real gatekeeper: Cloudflare blocks AI crawlers by default

On July 1, 2025, in an announcement it called Content Independence Day, Cloudflare changed the default for new domains to block AI crawlers unless they pay to access content. The intent was to give publishers leverage over AI companies. The side effect, for a B2B marketer trying to get cited, is that signing up for Cloudflare can switch off your AI visibility without anyone touching robots.txt.

In our scan, 45% of sites are served through Cloudflare. We did not test each one's bot settings, so we cannot say how many actively block, but every one of them is a site where the AI-crawler decision is being made at the edge, not in the file their marketing team can see. And the effect is not theoretical: three sites in our sample (box.com, gusto.com, otterly.ai) returned an outright block to our own checker's request, a small live demonstration of an edge turning an unfamiliar visitor away before robots.txt is ever consulted.
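
Detecting Cloudflare from response headers can be approximated as below. Which headers our scan inspected is not spelled out here, so treat this as an assumption: the sketch relies on the `Server: cloudflare` and `CF-RAY` headers that Cloudflare normally adds to responses. The function name is hypothetical.

```python
def served_via_cloudflare(headers: dict[str, str]) -> bool:
    """Heuristic: does this response look like it came through Cloudflare?"""
    # Header names are case-insensitive, so normalize before comparing.
    normalized = {name.lower(): value.lower() for name, value in headers.items()}
    server_is_cloudflare = normalized.get("server", "").strip() == "cloudflare"
    return server_is_cloudflare or "cf-ray" in normalized
```

A positive result only says the site is behind Cloudflare. It says nothing about whether the AI-crawler block is switched on, which is the same limit noted for the 45% figure.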

Your robots.txt says allow OAI-SearchBot. Your Cloudflare default says block unknown bots. Cloudflare wins, OpenAI never reaches you, and your AI visibility quietly dies.

The same dynamic exists at Akamai, Fastly, Imperva, and DataDome. Each treats unfamiliar automated traffic as suspicious by default. None of them reads your robots.txt before deciding. So the first question for any site missing from AI answers is not "what does my robots.txt say," it is "what does my edge do with an AI crawler it does not recognize." We will check that directly in a moment.

Every AI crawler in 2026: which to allow, which to block

There are three kinds of AI crawler, and the right default is different for each. Retrieval crawlers fetch pages so an AI can cite them in a live answer, which is how you earn visibility and traffic, so you allow them. User-initiated crawlers fire only when a real person asks an AI to open your link, so you allow them too. Training crawlers collect content to train models with no attribution back, so blocking them is a content-licensing decision, not a visibility one.

Crawler | Operator | What it does

Retrieval / search (allow: these earn your AI citations)
OAI-SearchBot | OpenAI | Indexes pages for ChatGPT Search answers
Claude-SearchBot | Anthropic | Indexes pages for Claude's web answers
PerplexityBot | Perplexity | Indexes pages for Perplexity answers
Applebot | Apple | Powers Siri and Apple Intelligence search
DuckAssistBot | DuckDuckGo | Powers DuckAssist summaries

User-initiated (allow: a real person asked the AI to fetch the page)
ChatGPT-User | OpenAI | Fires when a ChatGPT user opens your link
Claude-User | Anthropic | Fires when a Claude user opens your link
Perplexity-User | Perplexity | Fires when a Perplexity user opens your link

Ads validation (allow if you run, or may run, ChatGPT Ads)
OAI-AdsBot | OpenAI | Validates ChatGPT Ads landing pages before campaigns run

Training (your call: block to opt out of model training)
GPTBot | OpenAI | Collects content to train OpenAI foundation models
ClaudeBot | Anthropic | Collects content to train Anthropic models
Google-Extended | Google | Opt-out token for Gemini model training
Meta-ExternalAgent | Meta | Collects content to train Meta AI
Bytespider | ByteDance | Collects content to train ByteDance models
CCBot | Common Crawl | Open dataset used by many model trainers
cohere-ai | Cohere | Collects content to train Cohere models
Applebot-Extended | Apple | Opt-out token for Apple model training

The single most expensive mistake is blocking a retrieval crawler while trying to block training. OAI-SearchBot sounds like it might gather training data; it does not. It is the crawler that lets ChatGPT cite you. The training crawler you actually meant to block is GPTBot. The two are independent: blocking GPTBot opts you out of OpenAI training and leaves your ChatGPT citations untouched, while blocking OAI-SearchBot removes you from ChatGPT answers and does nothing about training. For the deeper version of that specific decision, see our GPTBot vs OAI-SearchBot breakdown.
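
The independence of the two decisions is easy to see once the categories are written as data. The sketch below copies the categories from the table above (ads validation left out for brevity). `Kind`, `CRAWLER_KIND`, and `should_allow` are hypothetical names for illustration, not part of any tool.

```python
from enum import Enum


class Kind(Enum):
    RETRIEVAL = "retrieval"
    USER_INITIATED = "user-initiated"
    TRAINING = "training"


# Categories as listed in the crawler table above.
CRAWLER_KIND = {
    "OAI-SearchBot": Kind.RETRIEVAL,
    "Claude-SearchBot": Kind.RETRIEVAL,
    "PerplexityBot": Kind.RETRIEVAL,
    "Applebot": Kind.RETRIEVAL,
    "DuckAssistBot": Kind.RETRIEVAL,
    "ChatGPT-User": Kind.USER_INITIATED,
    "Claude-User": Kind.USER_INITIATED,
    "Perplexity-User": Kind.USER_INITIATED,
    "GPTBot": Kind.TRAINING,
    "ClaudeBot": Kind.TRAINING,
    "Google-Extended": Kind.TRAINING,
    "Meta-ExternalAgent": Kind.TRAINING,
    "Bytespider": Kind.TRAINING,
    "CCBot": Kind.TRAINING,
    "cohere-ai": Kind.TRAINING,
    "Applebot-Extended": Kind.TRAINING,
}


def should_allow(crawler: str, block_training: bool) -> bool:
    """Allow retrieval and user-initiated crawlers; training follows your stance.

    Raises KeyError for crawlers that are not in the table.
    """
    if CRAWLER_KIND[crawler] is Kind.TRAINING:
        return not block_training
    return True
```

Flipping `block_training` changes the answer for GPTBot and leaves OAI-SearchBot untouched.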

Copy-paste robots.txt templates for 3 scenarios

Pick the block that matches your stance, paste it at your site root, and confirm it serves at https://yourdomain.com/robots.txt with an HTTP 200. Group each crawler under its own user-agent line rather than relying on a wildcard, since the operators control each bot independently.

Scenario 1: Maximum AI visibility (recommended for most B2B)

Allow every retrieval and user-initiated crawler. This is the right default for most SaaS, services, and content sites that want to be cited everywhere AI answers appear.

robots.txt (maximum AI visibility)
# Allow AI retrieval + user-initiated crawlers (earn citations)
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: Claude-SearchBot
User-agent: Claude-User
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Applebot
User-agent: DuckAssistBot
Allow: /

# Everyone else: normal crawl
User-agent: *
Allow: /

Scenario 2: Cited but not trained on (privacy-conscious B2B)

Allow retrieval, block training. You stay eligible for AI citations while opting out of the model-training pipelines that give nothing back.

robots.txt (allow retrieval, block training)
# Allow retrieval crawlers (for AI citations)
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
User-agent: Applebot
Allow: /

# Block training crawlers (opt out of model training)
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: CCBot
User-agent: Bytespider
User-agent: Meta-ExternalAgent
User-agent: Applebot-Extended
User-agent: cohere-ai
Disallow: /

Scenario 3: Block all AI (paywalled, regulated, licensed content)

Block training and retrieval. You exit AI answers entirely. Fit for paywalled publishers and regulated industries where every form of AI access is restricted.

robots.txt (block all AI)
# Block all major AI crawlers, training and retrieval
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ClaudeBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
User-agent: Google-Extended
User-agent: CCBot
User-agent: Bytespider
User-agent: Meta-ExternalAgent
User-agent: Applebot-Extended
User-agent: cohere-ai
Disallow: /
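
All three templates follow one pattern: a group of user-agent lines followed by a single rule. If you maintain the file from a script, a minimal generator might look like this. The function names and comment text are hypothetical, and the output is only as good as the lists you pass in.

```python
def render_group(comment: str, user_agents: list[str], rule: str) -> str:
    """Render one robots.txt group: a comment, its user-agent lines, one rule."""
    lines = [f"# {comment}"]
    lines += [f"User-agent: {agent}" for agent in user_agents]
    lines.append(rule)
    return "\n".join(lines)


def render_robots(allow: list[str], block: list[str]) -> str:
    """Build a robots.txt body with one allow group and one block group."""
    groups = []
    if allow:
        groups.append(render_group("Allowed AI crawlers", allow, "Allow: /"))
    if block:
        groups.append(render_group("Blocked AI crawlers", block, "Disallow: /"))
    return "\n\n".join(groups) + "\n"
```

Passing the retrieval crawlers as `allow` and the training crawlers as `block` reproduces the shape of Scenario 2.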

One trap to avoid: an empty Disallow: line with no path means allow everything, not block everything. A crawler is only blocked from your whole site when its group contains a bare Disallow: /. Mixing these up is the most common robots.txt error, and it is why a checker that parses the file correctly matters more than eyeballing it.
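
The difference is easy to confirm with a parser instead of by eye. The sketch below uses Python's standard-library `urllib.robotparser`; `is_allowed` is a hypothetical helper, and the two robots.txt snippets are made up for the comparison.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, path: str = "/") -> bool:
    """Parse robots.txt text and ask whether user_agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# An empty Disallow places no restriction on the crawler.
EMPTY_DISALLOW = "User-agent: GPTBot\nDisallow:\n"
# A bare "Disallow: /" blocks the whole site for that crawler.
ROOT_DISALLOW = "User-agent: GPTBot\nDisallow: /\n"

print(is_allowed(EMPTY_DISALLOW, "GPTBot"))        # True
print(is_allowed(ROOT_DISALLOW, "GPTBot"))         # False
print(is_allowed(ROOT_DISALLOW, "OAI-SearchBot"))  # True: the group names only GPTBot
```

One caveat about this particular parser: it matches user-agent tokens by substring and takes the first matching group, so closely related tokens such as Applebot and Applebot-Extended may not resolve the way a crawler's own parser would. Treat it as a quick sanity check, not a verdict.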

How to check and fix Cloudflare's AI-crawler block

Since the edge is where the real block usually lives, this is the highest-leverage thing to check. In Cloudflare, open the dashboard and go to Security, then Events. Filter the user-agent field for the retrieval crawlers: OAI-SearchBot, PerplexityBot, and Claude-SearchBot. If the action column shows block or managed challenge for any of them, the default AI-crawler rule is turning citations away.

Cloudflare exposes a direct control under Security, then Settings (in some plans, Bots), where AI crawler blocking can be toggled. If you want AI citations, that control should permit the retrieval crawlers. For finer-grained control, add a WAF custom rule that lets the crawlers you want skip the bot checks:

Cloudflare WAF custom rule (expression)
# Let AI retrieval crawlers through the bot checks
(http.user_agent contains "OAI-SearchBot") or
(http.user_agent contains "Claude-SearchBot") or
(http.user_agent contains "PerplexityBot") or
(http.user_agent contains "Applebot")

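Action: Skip
Skip:
  - All Super Bot Fight Mode rules
  - All remaining custom rules

Note: Cloudflare documents that the free-plan Bot Fight Mode cannot be skipped by a custom rule. If that feature is what blocks these crawlers, it has to be turned off in the bot settings instead.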

The principle generalizes to every edge provider: your CDN and firewall rules must mirror your robots.txt intent, or robots.txt is silently overruled. Allowlist the same retrieval crawlers at Akamai, Fastly, Imperva, or DataDome if you run them. Verify legitimacy by checking a sample of crawler IPs against the operator's published IP ranges (or by reverse DNS lookup where the operator documents one), so a spoofed user-agent does not earn a free pass.
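
Checking an IP against published ranges takes only the standard library. In the sketch below, `ip_in_published_ranges` is a hypothetical helper and `EXAMPLE_RANGES` holds reserved documentation addresses, not any operator's real ranges. Substitute the CIDR list each operator actually publishes.

```python
import ipaddress


def ip_in_published_ranges(ip: str, cidr_ranges: list[str]) -> bool:
    """Return True if ip falls inside any of the given CIDR ranges."""
    address = ipaddress.ip_address(ip)
    networks = (ipaddress.ip_network(cidr) for cidr in cidr_ranges)
    # An IPv4 address is never "in" an IPv6 network (and vice versa).
    return any(address in network for network in networks)


# Placeholder ranges from the reserved documentation blocks, NOT real crawler ranges.
EXAMPLE_RANGES = ["192.0.2.0/24", "2001:db8::/32"]
```

A request that claims to be OAI-SearchBot but arrives from an address outside the operator's list is a spoofed user-agent and should not inherit the allowlist.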

Not sure whether AI crawlers can actually reach your site? The free AI Bot Access Checker tests your live robots.txt against 17 AI crawlers and flags Cloudflare at the edge. Takes 60 seconds, no signup.

How to confirm AI crawlers can actually reach you

Editing robots.txt and edge rules is only half the job. The other half is confirming the change worked from the outside, the way a crawler sees it. Three checks, in order of effort:

  • Run the checker. Our AI Bot Access Checker tests your live URL against all 17 crawlers and detects whether you are served through Cloudflare, so you see both layers at once.
  • Read your server logs. Look for retrieval user-agents (OAI-SearchBot, PerplexityBot, Claude-SearchBot). If you see Googlebot but never these, the edge is filtering them before they reach your origin. Tracking this over time is covered in our bot activity guide.
  • Watch your citations. Reaching the page is necessary but not sufficient; the page still has to be worth citing. Once crawlers can get in, the work shifts to answer engine optimization: structured, quotable content that AI engines choose to surface.
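
For the server-log check, a rough count per retrieval user-agent is enough to see whether anything is getting through. This sketch assumes the user-agent string appears somewhere in each log line, as it does in common access-log formats. The sample lines, including their user-agent strings, are hypothetical.

```python
from collections import Counter

RETRIEVAL_AGENTS = ["OAI-SearchBot", "PerplexityBot", "Claude-SearchBot"]


def count_retrieval_hits(log_lines: list[str]) -> Counter:
    """Count log lines that mention each retrieval user-agent."""
    hits = Counter({agent: 0 for agent in RETRIEVAL_AGENTS})
    for line in log_lines:
        lowered = line.lower()
        for agent in RETRIEVAL_AGENTS:
            if agent.lower() in lowered:
                hits[agent] += 1
    return hits


# Hypothetical log lines for illustration; real formats and agent strings vary.
SAMPLE_LOG = [
    '203.0.113.9 - - [26/Jun/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 512 "-" "ExampleBrowser/1.0 Googlebot"',
    '203.0.113.10 - - [26/Jun/2026:10:00:05 +0000] "GET / HTTP/1.1" 200 1024 "-" "ExampleAgent/1.0 OAI-SearchBot"',
]
```

If every count stays at zero over a period in which Googlebot shows up regularly, the edge is the first place to look.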

Crawler access is the floor, not the ceiling. It gets you eligible. Earning the citation is a separate discipline, and the two together are what move AI visibility. But you cannot optimize your way into an answer if the bot never reaches the page, which is exactly why the edge check has to come first.

Frequently asked questions

Does blocking AI crawlers in robots.txt actually hide me from ChatGPT and Perplexity?

Only if you block the retrieval crawlers specifically. AI engines use separate crawlers for training (GPTBot, ClaudeBot, CCBot) and for live retrieval (OAI-SearchBot, Claude-SearchBot, PerplexityBot). Blocking a training crawler opts you out of model training but does not affect citations. Blocking a retrieval crawler removes you from that engine's cited answers entirely. Most sites that want privacy mean to block training, not retrieval.

Do most B2B SaaS companies block AI crawlers in their robots.txt?

No. In a June 2026 scan of 153 B2B SaaS sites, 148 of which returned a readable robots.txt, only 1.4% of those files blocked any AI search crawler and only about 2% blocked GPTBot. For this market, robots.txt is overwhelmingly open. If you are missing from AI answers, robots.txt is almost never the cause. The more common block happens one layer up, at the CDN or edge.

Why would my site be invisible to AI search if my robots.txt allows the crawlers?

Because robots.txt is only a request, not enforcement. A CDN or web application firewall (Cloudflare, Akamai, Fastly, DataDome) can stop an AI crawler at the edge before it ever reads your robots.txt. Since July 1, 2025, new Cloudflare domains default to blocking AI crawlers. Your origin can say allow OAI-SearchBot while your edge silently blocks it, and the edge wins.

How do I check if Cloudflare is blocking AI crawlers on my site?

Open the Cloudflare dashboard, go to Security and then Events, and filter the user-agent field for the retrieval crawlers: OAI-SearchBot, PerplexityBot, and Claude-SearchBot. If the action shown is block or managed challenge, the default AI-crawler rule is the likely cause. You can also check your origin server logs: if you see Googlebot but never OAI-SearchBot or PerplexityBot, the edge is filtering them out before they reach you.

What is the difference between a training crawler and a retrieval crawler?

A training crawler (GPTBot, ClaudeBot, Google-Extended, CCBot) collects content to train foundation models, with no link or attribution back to you. A retrieval crawler (OAI-SearchBot, Claude-SearchBot, PerplexityBot) fetches pages so the AI can cite them in a live answer, which can send you traffic and visibility. The default B2B stance is to allow retrieval crawlers and to decide on training crawlers based on your content-licensing stance.

Should I block GPTBot?

Blocking GPTBot opts your content out of OpenAI's foundation-model training and nothing else. It does not affect your eligibility for ChatGPT Search citations, which are controlled by OAI-SearchBot. Block GPTBot if you do not want your content used to train models; allow it if you do. Either way, keep OAI-SearchBot allowed so ChatGPT can still cite you.

Does an empty Disallow line block AI crawlers?

No. An empty Disallow with no path means allow everything, which is the opposite of a block. A user-agent group with only Disallow and a blank value places no restriction on that crawler. A site is only blocking the root when it has a bare Disallow: / line. This trips up many robots.txt checkers, which is why you should verify with a tool that parses the file correctly.

Kevin O'Connell
Founder & AEO Consultant, AI-Advisors.ai

20-year B2B SaaS marketer. 3x Head of Marketing. One company exit (Sapling HR acquired by Kallidus, 2021). Now building AI-Advisors.ai to give mid-market B2B teams the AI visibility tools enterprise brands get. Writing about Answer Engine Optimization, ChatGPT Ads, Microsoft Copilot SEO, and the 5 A's of AI Marketing framework.
