Which AI Engines Cite Which Sites: The Definitive Crawler and Index Map

Ivan Boss

Understanding which AI engines cite which sites is not a branding question; it is a plumbing question. The major AI answer engines do not share a single index or a single crawler. Get the plumbing wrong and your content is invisible, regardless of how well it ranks in traditional search. This article maps the index-to-engine relationships for Google AI Overviews, Gemini, ChatGPT Search, and Perplexity, so you know where your content must be present to earn a citation.

The core finding is blunt: there is no single "AI search." There are four engines fed by three separate pipelines, each with its own gate. Treating them as one leads to coverage gaps that no amount of content quality can fix.


Which Index Powers Google AI Overviews and AI Mode?

Google AI Overviews and AI Mode are powered by Gemini and draw exclusively from Google's own search index. A page must be indexed by Google — not Bing — to be eligible for citation in a Gemini-powered AI Overview.

Google began rolling out AI Overviews in the United States in May 2024, following the public announcement of Search Generative Experience (SGE) at Google I/O in May 2023. These are not separate products with separate indexes — SGE was the prototype; AI Overviews is the production rollout.

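The practical gate: confirm in Google Search Console that your page is reported as indexed (the URL Inspection tool shows "URL is on Google"). A "Discovered" status is not enough: it means Google knows the URL exists but has not indexed it yet. If the page is not indexed by Google, no amount of Bing optimization or structured data will earn it a slot in an AI Overview.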

HTTPS is a confirmed, if lightweight, Google ranking signal, and modern browsers warn users on non-secure pages. A site without a valid TLS certificate faces both a ranking disadvantage and a trust barrier before any AI citation question even arises.


Does Gemini Use a Different Index Than Google AI Overviews?

Gemini and Google AI Overviews share the same Google index lineage. They are not separate discovery pipelines. Treat them as one target: earn Google indexation, and you are eligible for both.

This distinction matters because some SEO advice frames Gemini as a separate optimization target requiring separate tactics. The index is the same. The content quality and E-E-A-T signals that help you rank in Google Search are the same signals that make you eligible for citation in Gemini's answers.

Google added "Experience" — the first E — to the original E-A-T framework in December 2022, expanding it to E-E-A-T. First-hand experience signals (author credentials, original data, documented case studies) are now a formal part of Google's quality assessment, which flows directly into what Gemini surfaces.


How Does ChatGPT Search Decide Which Sites to Cite?

ChatGPT Search retrieves results primarily from Microsoft Bing's index, so a page must be indexed in Bing to be eligible for citation in ChatGPT. There are two separate gates, and both must be open.

Gate 1: Bing indexation. Submit your sitemap to Bing Webmaster Tools. IndexNow is an open protocol, supported by Microsoft Bing, that lets a website instantly notify participating search engines when URLs are created, updated, or deleted. Using IndexNow speeds your Bing crawl eligibility.
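As a rough illustration of what an IndexNow notification looks like, here is a minimal Python sketch that builds (but does not send) a bulk submission. The host, key, and URL are placeholders, and the key file location assumes you host the key at the site root; adapt both to your own setup before submitting anything.

```python
import json
import urllib.request

# Shared IndexNow endpoint; participating engines also expose their own.
INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_indexnow_request(host, key, urls):
    """Build an IndexNow bulk-submission request without sending it."""
    payload = {
        "host": host,
        "key": key,
        # Assumption: the key file is served from the site root so the
        # receiving engine can verify that you control the host.
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": list(urls),
    }
    return urllib.request.Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


request = build_indexnow_request(
    host="www.example.com",                    # placeholder host
    key="0123456789abcdef0123456789abcdef",    # placeholder key
    urls=["https://www.example.com/new-article"],
)
# To actually submit, you would call: urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the payload easy to inspect and log before anything leaves your server.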

Gate 2: OAI-SearchBot access. OpenAI operates two distinct crawlers: GPTBot, which gathers data for model training, and OAI-SearchBot, which surfaces and links to pages in ChatGPT Search results. A website that blocks OAI-SearchBot in its robots.txt will not appear in ChatGPT Search results even if it ranks well in Bing.

Sites that set out to block only GPTBot, as an opt-out from training data collection, can inadvertently block OAI-SearchBot too when they use a blanket rule covering every OpenAI crawler. That cuts them out of ChatGPT Search citations. The two crawlers take separate directives in robots.txt and must be managed separately.
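The separation between the two crawlers can be checked mechanically. The sketch below uses Python's standard urllib.robotparser against two made-up robots.txt files: one that names only GPTBot, and one that lumps both OpenAI crawlers into the same group. The file contents and the example URL are illustrative assumptions, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url="https://www.example.com/article"):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Opts out of training only: the rule names GPTBot and nothing else.
TARGETED = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

# Blanket rule: both OpenAI crawlers share one Disallow group.
BLANKET = """\
User-agent: GPTBot
User-agent: OAI-SearchBot
Disallow: /

User-agent: *
Allow: /
"""

for name, robots in (("targeted", TARGETED), ("blanket", BLANKET)):
    result = {bot: is_allowed(robots, bot) for bot in ("GPTBot", "OAI-SearchBot")}
    print(name, result)
```

With the targeted file, GPTBot is refused while OAI-SearchBot is still allowed; with the blanket file, both are refused.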

One important clarification: IndexNow notifies Microsoft Bing and Yandex, not Google. It speeds eligibility for ChatGPT Search but has no effect on Google or Gemini visibility. To speed Google and Gemini discovery, submit a sitemap and request indexing in Google Search Console instead. Google's Indexing API is another route, but Google officially supports it only for job-posting and livestream pages.


How Does Perplexity Decide Which Sites to Cite?

Perplexity operates its own crawler, PerplexityBot, which a site must allow in robots.txt to be eligible for citation in Perplexity's answers. It does not rely solely on Google's or Bing's index for discovery.

This makes Perplexity the most independent of the four engines from an indexation standpoint. Your Google and Bing presence is irrelevant if PerplexityBot is blocked. Check your robots.txt file directly: if you have a wildcard Disallow: / rule or an explicit block on PerplexityBot, you are not eligible.

Perplexity is known for heavy citation behavior — it surfaces multiple sources per answer with visible attribution links. That visibility makes it a high-value citation target, particularly for research-oriented queries where readers actively click through to sources.

Answer Engine Optimization (AEO) is the practice of structuring content so AI answer engines such as Perplexity, ChatGPT, and Google's AI Overviews can extract and cite it directly inside their answers. Perplexity rewards the same structural signals: direct answers near the top of sections, short paragraphs, and factual density.


Which AI Engines Cite Which Sites: The Master Checklist

Understanding which AI engines cite which sites comes down to three parallel pipelines serving four engines. Here is the complete gate map:

Engine | Index Source | Crawler to Allow | Indexing Accelerator
Google AI Overviews | Google Search Index | Googlebot | Google Indexing API / Search Console
Gemini | Google Search Index | Googlebot | Google Indexing API / Search Console
ChatGPT Search | Microsoft Bing Index | OAI-SearchBot | IndexNow
Perplexity | Own index (PerplexityBot) | PerplexityBot | N/A (allow in robots.txt)

Your five-minute coverage audit:

  1. Search your target question in Google. Check whether your domain appears in the AI Overview citations panel.
  2. Search the same question in Gemini. Note whether your domain is linked in the response.
  3. Open ChatGPT Search. Run the same query. Check cited sources in the response footnotes.
  4. Open Perplexity. Run the query. Check the source panel on the right.
  5. Cross-reference any gaps against your robots.txt and your Bing Webmaster Tools index coverage report.

If you appear in Google results but not in AI Overviews, the gap is most likely content structure, not indexation. If you appear in Bing but not in ChatGPT Search, check your robots.txt for an OAI-SearchBot block. If you are absent from Perplexity, check for a PerplexityBot block.
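The gate map and the diagnosis rules above can be written down as a small lookup. This is a sketch under stated assumptions: the SiteState fields and the closed_gates helper are hypothetical names invented for illustration, and the inputs are facts you have already gathered by hand from Search Console, Bing Webmaster Tools, and your robots.txt.

```python
from dataclasses import dataclass, field


@dataclass
class SiteState:
    """What you already know about one page (hypothetical structure)."""
    google_indexed: bool
    bing_indexed: bool
    allowed_crawlers: set = field(default_factory=set)


# engine -> (index gate, crawler gate); None means no index gate in the map.
GATES = {
    "Google AI Overviews": ("google", "Googlebot"),
    "Gemini": ("google", "Googlebot"),
    "ChatGPT Search": ("bing", "OAI-SearchBot"),
    "Perplexity": (None, "PerplexityBot"),
}


def closed_gates(state):
    """Return {engine: [closed gates]}; an empty list means eligible."""
    indexed = {"google": state.google_indexed, "bing": state.bing_indexed}
    report = {}
    for engine, (index, crawler) in GATES.items():
        problems = []
        if index is not None and not indexed[index]:
            problems.append(f"not indexed by {index}")
        if crawler not in state.allowed_crawlers:
            problems.append(f"{crawler} blocked")
        report[engine] = problems
    return report


state = SiteState(
    google_indexed=True,
    bing_indexed=True,
    allowed_crawlers={"Googlebot", "PerplexityBot"},
)
for engine, problems in closed_gates(state).items():
    print(engine, "->", problems or "eligible")
```

In this example the page is indexed everywhere but OAI-SearchBot is missing from the allowed set, so only ChatGPT Search reports a closed gate. An empty list means the gates are open, not that a citation is guaranteed.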


What Content Structure Makes a Page Citable Across All Engines?

Pages that get cited across multiple AI engines share a specific structural pattern, not just topical relevance. The first sentence under each heading answers the heading question directly. Paragraphs stay short. Lists replace run-on enumerations.

Auroxa scores every article on a six-factor AEO Score totaling 100 points: hierarchical headings, Q&A density, fact density, schema completeness, declarative ratio, and citation-friendly format. Auroxa's citation-friendly format factor rewards an average paragraph length of 80 words or fewer and roughly one list per 500 words of body text.

Auroxa's AEO Q&A density factor awards full points when at least 40% of a page's H2 and H3 headings are phrased as questions. Question headings mirror the exact phrasing of queries that users type into AI engines. That match increases the probability of extraction.
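The thresholds quoted above (question headings at 40% or more, average paragraph length of 80 words or fewer, roughly one list per 500 words) are simple arithmetic. The sketch below computes those three numbers for a page you have already split into headings and paragraphs. It is not Auroxa's scoring code; the function name, the inputs, and the rule that a heading counts as a question when it ends with a question mark are all assumptions made for illustration.

```python
def structure_metrics(headings, paragraphs, list_count):
    """Compute three structural ratios for one page."""
    total_words = sum(len(p.split()) for p in paragraphs)
    # Assumption: a heading is a question if it ends with "?".
    questions = sum(1 for h in headings if h.strip().endswith("?"))
    return {
        "question_heading_share": questions / len(headings) if headings else 0.0,
        "avg_paragraph_words": total_words / len(paragraphs) if paragraphs else 0.0,
        "lists_per_500_words": list_count * 500 / total_words if total_words else 0.0,
    }


def meets_thresholds(metrics):
    """Check the metrics against the thresholds quoted in the text."""
    return (
        metrics["question_heading_share"] >= 0.40
        and metrics["avg_paragraph_words"] <= 80
        and metrics["lists_per_500_words"] >= 1
    )


headings = [
    "How does ChatGPT Search decide which sites to cite?",
    "Which index powers AI Overviews?",
    "Overview",
    "Checklist",
    "Does publishing platform matter?",
]
paragraphs = ["word " * 60, "word " * 40]  # 60-word and 40-word paragraphs
metrics = structure_metrics(headings, paragraphs, list_count=1)
print(metrics, meets_thresholds(metrics))
```

Here three of five headings are questions (a 0.6 share), the average paragraph is 50 words, and one list over 100 words works out to 5 lists per 500 words, so the sample page clears all three thresholds.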

Google's Helpful Content system, introduced in 2022, rewards content written for people over content written primarily to rank in search engines. That principle aligns with what AI engines extract: direct, specific, experience-backed answers rather than keyword-stuffed paragraphs.


How Should You Verify Your robots.txt Is Not Blocking AI Crawlers?

Open your robots.txt file at yourdomain.com/robots.txt. Look for these specific directives:

Directives that block ChatGPT Search citations:

  • User-agent: OAI-SearchBot followed by Disallow: /
  • A wildcard User-agent: * with Disallow: / and no separate User-agent: OAI-SearchBot group that allows crawling

Directives that block Perplexity citations:

  • User-agent: PerplexityBot followed by Disallow: /

Directives that affect training data but NOT citations:

  • User-agent: GPTBot with Disallow: / — this blocks OpenAI training crawls, not ChatGPT Search retrieval

The GPTBot vs. OAI-SearchBot confusion is the most common robots.txt mistake in the context of which AI engines cite which sites. They are different crawlers with different purposes, and a rule that names one does not apply to the other.
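The manual check above can be turned into a small audit. The sketch below, again using Python's standard urllib.robotparser, reports for each crawler named in this article whether a given robots.txt lets it fetch the site root, and labels what each crawler is for. The sample file is invented to show the wildcard case. Note that Python's parser is an approximation: individual crawlers may resolve conflicting Allow and Disallow rules slightly differently.

```python
from urllib.robotparser import RobotFileParser

# Crawler -> what blocking it affects, per the sections above.
CRAWLERS = {
    "Googlebot": "Google AI Overviews and Gemini citations",
    "OAI-SearchBot": "ChatGPT Search citations",
    "PerplexityBot": "Perplexity citations",
    "GPTBot": "OpenAI training only, not citations",
}


def audit_robots(robots_txt, url="/"):
    """Return {crawler: True/False} for whether each crawler may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in CRAWLERS}


# Wildcard block with an explicit group for two crawlers only.
SAMPLE = """\
User-agent: *
Disallow: /

User-agent: Googlebot
User-agent: OAI-SearchBot
Allow: /
"""

report = audit_robots(SAMPLE)
for bot, allowed in report.items():
    status = "allowed" if allowed else "BLOCKED"
    print(f"{bot:15} {status:8} {CRAWLERS[bot]}")
```

In the sample, Googlebot and OAI-SearchBot get through because they have their own group, while PerplexityBot and GPTBot fall back to the wildcard block. To audit a live site instead of a string, RobotFileParser also offers set_url() followed by read().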


Does Publishing Platform Affect AI Citation Eligibility?

Your CMS choice affects how reliably your content is crawled and indexed. WordPress powers approximately 43% of all websites, according to W3Techs, making it the most common CMS small businesses publish on.

The key technical variable is whether your content is rendered as real HTML or injected via JavaScript. Google's John Mueller has noted that client-rendered primary content is weaker for SEO. Auroxa publishes real HTML to a customer's own CMS rather than injecting content with a JavaScript overlay, precisely because JavaScript-rendered content is harder for crawlers to process reliably.
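One quick way to see the difference is to check whether a key sentence is present in the raw HTML a crawler receives before any JavaScript runs. The sketch below uses Python's standard html.parser to collect visible text while skipping script and style blocks. The two sample pages and the helper names are made up for illustration, and a real check would start from the HTML your server actually returns.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text outside <script> and <style> elements."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def text_in_raw_html(html, phrase):
    """Return True if phrase appears in the page's non-script text."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return phrase.lower() in text.lower()


# Content delivered as real HTML.
SERVER_RENDERED = (
    "<html><body><h1>Which index powers AI Overviews?</h1>"
    "<p>Google's own search index.</p></body></html>"
)
# Content that only exists after a script runs in the browser.
CLIENT_RENDERED = (
    '<html><body><div id="root"></div>'
    '<script>render("Google search index")</script></body></html>'
)

print(text_in_raw_html(SERVER_RENDERED, "search index"))
print(text_in_raw_html(CLIENT_RENDERED, "search index"))
```

The first page contains the phrase in its markup. The second only mentions it inside a script, so a crawler that does not execute JavaScript sees an empty container.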

On publish, Auroxa automatically notifies search engines by pinging Google's Indexing API and submitting the URL to IndexNow (supported by Microsoft Bing and Yandex), speeding crawl eligibility including for ChatGPT Search.

Auroxa is a Generative and Answer Engine Optimization (GEO/AEO) platform that publishes knowledge-vault-anchored content to a customer's own CMS and proves ROI through GA4 revenue attribution.


The Bottom Line on Which AI Engines Cite Which Sites

The answer to which AI engines cite which sites is not a single answer. It is three separate answers covering four engines, each with a distinct prerequisite. Google AI Overviews and Gemini require Google indexation. ChatGPT Search requires Bing indexation plus an open OAI-SearchBot directive. Perplexity requires an open PerplexityBot directive in robots.txt.

Most content advice treats "AI search" as a monolith. The practical reality of which AI engines cite which sites is that you can be perfectly optimized for one engine and completely invisible to another through a single misconfigured line in robots.txt.

The cross-engine principle is this: earn Google indexation, earn Bing indexation, and audit your robots.txt for the three crawlers that gate citations (Googlebot, OAI-SearchBot, and PerplexityBot). Those three steps open the gates. Content structure (answer-first paragraphs, question headings, short paragraphs, factual density) determines whether you get cited once the gates are open.

Understanding which AI engines cite which sites is the prerequisite for any serious AEO strategy. The engines are different. The indexes are different. The crawlers are different. Treat them that way.