The New Gatekeepers: How AI Chooses Its Sources and What It Means for Public Relations

Today, large language models (LLMs) like ChatGPT, Gemini, and Claude sit between users and the web. They interpret questions and deliver direct answers, often without the user ever clicking a link. For communications teams, the challenge is no longer just ranking in search. It is ensuring that when an AI goes looking, your brand is discoverable in a world where the first “reader” is not human, but a machine.

This shift is fuelling a new discipline: generative engine optimisation (GEO). Where SEO ensured visibility to human eyes, GEO is about making your client’s or brand’s voice citable by AI.

How Search Engines Work

Search engines remain a foundation of online discovery. Google and its peers crawl the open web continuously, sending automated bots to index pages, and then rank those pages for each query. The ranking depends on a blend of factors: backlinks, originality, domain authority, relevance and structured data.

The PR playbook under this system was well established. You placed stories in outlets that Google trusted, ensured a steady rhythm of authoritative mentions, and built digital content that could climb to the top of the results page. Visibility depended on optimising for the crawler and the algorithm.

How Large Language Models Source Data

LLMs function differently. Instead of maintaining a live index, they are trained periodically on large datasets, each of which captures the web at a point in time. Their “knowledge” depends on the material included in those datasets, as well as any fine-tuning through licensed content.

  • Public web scraping provides the largest share. Models are trained on massive scraped repositories such as Common Crawl, which capture a snapshot of the web, including blogs, forums and news sites.
  • Books and public domain texts, including Project Gutenberg and datasets like Books3, feed structured narrative material into the models.
  • Community platforms such as Reddit and Stack Exchange supply conversational tone and problem-solving discourse (Oxylabs).
  • Scientific and code repositories like ArXiv and GitHub give technical precision.
  • Licensed news archives increasingly feature, as publishers strike deals. OpenAI has agreements with the Associated Press, Axel Springer and others (AP).

Nearly every major LLM relies heavily on Wikipedia, Common Crawl and large public forums. Sentisight research found that ChatGPT drew almost half of its citations from Wikipedia, while Google’s AI Overviews leaned heavily on Reddit (21 per cent) and YouTube (19 per cent). Perplexity, another rising player, cited Reddit in nearly half of its answers.

Differences also exist. OpenAI blends broad web scraping with licensed content. Gemini integrates text, images, video and audio from Google’s ecosystem. Anthropic’s Claude has emphasised open and safe sources, although a lawsuit from Reddit claims it scraped content without permission. Meta’s Llama model uses a blend of Common Crawl, Wikipedia, Project Gutenberg, Books3, ArXiv and Stack Exchange.

Reddit as an LLM Source and Its Risks

Reddit stands out because of its scale and conversational tone. It is invaluable for training models to mimic natural human language. Licensing deals with OpenAI and Google have cemented its role as a cornerstone of LLM data.

Yet Reddit is user-generated. Its content can be brilliant, authentic and diverse. It can also be misleading, biased, or outright false. When LLMs absorb Reddit data, they inherit not only the tone but also the flaws. This creates a tension: the appearance of authenticity without the guarantee of accuracy.

For PR, this duality is significant. Participation in Reddit conversations can improve visibility with AI systems, but it may also place a brand within narratives that are less reliable, less curated and harder to control.

As Reddit’s chief executive Steve Huffman observes, “The world needs community and shared knowledge, and that’s what we do best.”

SEO versus GEO

Traditional SEO sought visibility with human readers. The objective was to appear on page one of a Google search. GEO shifts the focus towards machine readability and citability.

To be useful to an LLM, content must be clear, structured and trustworthy. Press releases, white papers and executive profiles need to be formatted in ways that models can parse easily. Metadata, transcripts and structured summaries are now as important as compelling prose.
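
As an illustration of what “structured” can mean in practice, here is a minimal Python sketch that turns the basic facts of a press release into a JSON-LD description using the schema.org NewsArticle vocabulary. The function name, the sample company and the URL are hypothetical, and the choice of schema.org is an assumption on our part; this is one possible approach, not a format any particular AI system is known to require.

```python
import json
from datetime import date


def press_release_jsonld(headline: str, summary: str, published: date,
                         org_name: str, url: str) -> str:
    """Return a schema.org NewsArticle description as a JSON-LD string.

    The helper name and field selection are illustrative assumptions.
    """
    data = {
        "@context": "https://schema.org",
        "@type": "NewsArticle",
        "headline": headline,
        "description": summary,                    # short structured summary
        "datePublished": published.isoformat(),    # ISO 8601 date, e.g. 2025-01-15
        "url": url,
        "publisher": {"@type": "Organization", "name": org_name},
    }
    return json.dumps(data, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    # Hypothetical example values, for illustration only.
    print(press_release_jsonld(
        headline="Example Co appoints new chief executive",
        summary="Example Co has named a new chief executive, effective next month.",
        published=date(2025, 1, 15),
        org_name="Example Co",
        url="https://example.com/press/new-chief-executive",
    ))
```

The resulting text would typically sit inside a script tag of type application/ld+json on the press release page, alongside the human-readable copy.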

Whereas SEO rewarded popularity and linking, GEO rewards clarity, authority and accessibility. In practice, this means PR professionals must think not just about journalists and readers, but also about the machines that may quote their clients.

As Sam Altman of OpenAI puts it: “In the AI layer of the web, you don’t win by simply publishing. You win by being retrievable, parsable, and citable.”

Implications for PR Professionals

Quick Takeaways for PR Pros:

  • Earned media matters more than ever. LLMs tend to favour it.
  • Format content for machines: clarity, structure, citations.
  • Monitor AI mentions, not just search rankings.
  • Engage communities like Reddit carefully.
  • Use licensing controls mindfully to protect client voice.

For In-House Teams:

  • Collaborate across departments to ensure reputational messaging is accessible and structured.
  • Publish FAQs and spokesperson statements in formats that can be easily cited.

For Agencies:

  • Package client thought leadership into GEO-friendly formats such as press kits and summaries.
  • Track “Share of Model” as a KPI, alongside Share of Voice, to measure how often a brand appears in AI answers.
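
One simple way to put a number on “Share of Model” is sketched below in Python. It assumes a definition of our own for illustration: the fraction of collected AI answers that mention the brand at least once. The function name and the sample answers are hypothetical, and collecting the answers themselves (for example by running a fixed set of prompts each month) is left outside the sketch.

```python
import re
from typing import Iterable


def share_of_model(answers: Iterable[str], brand: str) -> float:
    """Fraction of answers that mention the brand at least once.

    Matching is case-insensitive and whole-word, so "Acme" matches
    "ACME's" but not "Acmeville". This definition is an assumption.
    """
    pattern = re.compile(r"(?<!\w)" + re.escape(brand) + r"(?!\w)", re.IGNORECASE)
    collected = list(answers)
    if not collected:
        return 0.0  # no answers collected, so nothing to measure
    hits = sum(1 for text in collected if pattern.search(text))
    return hits / len(collected)


# Hypothetical AI answers gathered for a fixed set of prompts.
sample_answers = [
    "Acme and Globex both offer this service.",
    "Globex is the market leader.",
    "Analysts often cite ACME's annual report.",
    "No clear leader exists in this sector.",
]

if __name__ == "__main__":
    print(share_of_model(sample_answers, "Acme"))  # 2 of 4 answers mention Acme: 0.5
```

Tracked over time against the same prompts, a figure like this can be reported next to Share of Voice.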

PR has always adapted to gatekeepers, whether newspaper editors, television producers, or search algorithms. The next gatekeepers are not human at all. They are AI systems that filter, interpret, and present information before it ever reaches the public.

For PR professionals, the work is the same at its core: making sure your clients’ voices are heard. The difference is that the first audience isn’t just people; it’s the machines deciding which voices matter.

The brands that learn to speak fluently to both will define the next era of influence.

Is your team preparing for that?


Curzon PR is a London-based PR firm working with clients globally. If you have any questions, please feel free to contact our Business Development Team at info@curzonpr.com.