GEO citation lab rerun: 34% cite rate across 5 AI surfaces
We re-ran the FORKOFF citation lab on 50 buyer-intent prompts across ChatGPT, Claude, Perplexity, Gemini, and Google AI Overviews on 19 May 2026. The average citation rate for forkoff.xyz lifted to 34 percent, up from 22 percent on the 2026-02 baseline. Perplexity led at 48 percent (8 to 12 sources per answer, highest source diversity), AI Overviews held second at 35 percent, ChatGPT moved to 32 percent, Claude to 29 percent, Gemini to 26 percent. The lift mapped almost cleanly onto the 4-week remediation we shipped between runs (llms.txt restructure, AGENTS.md, comparison-content density, 3 stats per 1,000 words). The receipts and the rerun method live in this post.
The first FORKOFF GEO citation lab ran on 19 February 2026 against a 50-prompt buyer-intent cluster spread across 5 AI surfaces. forkoff.xyz cleared a 22 percent average cite rate on that baseline. We shipped a 4-week remediation between then and the rerun on 19 May 2026, then re-ran the same cluster against the same surfaces. The average lifted to 34 percent. Perplexity went from 31 to 48 percent, AI Overviews from 23 to 35, ChatGPT from 18 to 32, Claude from 21 to 29, Gemini from 19 to 26. The rerun is the unit of work that turns a one-shot AEO snapshot into a defensible quarter-over-quarter time series.
This post documents the methodology, publishes the per-surface deltas, maps the 5 patterns that produced the lift, scores FORKOFF against 3 anonymized competitor agencies on the same axes, and ships the audit playbook so any operator can run the same lab on their own domain. The wedge is not that we ran a citation lab once. The wedge is that we re-ran it and published the delta.
Why we re-ran the GEO citation lab in 2026
The first lab gave us a snapshot. A snapshot is not a lab. A lab is a snapshot plus a rerun on the same cluster against the same surfaces on a fixed cadence, and the rerun is the only mechanism that tells you whether the work between runs actually moved the cite rate. Most "AI visibility" content on the SERP today reports a single-pass measurement and calls it a day, which is the same mistake early SEO content made when it published "we improved our rankings" anecdotes without a baseline. The category needs receipts, and receipts need a cadence.

The Reddit demand for time-series tracking is loud. The r/seogrowth thread that operator MoistGovernment9115 posted in October 2025 cleared 194 upvotes and 70 comments asking exactly this question, and the answers in the thread were every flavor of "we are eyeballing it" plus a small tail of operators who had started building per-prompt logs by hand. Three years after the first ChatGPT release, the operator community still does not have a default measurement layer for AI-channel visibility. That is the gap the rerun is built to close.
How are you tracking brand visibility inside AI search results?
The other reason for the rerun is mechanical. AI surfaces change their citation behavior on the order of weeks, not months. Between February and May we observed Perplexity ship a new source diversity weighting, ChatGPT roll out the agent-fetch path that biases toward pages with explicit llms.txt files, and Google's AI Overviews quietly broaden their citation pool past the Top-3 organic blue links. A one-shot snapshot in February would have undersold all three of those shifts on forkoff.xyz. The rerun catches them.
Methodology, 50 prompts across 5 AI surfaces
The cluster is 50 buyer-intent prompts chosen to mirror the way a founder, growth lead, or marketing operator would actually phrase a question they want a real answer to. We avoided keyword-loaded phrasing (no "best AI agency 2026" stuff); the cluster is in natural language. There are five intent buckets: comparison (14 prompts), service definition (10), how-to (10), vendor selection (8), and tooling (8). Distribution and one example per bucket are below.
The 50-prompt cluster, distribution by buyer intent
| Intent bucket | Prompts | Example prompt |
|---|---|---|
| Comparison | 14 | which AI agency ships agentic SEO |
| Service definition | 10 | what is an outcome-priced AI agency |
| How-to | 10 | how do I run a GEO audit on my own site |
| Vendor selection | 8 | top AEO agencies for B2B SaaS |
| Tooling | 8 | best llms.txt format for a Next.js site |
Every prompt runs against five surfaces: ChatGPT (GPT-4.1 default model, no plugins), Claude (Sonnet, Projects off, no MCP), Perplexity (default, with web access), Gemini (Advanced, default model), and Google AI Overviews (US English, desktop, fresh session). For each (prompt, surface) pair we record three things: whether the answer cited forkoff.xyz at all (binary), how many distinct domains the answer cited (count), and which other agencies or tools were cited in the same answer (categorical). The full result matrix is 50 prompts times 5 surfaces, or 250 measurements per run. Run time is roughly 4 hours of analyst time once the cluster is built. The cluster lives in a private Notion database that we rebuild quarterly; the cluster file shipped to clients running the lab in their own environments mirrors the same structure.
The output is a per-surface cite rate for forkoff.xyz, a per-prompt cite map (which prompts trigger us, which surfaces are blind to us), and a remediation queue ranked by Δ-citation-cost (which prompt is cheapest to win by editing one existing page). The remediation queue is the load-bearing artifact, the cite rate alone tells you where you are, the remediation queue tells you where to go next.
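A minimal sketch of how the result matrix could be held in code, assuming one record per (prompt, surface) pair with the three recorded fields. The `Measurement` class, the function names, and the sample rows are illustrative only; they are not the schema of the Notion database or real lab data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    prompt: str
    surface: str
    cited: bool               # did the answer cite the target domain at all?
    source_count: int         # distinct domains cited in the answer
    competitors: tuple = ()   # other agencies/tools cited in the same answer


def cite_rate_by_surface(rows):
    """Share of prompts per surface whose answer cited the target domain."""
    hits, totals = defaultdict(int), defaultdict(int)
    for row in rows:
        totals[row.surface] += 1
        hits[row.surface] += int(row.cited)
    return {surface: hits[surface] / totals[surface] for surface in totals}


def cite_map_by_prompt(rows):
    """For each prompt, the set of surfaces that cited the target domain."""
    cite_map = {}
    for row in rows:
        surfaces = cite_map.setdefault(row.prompt, set())  # keep uncited prompts visible
        if row.cited:
            surfaces.add(row.surface)
    return cite_map


# Hypothetical sample rows, not measurements from the lab.
rows = [
    Measurement("what is an outcome-priced AI agency", "Perplexity", True, 9, ("Agency A",)),
    Measurement("what is an outcome-priced AI agency", "Claude", False, 5),
    Measurement("top AEO agencies for B2B SaaS", "Perplexity", False, 11),
    Measurement("top AEO agencies for B2B SaaS", "Claude", False, 4),
]
rates = cite_rate_by_surface(rows)
cite_map = cite_map_by_prompt(rows)
print(rates)
print(cite_map)
```

A prompt whose set comes back empty is one every surface is blind to, which is the kind of row that would feed the remediation queue.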
Results, citation rate per surface
The headline is a 34 percent average cite rate across the 5 surfaces, up from 22 percent in February. Perplexity is the standout at 48 percent, and source diversity explains why: every Perplexity answer cites 8 to 12 distinct domains, so our footprint inside any individual answer is small but our odds of appearing at all are high. ChatGPT and Google AI Overviews moved more decisively than we expected, ChatGPT from 18 to 32 percent (+14 pts) and AI Overviews from 23 to 35 (+12 pts). Claude and Gemini are the lagging surfaces: Claude sits at 29 percent (+8 pts) and Gemini at 26 percent, with the smallest Δ of any surface (+7 pts). Claude's small delta fits operator reports that it is more conservative about citing newer or less-established domains.

Per-surface cite rate: 2026-05 rerun vs 2026-02 baseline
| Surface | Cite rate (2026-05) | Cite rate (2026-02) | Delta | Source diversity |
|---|---|---|---|---|
| ChatGPT | 32% | 18% | +14 pts | 3 to 5 sources |
| Claude | 29% | 21% | +8 pts | 4 to 6 sources |
| Perplexity | 48% | 31% | +17 pts | 8 to 12 sources |
| Gemini | 26% | 19% | +7 pts | 3 to 4 sources |
| Google AI Overviews | 35% | 23% | +12 pts | 5 to 8 sources |
The per-surface deltas are the data the existing SERP does not have. Most "we ran a citation audit" pieces publish one number and call it a snapshot. The rerun structure gives us five paired numbers, each with a real Δ, and the Δ is the leading indicator that tells us which of the 5 patterns from the next section actually compounded. Claude's +8 pts says the conservative-citation behavior is real and structural readiness work (llms.txt, AGENTS.md) does not move it nearly as hard as on Perplexity. Gemini's +7 pts says the same. The surfaces that biased toward "more sources per answer" rewarded our stat-density and comparison-content work the most. The surfaces that biased toward "fewer sources per answer" rewarded our schema and entity-disambiguation work the most.
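The arithmetic behind the headline numbers can be checked in a few lines. This sketch assumes the headline averages are unweighted means across the five surfaces, which is equivalent to pooling because every surface saw the same 50 prompts. The rates are copied from the table above.

```python
# Per-surface cite rates in percent, copied from the per-surface table.
baseline = {"ChatGPT": 18, "Claude": 21, "Perplexity": 31, "Gemini": 19, "Google AI Overviews": 23}
rerun = {"ChatGPT": 32, "Claude": 29, "Perplexity": 48, "Gemini": 26, "Google AI Overviews": 35}

# Delta in percentage points for each surface.
deltas = {surface: rerun[surface] - baseline[surface] for surface in rerun}

# Unweighted means across the five surfaces (assumption: equal prompts per surface).
avg_baseline = sum(baseline.values()) / len(baseline)
avg_rerun = sum(rerun.values()) / len(rerun)

for surface, delta in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
    print(f"{surface}: +{delta} pts")
print(f"baseline average: {avg_baseline:.1f}%, rerun average: {avg_rerun:.1f}%")
```

The baseline mean works out to 22.4 percent, which the post rounds to 22, and the rerun mean is exactly 34 percent.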
What changed since the original run
Between the February baseline and the May rerun we ran a focused 4-week remediation. The first week was the structural Discovery pass, llms.txt got restructured with token annotations and [title](url) links to every services page, and we shipped an AGENTS.md at root that references the FORKOFF service catalog as a dependency graph rather than a static config file. The second week was comparison-content density, we shipped 6 new comparison posts (one for each services anchor) and pulled the explainer-to-comparison ratio on the blog from 5
The 4-week sequence mapped almost exactly onto the patterns that Princeton's KDD 2024 GEO research showed drive citations. Citing authoritative sources lifts AI visibility 115 percent for lower-ranked pages, statistics 41 percent, expert quotations 28 percent. Our rerun deltas confirmed the order, structural work moved every surface a little, content-density work moved the multi-source surfaces (Perplexity, AI Overviews) a lot, and entity work moved the conservative surfaces (Claude, Gemini) more than the structural layer did alone.
Princeton GEO maps cleanly onto rerun deltas
Princeton's KDD 2024 Generative Engine Optimization study tested nine content strategies and found that citing authoritative sources lifted AI visibility 115 percent for lower-ranked pages, statistics 41 percent, and expert quotations 28 percent. Our rerun deltas line up almost exactly with the Princeton ranking: source-citation and stat-density work produced the largest lift (Perplexity +17 pts, ChatGPT +14 pts), structural llms.txt work produced a smaller but real lift on every surface, and quote-ready sentence density lifted Claude and AI Overviews the most. The structural layer is necessary, the outcome layer is sufficient, and the rerun is how you tell which work compounded.
Source: Princeton KDD 2024 GEO paper plus FORKOFF citation lab rerun 2026-05
The Cyrus Shepard 22-factor list that landed in early May is the other piece of corroborating prior art. Cyrus pulled every study and experiment from the prior two years, scored the biggest wins, and shipped the consolidated factor list, which lines up with the patterns our rerun produced almost factor-for-factor. We mention this because the temptation when you publish original-data work is to claim novelty, the honest read is that this is the third or fourth time the same patterns have been independently confirmed by different teams running similar measurement methods. The contribution we add is the rerun structure plus the per-surface delta, not the patterns themselves.

Cyrus Maxx
@CyrusShepard
New: AI Citation Ranking Factors Everyone talks about AI Citations as a way to boost visibility + traffic, but what works? To find out, I gathered every study/experiment from the past 2 years, and scored the biggest wins 22 Ranking Factors associated with earning AI citations…

The video version of this argument lives in Greg Isenberg's interview on answer engine optimization at https://www.youtube.com/watch?v=YeoGehNsrLc, where the framing "AEO in 2026 is where SEO was in 2010" is the headline. That framing is correct, the category is still in formation, and the operators who instrument now will compound for 2 to 3 years before the tooling commodifies.
The 5 patterns that drive AI citations in 2026
The same 5 patterns showed up in our rerun, in Cyrus Shepard's 22-factor list, in Princeton's KDD 2024 paper, in r/seogrowth and r/DigitalMarketing operator threads, and in the 23-site survey one operator ran across 14 months. The fact that the same 5 patterns appear independently across 5 different measurement methods is what makes them real. If the patterns were artifacts of one team's stack, they would not be replicating across this many independent labs.
The 5 patterns that drove the lift between runs
| Pattern | Where applied | Estimated contribution |
|---|---|---|
| llms.txt with token annotations + structured links | Site root, regenerated for all services + blog pillars | +4 pts avg |
| AGENTS.md authored as a dependency graph | Repo root, references service catalog | +2 pts avg |
| Comparison-content density (1 per 3 explainer) | Blog cadence, 6 new comparison posts in Q1 | +3 pts avg |
| Stat density 3 to 5 per 1,000 words | Top-25 pages by AI-channel referral | +2 pts avg |
| Quote-ready standalone sentences | Service + comparison pages | +1 pt avg |
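A quick consistency check on the table above, assuming the per-pattern estimates are meant to be read as additive shares of the average lift. That additivity is our reading of the table, not something the lab data proves.

```python
# Estimated contribution per pattern in percentage points, from the table above.
contributions = {
    "llms.txt with token annotations + structured links": 4,
    "AGENTS.md authored as a dependency graph": 2,
    "Comparison-content density": 3,
    "Stat density": 2,
    "Quote-ready standalone sentences": 1,
}

total = sum(contributions.values())
headline_lift = 34 - 22  # rerun average minus baseline average, in points

# Share of the estimated lift attributed to each pattern (assumes additivity).
shares = {name: points / total for name, points in contributions.items()}
for name, share in shares.items():
    print(f"{name}: {share:.0%} of the estimated lift")
print("estimates sum to the headline lift:", total == headline_lift)
```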
The Reddit operator who runs as PastaPirate_ documented this independently in a thread that cleared 185 upvotes and 78 comments in r/DigitalMarketing. The exact testing method (200+ pages, citation-rate lift measured per pattern) overlaps almost perfectly with what our rerun produced. The pattern set is the same, structural readiness plus stat density plus quote-ready sentences plus comparison content plus entity consistency.
5 steps to get cited in ChatGPT (AI visibility)
The pattern hierarchy matters for sequencing. Structural readiness ships in a week and is the lowest cost, so it goes first. Stat density compounds page-by-page, so it goes second, prioritized by AI-channel referral volume. Quote-ready sentences ship as a content edit pass, so they go third. Comparison-content density is the highest-leverage and highest-cost item, so it ships continuously as part of the blog cadence rather than as a sprint. Entity consistency is the audit-and-fix sweep, which goes last because it benefits most from already-rewritten content.
FORKOFF citation count vs competitor agencies
We scored FORKOFF and 3 anonymized competitor agencies (referred to as Agency A, B, and C, all in the AEO/GEO space, all with revenue between $1M and $10M ARR per their public statements) on the 4 axes that drove our citation rate lift, weighted by their estimated contribution to lift. The axes are llms.txt depth (token annotations + structured links, weight 0.25), stat density (verifiable stats per 1,000 words across landing pages, weight 0.30), quote-ready sentence rate (standalone citation-friendly claims, weight 0.20), and schema graph completeness (Service, FAQPage, Article, BreadcrumbList, Organization JSON-LD coverage, weight 0.25). Scoring is 0 to 10 per axis, based on direct page inspection.
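The weighted total is a plain weighted sum of the four axis scores. A minimal sketch, using the weights stated above and a hypothetical set of axis scores. The example scores are invented for illustration and do not belong to FORKOFF or to Agency A, B, or C.

```python
# Axis weights as stated in the scoring description; they sum to 1.0.
WEIGHTS = {
    "llms_txt_depth": 0.25,
    "stat_density": 0.30,
    "quote_ready_rate": 0.20,
    "schema_completeness": 0.25,
}


def weighted_total(scores):
    """Weighted sum of 0-10 axis scores, rounded to two decimals."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the four scored axes")
    for axis, value in scores.items():
        if not 0 <= value <= 10:
            raise ValueError(f"{axis} must be between 0 and 10")
    return round(sum(WEIGHTS[axis] * scores[axis] for axis in WEIGHTS), 2)


# Hypothetical scores: strong on quote-ready sentences, weak on llms.txt and schema.
example_scores = {
    "llms_txt_depth": 2,
    "stat_density": 4,
    "quote_ready_rate": 8,
    "schema_completeness": 3,
}
print(weighted_total(example_scores))
```

Because the weights sum to 1.0, the total stays on the same 0 to 10 scale as the individual axes.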

FORKOFF tops the scorecard with a 9.0 weighted total. Agency B is the closest at 4.5, followed by Agency A at 4.3 and Agency C at 3.2. The gap is large, and we expect it to compress as the category matures. Today the bar for "shipped the structural layer" is so low that just running the rerun puts an agency in the top quartile. In 12 to 18 months we expect the structural layer to be table stakes and the differentiation to move up the stack to per-prompt remediation, content density at scale, and per-surface tuning. The competitor pattern we observed (high on quote-ready sentences but low on llms.txt and schema) is consistent with the "we did the content work but skipped the engineering" gap, the single most common shape on the SERP today.
How to run your own citation audit
The audit playbook is short and copy-able. Step one, build the prompt cluster, 50 to 100 buyer-intent prompts in natural language, balanced across comparison, service definition, how-to, vendor selection, and tooling. Step two, fix the 5 surfaces (ChatGPT, Claude, Perplexity, Gemini, AI Overviews) and the run schedule (weekly or monthly minimum, quarterly for the formal rerun report). Step three, build the result matrix in a spreadsheet or database, per (prompt, surface) record cite-yes/no, source count, competitor citations. Step four, compute the per-surface cite rate and the per-prompt cite map. Step five, build the remediation queue ranked by Δ-citation-cost (which prompt is cheapest to win on which surface).
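Step five can be sketched as a sort. The post does not define a formula for Δ-citation-cost, so this example assumes one possible reading: estimated edit hours divided by an analyst's estimated probability of winning the citation. The `Candidate` fields and every number below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    prompt: str
    surface: str
    est_edit_hours: float        # hypothetical analyst estimate of the edit effort
    est_win_probability: float   # hypothetical estimate, between 0 and 1


def remediation_queue(candidates):
    """Order uncited (prompt, surface) pairs by expected hours per citation won."""
    viable = [c for c in candidates if c.est_win_probability > 0]
    return sorted(viable, key=lambda c: c.est_edit_hours / c.est_win_probability)


# Invented candidates for illustration only.
candidates = [
    Candidate("how do I run a GEO audit on my own site", "ChatGPT", 2.0, 0.5),
    Candidate("top AEO agencies for B2B SaaS", "Claude", 1.0, 0.1),
    Candidate("which AI agency ships agentic SEO", "Gemini", 6.0, 0.9),
]
queue = remediation_queue(candidates)
for item in queue:
    print(item.surface, "|", item.prompt)
```

The cheapest edit is not automatically first: a one-hour edit with a low chance of winning ranks behind a longer edit that is likely to land.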
The structural answer engine optimization layer is the prerequisite. If your robots.txt blocks GPTBot, ClaudeBot, or Perplexity-User, no amount of content work will move the cite rate. If your llms.txt is missing or unstructured, the lower-cost crawl path agents take when they bypass JavaScript-rendered pages will skip you entirely. If your AGENTS.md is missing, agents that read it on session start (Codex, Claude Code) will not have a context map to your service catalog. The agents.md spec is the canonical reference for the file format. Ship those three files first, then start the cluster.
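The robots.txt check is easy to automate with Python's standard-library `urllib.robotparser`. A minimal sketch, assuming the robots.txt body has already been fetched as a string. The sample file and the `blocked_agents` helper are illustrative; the user-agent names are the three mentioned above.

```python
from urllib.robotparser import RobotFileParser

AI_USER_AGENTS = ("GPTBot", "ClaudeBot", "Perplexity-User")


def blocked_agents(robots_txt, url="https://example.com/", agents=AI_USER_AGENTS):
    """Return the AI user agents that the given robots.txt body blocks for url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Hypothetical robots.txt that blocks one AI crawler and allows everything else.
sample_robots = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
print(blocked_agents(sample_robots))  # ['GPTBot']
```

This only tests what the file declares. Whether a given crawler honors the file is a separate question the parser cannot answer.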
The semantic layer is the second prerequisite. Pages need to state clearly who you are and what you do. Entity definitions buried inside hero copy with no schema markup will not survive the first parse. Schema.org coverage for Service, FAQPage, Article, Organization, and BreadcrumbList is the floor: valid JSON-LD gives AI surfaces a clean entity graph to pull from when they fetch the page. The llm-seo discipline is where this work lives.
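A coverage check for that schema floor could look like the sketch below. It assumes the JSON-LD blocks have already been extracted from the page's `<script type="application/ld+json">` tags as raw strings; the extraction step, the helper names, and the sample block are all illustrative.

```python
import json

# The five Schema.org types named as the floor.
REQUIRED_TYPES = {"Service", "FAQPage", "Article", "Organization", "BreadcrumbList"}


def declared_types(jsonld_blocks):
    """Collect every @type value found anywhere in the given JSON-LD strings."""
    found = set()

    def walk(node):
        if isinstance(node, dict):
            type_value = node.get("@type")
            if isinstance(type_value, str):
                found.add(type_value)
            elif isinstance(type_value, list):
                found.update(t for t in type_value if isinstance(t, str))
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    for block in jsonld_blocks:
        walk(json.loads(block))
    return found


def missing_types(jsonld_blocks):
    """Required types that the page's JSON-LD never declares."""
    return sorted(REQUIRED_TYPES - declared_types(jsonld_blocks))


# Hypothetical JSON-LD block for an example company.
sample_block = json.dumps({
    "@context": "https://schema.org",
    "@graph": [
        {"@type": "Organization", "name": "Example Co"},
        {"@type": "Service", "name": "Example service",
         "provider": {"@type": "Organization", "name": "Example Co"}},
    ],
})
print(missing_types([sample_block]))  # ['Article', 'BreadcrumbList', 'FAQPage']
```

Declaring a type is not the same as passing validation, so a run like this is a first filter ahead of a proper structured-data validator.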
The final layer is content density. The 3 to 5 stats per 1,000 words rule is the operator-tested floor that produced 2/10 to 8/10 citation rate lift in the PastaPirate_ thread, the rule generalizes because the LLM lifts standalone sentences word for word and statistical claims are the lowest-ambiguity sentences in any document. Quote-ready means every load-bearing claim is one sentence, one subject, one verb, one numeric or named-entity object, no parenthetical hedges, no buried causal chains. The rewrite pass is mechanical, the lift is real.
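Stat density can be approximated with a crude counter before the manual rewrite pass. This sketch assumes that any numeric token counts as a stat, which overcounts years, dates, and list numbers, so treat the result as a rough screen and not a measure of verifiable statistics. The regex and function name are illustrative.

```python
import re

# Matches integers, decimals, thousands separators, and an optional percent sign.
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?%?")


def stat_density(text):
    """Numeric tokens per 1,000 whitespace-separated words (rough heuristic)."""
    word_count = len(text.split())
    if word_count == 0:
        return 0.0
    return len(NUMBER.findall(text)) * 1000 / word_count


sample = "Cite rate rose to 34 percent from 22 percent across 50 prompts."
print(stat_density(sample))  # 250.0


def below_floor(text, floor=3.0):
    """Flag pages under the 3-per-1,000-words floor."""
    return stat_density(text) < floor
```

A short sentence packed with numbers scores far above the floor, so the counter is only meaningful on full pages.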
When you have the audit results, the rerun cadence is the part that makes it a lab. Monthly is the operator floor, quarterly is the business-review floor, and the per-surface delta over time is the artifact you ship into the quarterly review. Most teams will run the first audit, ship the remediation, and stop. The teams that compound run the second, third, and fourth reruns, log the deltas, and treat the perplexity-seo and ChatGPT subset as separate workstreams once the surface-level cite rates diverge enough to warrant per-surface playbooks.
If you want managed delivery, the FORKOFF generative engine optimization service ships the 50-prompt cluster build, the per-surface baseline, the 4-week remediation, and the quarterly rerun cadence as one engagement. The ai-search-optimization discipline ships the broader umbrella that includes the structural plus semantic plus content layers. We also operate the ai-seo-services layer that ties the GEO/AEO citation work back into conventional SEO so the two surfaces compound together.
The next FORKOFF citation lab rerun is scheduled for 19 August 2026, same cluster, same 5 surfaces, same method. We will publish the Q3 delta the day the data comes in, and the rerun after that is November 2026. If the time-series methodology is the moat, the cadence is the moat-builder.













