Methodology -- 8-Signal AI Citation Infrastructure Audit

All tests were conducted via live curl requests from a residential US network endpoint (Windows 11) on March 29, 2026, at 9:00 PM MST. No assumptions were made -- every signal required a live HTTP response meeting specific criteria.

8 Signal Definitions

| Signal | Test Method | Pass Criteria |
|---|---|---|
| MCP Server | curl -sI https://site/.well-known/mcp.json | HTTP 200 AND content-type: application/json |
| llms.txt | curl -s https://site/llms.txt (GET, follow redirects) | HTTP 200 AND body starts with # or ## (markdown), NOT < (HTML) |
| Clean-Room HTML | Compare response size/content: default UA vs GPTBot UA | Same content-length AND no empty SPA shell, OR clearly SSR for both UAs |
| AI Content Feed | Try /.well-known/ai-content-index.json, /for-ai, /for-ai.txt | HTTP 200 AND content-type: application/json or text/plain |
| JSON-LD | curl -sL https://site/ then grep for application/ld+json | 1+ matches on homepage |
| Sub-100ms TTFB | 3-hit warm-cache protocol (see below) | Compensated server-side TTFB under 200ms (equivalent to sub-100ms from an AI datacenter) |
| 10+ AI Bots Allowed | Fetch /robots.txt; count bots under Allow: with no Disallow: / | 10+ distinct AI bots explicitly allowed (Disallow-only does not count) |
| HTTP/3 | curl -sI https://site/; check alt-svc header | Header present and contains h3 |
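
To make the pass criteria concrete, the sketch below runs three of the header-level checks (MCP Server, llms.txt, and HTTP/3) against one site. It is a minimal bash illustration, not the audit harness itself; the TARGET value and the pass/fail output format are placeholders.

```bash
#!/usr/bin/env bash
# Minimal sketch of three header-level signal checks; not the audit harness.
TARGET="https://example.com"   # hypothetical target site

# MCP Server: HTTP 200 and a JSON content-type on the well-known manifest
mcp_headers=$(curl -sI "$TARGET/.well-known/mcp.json")
if echo "$mcp_headers" | head -1 | grep -q " 200" && \
   echo "$mcp_headers" | grep -qi "^content-type:.*application/json"; then
  echo "MCP Server: PASS"
else
  echo "MCP Server: FAIL"
fi

# llms.txt: HTTP 200 and a body that starts with markdown (#), not HTML (<).
# Simplified: status and body are fetched with two separate requests.
llms_body=$(curl -sL "$TARGET/llms.txt")
llms_code=$(curl -sL -o /dev/null -w "%{http_code}" "$TARGET/llms.txt")
case "$llms_body" in
  "#"*) starts_md=1 ;;
  *)    starts_md=0 ;;
esac
if [ "$llms_code" = "200" ] && [ "$starts_md" = "1" ]; then
  echo "llms.txt: PASS"
else
  echo "llms.txt: FAIL"
fi

# HTTP/3: alt-svc header present and advertising h3
if curl -sI "$TARGET/" | grep -i "^alt-svc:" | grep -q "h3"; then
  echo "HTTP/3: PASS"
else
  echo "HTTP/3: FAIL"
fi
```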

3-Hit Warm-Cache TTFB Protocol

TTFB measurements use a 3-hit warm-cache protocol: each endpoint is requested 3 times and the fastest (lowest-TTFB) response is kept. This simulates AI datacenter cache behavior rather than penalizing cold-start latency.

For each request, the raw TTFB is measured using curl -w "%{time_starttransfer}". Per-request TCP connect time (%{time_connect}) is subtracted from raw TTFB to isolate server-side processing time.

The resulting compensated TTFB represents the server-side response latency that an AI datacenter crawler (with near-zero network hop) would observe.
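
A minimal sketch of that measurement, assuming bash and curl; the loop count matches the protocol above, while the variable names and output format are illustrative:

```bash
#!/usr/bin/env bash
# 3-hit warm-cache TTFB sketch: keep the best of three compensated measurements.
TARGET="https://example.com/"   # hypothetical target

best_ms=999999
for i in 1 2 3; do
  # Raw TTFB and TCP connect time in seconds, e.g. "0.412 0.055"
  read -r ttfb connect < <(curl -s -o /dev/null \
      -w "%{time_starttransfer} %{time_connect}" "$TARGET")
  # Compensated server-side TTFB = raw TTFB minus per-request connect time
  ms=$(awk -v t="$ttfb" -v c="$connect" 'BEGIN { printf "%d", (t - c) * 1000 }')
  [ "$ms" -lt "$best_ms" ] && best_ms=$ms
done

echo "Compensated TTFB (best of 3): ${best_ms}ms"
[ "$best_ms" -lt 200 ] && echo "Sub-100ms TTFB signal: PASS" || echo "Sub-100ms TTFB signal: FAIL"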

Calibration Baseline

Before the audit, a calibration phase established the network overhead baseline for the test endpoint.
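
A sketch of what such a calibration can look like, assuming TCP connect time to a reference host as the overhead proxy; the reference host and sample count are assumptions, not the audit's actual calibration targets:

```bash
#!/usr/bin/env bash
# Hypothetical calibration sketch: sample connect time and raw TTFB against
# a reference host to characterize the residential network's baseline overhead.
REF="https://www.cloudflare.com/"   # assumed reference host

for i in 1 2 3 4 5; do
  curl -s -o /dev/null \
    -w "connect=%{time_connect}s ttfb=%{time_starttransfer}s\n" "$REF"
done
```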

Clean-Room HTML Cloaking Detection

Each site's homepage was fetched twice:

  1. Default user-agent: curl/8.x (standard library UA)
  2. GPTBot user-agent: Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.0; +https://openai.com/gptbot)

Sites pass clean-room HTML if both user agents receive the same content-length and neither response is an empty SPA shell, or if the page is clearly server-side rendered for both.

Sites that serve materially different content to AI crawlers (e.g., RentCafe: 5.5KB to humans vs 299KB to GPTBot) are flagged as active cloaking.
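
A minimal sketch of the two-fetch comparison, assuming a simple content-length heuristic; the divergence threshold below is illustrative, not the audit's definition of "materially different":

```bash
#!/usr/bin/env bash
# Clean-room HTML sketch: fetch the homepage under two user agents and
# compare body sizes. The GPTBot UA string matches the one quoted above.
TARGET="https://example.com/"   # hypothetical target
GPTBOT_UA="Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"

default_size=$(curl -sL "$TARGET" | wc -c)
gptbot_size=$(curl -sL -A "$GPTBOT_UA" "$TARGET" | wc -c)

echo "default UA: ${default_size} bytes, GPTBot UA: ${gptbot_size} bytes"
# Flag a large divergence for manual review (threshold is illustrative)
diff=$(( default_size > gptbot_size ? default_size - gptbot_size : gptbot_size - default_size ))
if [ "$diff" -gt $(( default_size / 2 )) ]; then
  echo "Possible cloaking: responses differ materially between user agents"
fi
```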

Robots.txt Bot Counting

The "10+ AI Bots Allowed" signal counts named AI crawlers in robots.txt that have explicit Allow: directives (or are listed under User-agent: blocks without a blanket Disallow: /).

Crawlers listed only under Disallow: do not count; the signal measures deliberate AI openness, not the mere mention of AI bots. Most audited sites (76+) do not name or address AI crawlers in their robots.txt at all.
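
A rough sketch of the counting logic, assuming a fixed list of known AI crawler names; the sample list below is shorter than the audit's actual list, and the block parsing is simplified (it ignores multi-agent blocks):

```bash
#!/usr/bin/env bash
# Robots.txt bot-count sketch: count named AI crawlers whose User-agent
# block does not contain a blanket "Disallow: /".
TARGET="https://example.com"   # hypothetical target
AI_BOTS="GPTBot ClaudeBot PerplexityBot Google-Extended CCBot"   # sample list

robots=$(curl -sL "$TARGET/robots.txt")
allowed=0
for bot in $AI_BOTS; do
  # Extract this bot's block: from its User-agent line to the next blank line
  block=$(echo "$robots" | awk -v bot="$bot" '
    tolower($0) ~ "^user-agent:[ \t]*" tolower(bot) "$" { found=1 }
    found && /^[ \t]*$/ { exit }
    found { print }')
  # Count the bot if its block exists and is not fully disallowed
  if [ -n "$block" ] && ! echo "$block" | grep -qi "^disallow:[ \t]*/[ \t]*$"; then
    allowed=$((allowed + 1))
  fi
done

echo "Explicitly allowed AI bots: $allowed (signal passes at 10+)"
```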

Methodology Notes

Manifest SHA-256: 3aedf7a8354df104a4cb3edfad42e70027e5bbd5485becc3f174bff3d64500da
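
The published hash lets readers verify a downloaded copy of the audit manifest; a one-line check, assuming the file is saved locally (the filename audit-manifest.json is hypothetical):

```bash
# Verify a local copy of the manifest against the published SHA-256
echo "3aedf7a8354df104a4cb3edfad42e70027e5bbd5485becc3f174bff3d64500da  audit-manifest.json" | sha256sum -c -
```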
