Methodology -- 8-Signal AI Citation Infrastructure Audit

All tests were conducted via live curl requests from a residential US network endpoint (Windows 11) on March 29, 2026, at 9:00 PM MST. No assumptions were made -- every signal required a live HTTP response meeting specific criteria.

8 Signal Definitions

| Signal | Test Method | Pass Criteria |
|---|---|---|
| MCP Server | curl -sI https://site/.well-known/mcp.json | HTTP 200 AND content-type: application/json |
| llms.txt | curl -s https://site/llms.txt (GET, follow redirects) | HTTP 200 AND body starts with # or ## (markdown), NOT < (HTML) |
| Clean-Room HTML | Compare response size/content: default UA vs GPTBot UA | Same content-length AND no empty SPA shell, OR clearly SSR for both UAs |
| AI Content Feed | Try /.well-known/ai-content-index.json, /for-ai, /for-ai.txt | HTTP 200 AND content-type: application/json or text/plain |
| JSON-LD | curl -sL https://site/ then grep for application/ld+json | 1+ matches on homepage |
| Sub-100ms TTFB | 3-hit warm-cache protocol (see below) | Compensated server-side TTFB under 200ms (equivalent to sub-100ms from an AI datacenter) |
| 10+ AI Bots Allowed | Fetch /robots.txt; count bots under Allow: with no Disallow: / | 10+ distinct AI bots explicitly allowed (Disallow-only does not count) |
| HTTP/3 | curl -sI https://site/; check alt-svc header | Header present and contains h3 |
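
To make the pass criteria concrete, the sketch below runs three of the header-level checks (MCP Server, llms.txt, and HTTP/3) against one site. It is a minimal bash illustration, not the audit harness itself; the TARGET value and the pass/fail output format are placeholders.

```bash
#!/usr/bin/env bash
# Minimal sketch of three header-level signal checks; not the audit harness.
TARGET="https://example.com"   # hypothetical target site

# MCP Server: HTTP 200 and a JSON content-type on the well-known manifest
mcp_headers=$(curl -sI "$TARGET/.well-known/mcp.json")
if echo "$mcp_headers" | head -1 | grep -q " 200" && \
   echo "$mcp_headers" | grep -qi "^content-type:.*application/json"; then
  echo "MCP Server: PASS"
else
  echo "MCP Server: FAIL"
fi

# llms.txt: HTTP 200 and a body that starts with markdown (#), not HTML (<).
# Simplified: status and body are fetched with two separate requests.
llms_body=$(curl -sL "$TARGET/llms.txt")
llms_code=$(curl -sL -o /dev/null -w "%{http_code}" "$TARGET/llms.txt")
case "$llms_body" in
  "#"*) starts_md=1 ;;
  *)    starts_md=0 ;;
esac
if [ "$llms_code" = "200" ] && [ "$starts_md" = "1" ]; then
  echo "llms.txt: PASS"
else
  echo "llms.txt: FAIL"
fi

# HTTP/3: alt-svc header present and advertising h3
if curl -sI "$TARGET/" | grep -i "^alt-svc:" | grep -q "h3"; then
  echo "HTTP/3: PASS"
else
  echo "HTTP/3: FAIL"
fi
```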

3-Hit Warm-Cache TTFB Protocol

TTFB measurements use a 3-hit warm-cache protocol: each endpoint is requested 3 times and the fastest (lowest-TTFB) response is kept. This simulates AI datacenter cache behavior rather than penalizing cold-start latency.

For each request, the raw TTFB is measured using curl -w "%{time_starttransfer}". Per-request TCP connect time (%{time_connect}) is subtracted from raw TTFB to isolate server-side processing time.

The resulting compensated TTFB represents the server-side response latency that an AI datacenter crawler (with near-zero network hop) would observe.
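
A minimal sketch of that measurement, assuming bash and curl; the loop count matches the protocol above, while the variable names and output format are illustrative:

```bash
#!/usr/bin/env bash
# 3-hit warm-cache TTFB sketch: keep the best of three compensated measurements.
TARGET="https://example.com/"   # hypothetical target

best_ms=999999
for i in 1 2 3; do
  # Raw TTFB and TCP connect time in seconds, e.g. "0.412 0.055"
  read -r ttfb connect < <(curl -s -o /dev/null \
      -w "%{time_starttransfer} %{time_connect}" "$TARGET")
  # Compensated server-side TTFB = raw TTFB minus per-request connect time
  ms=$(awk -v t="$ttfb" -v c="$connect" 'BEGIN { printf "%d", (t - c) * 1000 }')
  [ "$ms" -lt "$best_ms" ] && best_ms=$ms
done

echo "Compensated TTFB (best of 3): ${best_ms}ms"
[ "$best_ms" -lt 200 ] && echo "Sub-100ms TTFB signal: PASS" || echo "Sub-100ms TTFB signal: FAIL"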

Calibration Baseline

Before the audit, a calibration phase established the network overhead baseline for the test endpoint.
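
A sketch of what such a calibration can look like, assuming TCP connect time to a reference host as the overhead proxy; the reference host and sample count are assumptions, not the audit's actual calibration targets:

```bash
#!/usr/bin/env bash
# Hypothetical calibration sketch: sample connect time and raw TTFB against
# a reference host to characterize the residential network's baseline overhead.
REF="https://www.cloudflare.com/"   # assumed reference host

for i in 1 2 3 4 5; do
  curl -s -o /dev/null \
    -w "connect=%{time_connect}s ttfb=%{time_starttransfer}s\n" "$REF"
done
```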

Clean-Room HTML Cloaking Detection

Each site's homepage was fetched twice:

  1. Default user-agent: curl/8.x (standard library UA)
  2. GPTBot user-agent: Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.0; +https://openai.com/gptbot)

Sites pass clean-room HTML if both user agents receive the same content-length and neither response is an empty SPA shell, or if the page is clearly server-side rendered for both.

Sites that serve materially different content to AI crawlers (e.g., RentCafe: 5.5KB to humans vs 299KB to GPTBot) are flagged as active cloaking.
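
A minimal sketch of the two-fetch comparison, assuming a simple content-length heuristic; the divergence threshold below is illustrative, not the audit's definition of "materially different":

```bash
#!/usr/bin/env bash
# Clean-room HTML sketch: fetch the homepage under two user agents and
# compare body sizes. The GPTBot UA string matches the one quoted above.
TARGET="https://example.com/"   # hypothetical target
GPTBOT_UA="Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"

default_size=$(curl -sL "$TARGET" | wc -c)
gptbot_size=$(curl -sL -A "$GPTBOT_UA" "$TARGET" | wc -c)

echo "default UA: ${default_size} bytes, GPTBot UA: ${gptbot_size} bytes"
# Flag a large divergence for manual review (threshold is illustrative)
diff=$(( default_size > gptbot_size ? default_size - gptbot_size : gptbot_size - default_size ))
if [ "$diff" -gt $(( default_size / 2 )) ]; then
  echo "Possible cloaking: responses differ materially between user agents"
fi
```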

Robots.txt Bot Counting

The "10+ AI Bots Allowed" signal counts named AI crawlers in robots.txt that have explicit Allow: directives (or are listed under User-agent: blocks without a blanket Disallow: /).

Crawlers listed only under Disallow: do not count; the signal measures deliberate AI openness, not the mere mention of AI bots. Most audited sites (76+) do not name or address AI crawlers in their robots.txt at all.
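
A rough sketch of the counting logic, assuming a fixed list of known AI crawler names; the sample list below is shorter than the audit's actual list, and the block parsing is simplified (it ignores multi-agent blocks):

```bash
#!/usr/bin/env bash
# Robots.txt bot-count sketch: count named AI crawlers whose User-agent
# block does not contain a blanket "Disallow: /".
TARGET="https://example.com"   # hypothetical target
AI_BOTS="GPTBot ClaudeBot PerplexityBot Google-Extended CCBot"   # sample list

robots=$(curl -sL "$TARGET/robots.txt")
allowed=0
for bot in $AI_BOTS; do
  # Extract this bot's block: from its User-agent line to the next blank line
  block=$(echo "$robots" | awk -v bot="$bot" '
    tolower($0) ~ "^user-agent:[ \t]*" tolower(bot) "$" { found=1 }
    found && /^[ \t]*$/ { exit }
    found { print }')
  # Count the bot if its block exists and is not fully disallowed
  if [ -n "$block" ] && ! echo "$block" | grep -qi "^disallow:[ \t]*/[ \t]*$"; then
    allowed=$((allowed + 1))
  fi
done

echo "Explicitly allowed AI bots: $allowed (signal passes at 10+)"
```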

Methodology Notes

Manifest SHA-256: 3aedf7a8354df104a4cb3edfad42e70027e5bbd5485becc3f174bff3d64500da
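
The published hash lets readers verify a downloaded copy of the audit manifest; a one-line check, assuming the file is saved locally (the filename audit-manifest.json is hypothetical):

```bash
# Verify a local copy of the manifest against the published SHA-256
echo "3aedf7a8354df104a4cb3edfad42e70027e5bbd5485becc3f174bff3d64500da  audit-manifest.json" | sha256sum -c -
```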
