
llms.txt Explained - The 2026 Standard for AI Crawler Control

robots.txt was built for Google. llms.txt is being built for ChatGPT, Perplexity, and Claude. Here is the spec, the file format, and when you actually need one.

OpenAI's GPTBot, Anthropic's ClaudeBot, and Perplexity's PerplexityBot now account for a measurable share of automated traffic on most content sites. Robots.txt was designed for one job: tell crawlers which paths to skip. It was never built to describe what your content means. That gap is what llms.txt is trying to close.

The proposal, originally drafted by Jeremy Howard, has gained adoption across documentation sites, SaaS marketing pages, and a growing set of agency portfolios. Anthropic, Mintlify, Cloudflare, and Hugging Face all publish one. If you care about being cited by AI search, ignoring this file is the same mistake sites made when they ignored sitemap.xml in 2008.

What llms.txt actually is

llms.txt is a single markdown file at your site root, served from /llms.txt. It contains a curated index of pages an AI crawler should read to understand what your site offers. Unlike sitemap.xml, which lists every URL, llms.txt is selective. The signal is not "here is everything" but "here is what matters."

The format is plain markdown. An H1 with the site name. A blockquote summarising what the site does. Then sections with H2 headings and bullet lists of links, each link followed by a one-line description. There is also a richer variant called llms-full.txt that includes the full text of each indexed page so a model can ingest the whole site without crawling each URL separately.

The file format, with a real example

A minimal valid llms.txt looks like this:

# SARVAYA

> SARVAYA is a digital agency building websites, web apps, and AI-driven systems for businesses worldwide.

## Services

- [Web Development](https://sarvaya.in/#services): Full-stack web and mobile applications.
- [White Label Services](https://sarvaya.in/whitelabel): Outsourced production for agencies.
- [24-Hour Website](https://sarvaya.in/24hrs): One-day website delivery.

## Articles

- [SEO in 2026](https://sarvaya.in/blog/seo-in-2026): What actually moves rankings now.
- [White Label Growth Hack](https://sarvaya.in/blog/white-label-growth-hack): Why agencies use outsourcing.

That is the entire spec in working form. No XML, no schema validation, no required metadata block. The constraints are deliberate: a junior engineer should be able to write the file by hand in an afternoon, and an LLM should be able to read it without any preprocessing.
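
The llms-full.txt variant follows the same structure but inlines the content itself. A hedged sketch of one entry, assuming you paste each page's markdown body under its own heading (the placeholder text marks where the real page content would go):

# SARVAYA

> SARVAYA is a digital agency building websites, web apps, and AI-driven systems for businesses worldwide.

## 24-Hour Website

The full markdown body of https://sarvaya.in/24hrs goes here: the offer, the delivery process, pricing, and FAQs, exactly as published.

The point of the variant is the one described above: a model gets the whole site in a single fetch instead of crawling each URL separately.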

llms.txt vs robots.txt - they solve different problems

Treat them as complements, not competitors. Robots.txt is a deny-list. llms.txt is an allow-and-describe list. Most sites need both.

If your robots.txt blocks AI crawlers entirely, llms.txt is meaningless. The two files only work together when AI agents can actually reach the URLs llms.txt advertises.
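
As a concrete illustration, a minimal robots.txt that lets those crawlers through might look like this (the user-agent tokens are the ones OpenAI, Anthropic, and Perplexity publish for their crawlers; adjust the rules to match what you actually want indexed):

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

Sitemap: https://sarvaya.in/sitemap.xml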

llms.txt is what robots.txt would have looked like if the people writing it expected the reader to be a model that needs context, not a worker that just needs a path list.

What to put in llms.txt that earns AI citations

The mistake most teams make is dumping their full nav into llms.txt. That is not the point. AI crawlers already follow links. What they cannot do reliably is decide which of your 200 pages is the authoritative source on a topic.

Lead with your strongest content. For a digital agency that means the portfolio as proof of work, the services overview as commercial intent, and three or four blog posts that demonstrate genuine expertise. For a SaaS, lead with API docs, pricing, a quickstart, and a couple of architectural overviews. Each link should answer one question crisply. The description after each link is what an AI model will quote when summarising your site.

Where to host the file and how to verify it works

Drop the file at the root of your domain. It must be served as plain text or markdown with a 200 status. Keep cache lifetimes short, because you will edit this file more often than robots.txt. Run two checks after deploy:

  1. Manual fetch. curl https://yourdomain.com/llms.txt should return the file content, not your 404 page or a redirect to the homepage. SPAs that intercept all routes break this constantly. A scripted version of this check follows the list.
  2. Crawler simulation. Use a tool like Cloudflare's bot logs or your own access logs to confirm that GPTBot, ClaudeBot, and PerplexityBot are fetching the file. If they are not, your robots.txt is probably blocking them upstream.
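
The first check is easy to script. A minimal sketch in Python, assuming the requests library is installed and using a placeholder domain:

import requests

URL = "https://yourdomain.com/llms.txt"  # placeholder: use your own domain

# Do not follow redirects, so an SPA catch-all route or homepage redirect shows up as a failure
resp = requests.get(URL, allow_redirects=False, timeout=10)

assert resp.status_code == 200, f"expected 200, got {resp.status_code}"

# The file should be served as plain text or markdown, not HTML
content_type = resp.headers.get("Content-Type", "")
assert content_type.startswith(("text/plain", "text/markdown")), f"unexpected Content-Type: {content_type}"

# A real llms.txt starts with the H1, not an HTML shell
assert resp.text.lstrip().startswith("# "), "body does not look like markdown"

# Short cache lifetimes keep edits visible; inspect what you are actually sending
print("Cache-Control:", resp.headers.get("Cache-Control", "not set"))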

Common mistakes that kill the signal

Three failure modes show up in almost every site we audit. The first is shipping llms.txt with broken links to staging or unpublished content. AI crawlers treat broken links as a quality signal against the entire file. The second is overstuffing with marketing language. "Our innovative platform delivers value across the customer journey" tells a model nothing useful and gets pruned from any extracted summary. Use plain, concrete descriptions: "REST API documentation, including authentication and rate limits." The third is forgetting to update it. Treat llms.txt the same way you treat your sitemap. It belongs in your CI pipeline.
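
The link problem in particular is cheap to catch automatically. A minimal sketch of a CI check in Python, assuming requests is available and the file uses the markdown link format shown earlier:

import re
import sys
import requests

LLMS_TXT_URL = "https://yourdomain.com/llms.txt"  # placeholder: use your own domain

text = requests.get(LLMS_TXT_URL, timeout=10).text

# Extract every markdown link target from the file
urls = re.findall(r"\[[^\]]+\]\((https?://[^)\s]+)\)", text)

broken = []
for url in urls:
    try:
        # HEAD keeps the check cheap; fall back to GET for servers that reject HEAD
        resp = requests.head(url, allow_redirects=True, timeout=10)
        if resp.status_code >= 400:
            resp = requests.get(url, allow_redirects=True, timeout=10)
        if resp.status_code >= 400:
            broken.append((url, resp.status_code))
    except requests.RequestException as exc:
        broken.append((url, str(exc)))

for url, status in broken:
    print(f"BROKEN: {url} -> {status}")

# A non-zero exit fails the pipeline when any linked page is dead
sys.exit(1 if broken else 0)

Wire it into the same pipeline step that regenerates your sitemap and the file stays honest without anyone having to remember it.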

Should you ship one this quarter

If your analytics show any referral traffic from ChatGPT, Perplexity, or Google's AI Overviews, the answer is yes. The file takes an afternoon to write and costs almost nothing to maintain. The downside risk is zero. The upside is that AI agents start citing your canonical pages instead of guessing which subpage to read. We ship one with every site we build at SARVAYA, including the 24-hour website service. The longer you wait, the longer your competitors get to shape the next generation of AI search results without you.