A robots.txt file is a plain text file placed at the root of your domain (e.g., https://example.com/robots.txt) that gives crawl instructions to search engine bots (Googlebot, Bingbot, etc.). It controls what can be crawled, helps conserve crawl budget, and points bots to your XML sitemap.
Important: robots.txt controls crawling, not indexing. To stop pages from appearing in search, use noindex (meta tag or HTTP header) and/or remove internal links.
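For reference, the two forms of noindex look like this (use whichever fits your stack):
Meta tag in the page's <head>: <meta name="robots" content="noindex">
HTTP response header on the URL: X-Robots-Tag: noindex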
Why robots.txt matters for SEO
- Crawl budget optimization: Prioritize important URLs by blocking thin/duplicate/unnecessary paths.
- Server performance: Reduce bot hits on resource-heavy or dynamic URLs.
- Cleaner SERPs: keeping bots away from junk pages indirectly reduces their exposure in results.
- Sitemap discovery: Help bots find your content faster.
Basic syntax & directives
User-agent: <bot-name or *>
Allow: <path>
Disallow: <path>
Sitemap: https://example.com/sitemap.xml
# Optional (Google ignores Crawl-delay; some other engines honor it):
# Crawl-delay: <seconds>
A bot follows the most specific User-agent group that matches it, and for each URL the most specific (longest) matching rule wins; when Allow and Disallow tie, Google uses the less restrictive (Allow) rule.
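For example, with a hypothetical /private/ section, the longer (more specific) rule decides each URL:
User-agent: *
Disallow: /private/
Allow: /private/annual-report.html
# /private/pricing.html -> blocked (Disallow: /private/ is the longest match)
# /private/annual-report.html -> allowed (the Allow rule is more specific)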
Quick examples
1) Minimal, SEO-safe starter
User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml
Allows everything to be crawled and advertises your sitemap.
2) Block admin area but allow needed AJAX (WordPress)
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml
3) eCommerce (block faceted/filter parameters)
User-agent: *
Disallow: /*?color=
Disallow: /*?size=
Disallow: /*?sort=
Disallow: /cart/
Disallow: /checkout/
Allow: /products/
Sitemap: https://example.com/sitemap.xml
4) SaaS app (separate marketing vs. app)
User-agent: *
Disallow: /app/
Disallow: /api/
Allow: /
Sitemap: https://example.com/sitemap.xml
5) Staging or pre-launch (crawling blocked)
User-agent: *
Disallow: /
For true privacy, also use HTTP auth; robots.txt isn’t a security feature.
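A minimal sketch of that extra layer, assuming an nginx-served staging host (the server name, realm, and htpasswd path are placeholders):
server {
    server_name staging.example.com;
    location / {
        auth_basic "Restricted staging";            # prompt for credentials
        auth_basic_user_file /etc/nginx/.htpasswd;  # created with the htpasswd tool
    }
}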
6) Multilingual subfolders
User-agent: *
Disallow: /search/
Allow: /
Sitemap: https://example.com/sitemap_en.xml
Sitemap: https://example.com/sitemap_es.xml
Sitemap: https://example.com/sitemap_fr.xml
What NOT to do with robots.txt
- Don’t block pages you want to deindex. If a page is blocked, Google can’t see its noindex; let it be crawled with noindex in place until it drops out of the index.
- Don’t rely on it for security. Blocked URLs can still be accessed directly and may still show up in results as a bare URL without a description.
- Don’t blanket-block resources (CSS/JS). Google needs them to render pages properly; blocking them can hurt rankings (see the snippet after this list).
- Don’t place it anywhere but the root. It must live at example.com/robots.txt, and each subdomain needs its own file.
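As an illustration with a hypothetical /assets/ directory, block narrowly instead of the whole folder:
# Risky: would block CSS/JS that Google needs for rendering
# User-agent: *
# Disallow: /assets/

# Safer: block only what must stay out of crawling, keep CSS/JS reachable
User-agent: *
Disallow: /assets/private/
Allow: /assets/*.css
Allow: /assets/*.js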
How robots.txt interacts with indexing
- Crawl allowed + indexable → page can appear in search.
- Crawl disallowed → Google may still index the URL (without content) if linked externally.
- To truly prevent indexing → allow crawling, then add noindex on the page or via HTTP header; or password-protect/remove it.
Testing & validation
- Use your search engine’s robots testing tool (e.g., in Search Console).
- Test specific user-agents (e.g., Googlebot, Googlebot-Image).
- After changes, fetch the file again and monitor Crawl Stats and Coverage reports.
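For a quick local check, Python's standard library can fetch the live file and answer per-agent can-fetch questions (the domain and URLs below are placeholders; note the stdlib parser follows the basic spec and may not mirror Google's wildcard handling exactly):
import urllib.request
import urllib.robotparser

robots_url = "https://example.com/robots.txt"  # placeholder domain

# Confirm the file is reachable and returns HTTP 200 before trusting the rules
with urllib.request.urlopen(robots_url) as resp:
    print(robots_url, "->", resp.status)

parser = urllib.robotparser.RobotFileParser(robots_url)
parser.read()  # downloads and parses the rules

for agent, url in [
    ("Googlebot", "https://example.com/products/widget"),
    ("Googlebot-Image", "https://example.com/wp-admin/admin-ajax.php"),
]:
    verdict = "allowed" if parser.can_fetch(agent, url) else "blocked"
    print(f"{agent} {url} -> {verdict}")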
Best practices checklist
- File lives at the domain root (e.g., example.com/robots.txt) and returns HTTP 200.
- Include Sitemap directive(s).
- Don’t block JS/CSS needed for rendering.
- Disallow thin/duplicate: faceted parameters, internal search, session IDs.
- Keep rules simple and specific; avoid overlapping patterns.
- Version-control the file; document reasons for each rule.
- Re-test after each deployment.
Ready-to-use templates
WordPress (general)
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /?s=
Disallow: /search/
Sitemap: https://example.com/sitemap.xml
WooCommerce / eCommerce
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /*?orderby=
Disallow: /*?price=
Disallow: /*?rating=
Allow: /product/
Sitemap: https://example.com/sitemap.xml
Blog / News
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /tag/*?*
Disallow: /category/*?*
Disallow: /?s=
Sitemap: https://example.com/sitemap_index.xml
Headless / SPA with app subpath
User-agent: *
Disallow: /app/
Disallow: /api/
Allow: /
Sitemap: https://example.com/sitemap.xml
Staging (block all)
User-agent: *
Disallow: /
(Also use password protection.)
FAQs
Q1. Where do I put robots.txt?
At the root: https://yourdomain.com/robots.txt. Subdomains need their own file.
Q2. Can I target specific bots?
Yes. Use their user-agent name:
User-agent: Googlebot
Disallow: /experimental/
Q3. Does Crawl-delay work?
Some engines (e.g., Bing) honor it; Google ignores it and manages crawl rate automatically based on how your server responds.
Q4. How do I stop images from Google Images?
Disallow the image paths for Googlebot-Image, or serve X-Robots-Tag: noimageindex on those files.
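Both options sketched, assuming a hypothetical /images/private/ path:
User-agent: Googlebot-Image
Disallow: /images/private/

# Or, sent as an HTTP response header with each image file:
X-Robots-Tag: noimageindex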
Q5. How quickly do changes apply?
Once bots recrawl robots.txt (Google typically refreshes its cached copy within about a day), the new rules take effect. Monitor server logs and Search Console.
Final tip
Start permissive (allow all + sitemap). Then block only what’s proven to waste crawl budget (internal search, filters, duplicates). Always verify with a tester, and never block assets needed for rendering.
Tailor the rules to your site type and problem areas (e.g., filters, search pages, private paths) rather than copying a template wholesale.