What Is a robots.txt File?


A robots.txt file is a plain text file placed at the root of your domain (e.g., https://example.com/robots.txt) that gives crawl instructions to search engine bots (Googlebot, Bingbot, etc.). It controls what can be crawled, helps conserve crawl budget, and points bots to your XML sitemap.

Important: robots.txt controls crawling, not indexing. To stop pages from appearing in search, use noindex (via a robots meta tag or an X-Robots-Tag HTTP header), and reduce internal links pointing to them so they are less likely to be discovered.
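
For reference, here is a minimal illustration of both noindex forms (the values are the standard ones; where you set the response header depends on your server, so treat the second part as a sketch):

<!-- robots meta tag in the page's <head> -->
<meta name="robots" content="noindex">

# Equivalent HTTP response header (useful for PDFs and other non-HTML files)
X-Robots-Tag: noindex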

Why robots.txt matters for SEO

  • Crawl budget optimization: Prioritize important URLs by blocking thin/duplicate/unnecessary paths.
  • Server performance: Reduce bot hits on resource-heavy or dynamic URLs.
  • Cleaner SERPs: Indirectly reduces how often low-value pages get crawled and surfaced.
  • Sitemap discovery: Help bots find your content faster.

Basic syntax & directives

User-agent: <bot-name or *>
Allow: <path>
Disallow: <path>
Sitemap: https://example.com/sitemap.xml
# Optional (ignored by Google; some other engines honor it):
# Crawl-delay: <seconds>

A bot obeys the most specific User-agent group that matches it. Within that group, rule order doesn't matter: the longest (most specific) matching path wins, and for Google an Allow beats a Disallow of equal length.
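
As a quick illustration of that matching (the paths here are made up): * matches any sequence of characters, $ anchors the end of a URL, and a longer Allow carves an exception out of a broader Disallow.

User-agent: *
Disallow: /private/            # blocks everything under /private/
Allow: /private/press-kit/     # longer match, so this subfolder stays crawlable
Disallow: /*.pdf$              # blocks any URL ending in .pdf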

Quick examples

1) Minimal, SEO-safe starter

User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml

Allows everything to be crawled and advertises your sitemap.

2) Block admin area but allow needed AJAX (WordPress)

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml

3) eCommerce (block faceted/filter parameters)

User-agent: *
Disallow: /*?color=
Disallow: /*?size=
Disallow: /*?sort=
Disallow: /cart/
Disallow: /checkout/
Allow: /products/
Sitemap: https://example.com/sitemap.xml

4) SaaS app (separate marketing vs. app)

User-agent: *
Disallow: /app/
Disallow: /api/
Allow: /
Sitemap: https://example.com/sitemap.xml

5) Staging or pre-launch (crawling blocked)

User-agent: *
Disallow: /

For true privacy, also use HTTP auth; robots.txt isn’t a security feature.
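
If you need that layer, one common approach is HTTP Basic Auth at the web server. A minimal nginx sketch (the hostname, credentials file, and upstream port are placeholders, and your stack may differ):

server {
    listen 80;
    server_name staging.example.com;            # placeholder staging host

    auth_basic           "Staging";             # turn on HTTP Basic Auth
    auth_basic_user_file /etc/nginx/.htpasswd;  # file created with htpasswd

    location / {
        proxy_pass http://127.0.0.1:3000;       # placeholder app upstream
    }
}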

6) Multilingual subfolders

User-agent: *
Disallow: /search/
Allow: /
Sitemap: https://example.com/sitemap_en.xml
Sitemap: https://example.com/sitemap_es.xml
Sitemap: https://example.com/sitemap_fr.xml

What NOT to do with robots.txt

  • Don’t block pages you want to deindex. If a page is disallowed, Google can’t crawl it and never sees its noindex. Keep it crawlable with noindex until it drops out of the index, then block it if needed.
  • Don’t rely on it for security. Blocked URLs can still be accessed directly and may still appear in results as bare URLs without a description.
  • Don’t blanket-block resources (CSS/JS). Google needs them to render pages properly, and blocking them can hurt rankings (see the snippet after this list).
  • Don’t place it anywhere but root. Must be at example.com/robots.txt (and separately for subdomains).
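
If you must disallow a directory that also holds rendering assets, a hedged pattern (WordPress paths used purely as an example) is to carve out explicit Allow rules for the asset subfolders, since the longer rule wins:

User-agent: *
Disallow: /wp-includes/
Allow: /wp-includes/js/     # longer match, so scripts stay crawlable
Allow: /wp-includes/css/    # same for stylesheets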

How robots.txt interacts with indexing

  • Crawl allowed + indexable → page can appear in search.
  • Crawl disallowed → Google may still index the URL (without content) if linked externally.
  • To truly prevent indexing → allow crawling, then add noindex on the page or via HTTP header; or password-protect/remove it.

Testing & validation

  • Use your search engine’s robots testing tool (e.g., in Search Console).
  • Test specific user-agents (e.g., Googlebot, Googlebot-Image).
  • After changes, fetch again and monitor Crawl Stats and Coverage reports.
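
Beyond those tools, you can sanity-check rules locally. Here is a minimal sketch using Python’s standard-library urllib.robotparser (the URLs and user-agent strings below are placeholders; note this parser follows the original spec and doesn’t understand Google’s wildcard extensions, so treat it as a rough check):

from urllib.robotparser import RobotFileParser

# Fetch and parse the live robots.txt (or point set_url at a staging copy)
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Check whether specific bots may fetch specific URLs
checks = [
    ("Googlebot", "https://example.com/products/blue-shirt"),
    ("Googlebot", "https://example.com/cart/"),
    ("Googlebot-Image", "https://example.com/images/hero.jpg"),
]
for agent, url in checks:
    verdict = "allowed" if rp.can_fetch(agent, url) else "blocked"
    print(f"{agent}: {url} -> {verdict}")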

Best practices checklist

  • The file sits at the domain root (https://example.com/robots.txt) and returns HTTP 200.
  • Include Sitemap directive(s).
  • Don’t block JS/CSS needed for rendering.
  • Disallow thin/duplicate: faceted parameters, internal search, session IDs.
  • Keep rules simple and specific; avoid overlapping patterns.
  • Version-control the file; document reasons for each rule.
  • Re-test after each deployment.

Ready-to-use templates

WordPress (general)

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /?s=
Disallow: /search/
Sitemap: https://example.com/sitemap.xml

WooCommerce / eCommerce

User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /*?orderby=
Disallow: /*?price=
Disallow: /*?rating=
Allow: /product/
Sitemap: https://example.com/sitemap.xml

Blog / News

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /tag/*?*
Disallow: /category/*?*
Disallow: /?s=
Sitemap: https://example.com/sitemap_index.xml

Headless / SPA with app subpath

User-agent: *
Disallow: /app/
Disallow: /api/
Allow: /
Sitemap: https://example.com/sitemap.xml

Staging (block all)

User-agent: *
Disallow: /

(Also use password protection.)

FAQs

Q1. Where do I put robots.txt?
At the root: https://yourdomain.com/robots.txt. Subdomains need their own file.

Q2. Can I target specific bots?
Yes. Use their user-agent name:

User-agent: Googlebot
Disallow: /experimental/

Q3. Does Crawl-delay work?
Some engines (e.g., Bing, Yandex) honor it; Google ignores it. If Googlebot is overloading your server, temporarily return 503/429 responses or trim the number of crawlable URLs rather than relying on Crawl-delay.

Q4. How do I stop images from Google Images?
Disallow the image paths for Googlebot-Image or use X-Robots-Tag: noimageindex on those files.
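
For example (the image path is a placeholder):

User-agent: Googlebot-Image
Disallow: /images/private/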

Q5. How quickly do changes apply?
Once bots recrawl robots.txt (often very fast), rules take effect. Monitor logs and Search Console.

Final tip

Start permissive (allow all + sitemap). Then block only what’s proven to waste crawl budget (internal search, filters, duplicates). Always verify with a tester, and never block assets needed for rendering.

If you share your site type and problem areas (e.g., filters, search pages, private paths), I’ll tailor a robots.txt optimized for you.
