Robots.txt

There are several scenarios where we need to control the access of web robots (also called web crawlers or spiders) to our website, or to a part of it. Just as Googlebot (Google's crawler) visits our website, spam bots will visit too, often to harvest information such as e-mail addresses. Crawling also consumes a considerable amount of the site's bandwidth. A simple 'robots.txt' file lets us tell robots which parts of the site they may access. Note that well-behaved robots honor these rules, while malicious bots can simply ignore them.

Creating a robots.txt:

Open a new file in any text editor, such as Notepad, and save it as 'robots.txt' in the root directory of the website (e.g. http://www.example.com/robots.txt); robots only look for the file there.
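The file can also be generated from a script instead of a text editor. A minimal sketch in Python (the rules are the sample directives used in this article; the output path is an assumption and would be the web root in practice):

```python
from pathlib import Path

# Sample directives; in production the file must sit at the web root,
# e.g. /var/www/html/robots.txt (that path is an assumption).
rules = "User-agent: *\nDisallow: /aboutme/\n"

# Write the file and read it back to confirm the contents.
Path("robots.txt").write_text(rules)
print(Path("robots.txt").read_text())
```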

The rules in the robots.txt file are entered as 'field: value' pairs, one per line.

<field>:<value>

<field>

The directive name: User-agent, Disallow, or Allow.

<value>

The user-agent name, or the URL path to which the rule applies.

Examples:

To exclude all search engine robots from indexing the entire website:

User-agent: *
Disallow: /

To exclude all bots from a certain directory within the website:

User-agent: *
Disallow: /aboutme/

To disallow multiple directories:

User-agent: *
Disallow: /aboutme/
Disallow: /stats/

To control access to a specific document:

User-agent: *
Disallow: /myFolder/name_me.html

To disallow a specific search engine bot from indexing the website (replace Robot_Name with that bot's user-agent string, e.g. Googlebot):

User-agent: Robot_Name
Disallow: /
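A quick way to verify that the rules behave as intended is Python's standard urllib.robotparser module. The sketch below parses the multi-directory example from above and checks two URLs (www.example.com is a placeholder host):

```python
from urllib.robotparser import RobotFileParser

# The multi-directory rules from the examples above.
rules = """\
User-agent: *
Disallow: /aboutme/
Disallow: /stats/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A document inside a disallowed directory is blocked for every user-agent.
print(rp.can_fetch("*", "http://www.example.com/aboutme/index.html"))  # False
# Anything outside the listed directories remains accessible.
print(rp.can_fetch("*", "http://www.example.com/contact.html"))        # True
```

The same parser can fetch a live file with set_url() and read(), which is handy for checking a deployed robots.txt.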

Advantages of Using Robots.txt:

  • Avoids wasting server resources on unwanted crawling.
  • Saves bandwidth.
  • Removes crawler clutter from web statistics, giving cleaner analytics.
  • Allows refusing access to specific robots.