There are several reasons and scenarios in which we need to control the access that web robots (also called web crawlers or spiders) have to our website, or to a part of it. Just as Googlebot (Google's spider) visits our website, spam bots will visit too, and they usually come to collect information, such as e-mail addresses, from our pages. Every robot that crawls the site also uses a considerable amount of the website's bandwidth. It is easy to control robots by disallowing their access to the website, or part of it, through a simple robots.txt file.
Creating a robots.txt:
Open a new file in any text editor, such as Notepad, and save it as robots.txt in the root directory of the website (for example, http://example.com/robots.txt), since that is the only location robots check.
The rules in a robots.txt file are entered as field: value pairs, one pair per line:
<field>:<value>
<field>
The name of the directive. A record starts with a User-agent field naming the robot it applies to, followed by one or more Disallow fields (most major crawlers also understand Allow) that control access to particular URLs.
<value>
The value the field applies to: the robot's user-agent name for User-agent, or the URL path whose access is being controlled for Disallow/Allow.
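For example, a complete record places each field: value pair on its own line; the /private/ directory here is only an illustration:
User-agent: *
Disallow: /private/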
Examples:
To exclude all search engine robots from indexing our entire website:
User-agent: *
Disallow: /
To exclude all bots from a certain directory within our website:
User-agent: *
Disallow: /aboutme/
To disallow multiple directories:
User-agent: *
Disallow: /aboutme/
Disallow: /stats/
To control access to a specific document:
User-agent: *
Disallow: /myFolder/name_me.html
To disallow a specific search engine bot from indexing our website, replace Robot_Name with that bot's user-agent name (for example, Googlebot):
User-agent: Robot_Name
Disallow: /
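The rules can also be checked programmatically before the file is uploaded. The following is a minimal sketch using Python's standard urllib.robotparser module; the rules, the example.com URLs and the Googlebot user-agent string are only illustrative assumptions, not something robots.txt itself requires.
from urllib.robotparser import RobotFileParser

# Illustrative rules: block every robot from /aboutme/ and /stats/
rules = [
    "User-agent: *",
    "Disallow: /aboutme/",
    "Disallow: /stats/",
]

parser = RobotFileParser()
parser.parse(rules)  # parse() accepts the file's contents as a list of lines

# can_fetch(user_agent, url) reports whether that agent may crawl the URL
print(parser.can_fetch("Googlebot", "http://example.com/aboutme/bio.html"))  # False
print(parser.can_fetch("Googlebot", "http://example.com/index.html"))        # True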
Advantages of Using Robots.txt:
- Avoids wasting server resources on robots we do not want crawling the site.
- Saves bandwidth.
- Removes crawler traffic from web statistics, giving cleaner, less cluttered analytics.
- Lets us refuse access to specific robots.