Creating a robots instruction document - robots.txt
September 24, 2006 – 9:52 pm
In an earlier post I mentioned that, by using a robots directive file (robots.txt), you can control whether a document can be crawled by a search engine or not.
Generally, if there is no robots.txt file in the root of your website, or the file contains no instructions, crawlers interpret this to mean that there is no restriction on crawling documents within that site.
On the other hand, you may want to hide some documents from crawlers.
The following example shows how you can block a document, private-data.html, from being crawled by bots.
To do this, write the robots instruction as below and save it in the robots.txt file.
User-Agent: *
Disallow: /private-data.html
Look at the first line: it says which bots should follow these instructions. The * indicates that the rule applies to all bots. You can replace it with Googlebot to target the crawler used by Google alone. If you know a bot's user agent, you can write such instructions for any bot.
The second line says not to crawl the document private-data.html, which is found in the root of the website.
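To see how the user-agent line changes who is affected, here is a minimal sketch using Python's standard urllib.robotparser module. The helper name is_allowed, the bot name OtherBot, and the www.example.com URLs are assumptions made up for illustration; real crawlers may interpret rules with their own parsers.

```python
from urllib.robotparser import RobotFileParser

# Rules for every bot (the example from this post).
RULES_ALL = """\
User-Agent: *
Disallow: /private-data.html
"""

# Same rule, but addressed to Googlebot only.
RULES_GOOGLE_ONLY = """\
User-Agent: Googlebot
Disallow: /private-data.html
"""

def is_allowed(rules_text, user_agent, url):
    """Hypothetical helper: parse the rules and ask whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)

PRIVATE = "https://www.example.com/private-data.html"  # assumed example URL

print(is_allowed(RULES_ALL, "Googlebot", PRIVATE))          # False
print(is_allowed(RULES_ALL, "OtherBot", PRIVATE))           # False
print(is_allowed(RULES_GOOGLE_ONLY, "Googlebot", PRIVATE))  # False
print(is_allowed(RULES_GOOGLE_ONLY, "OtherBot", PRIVATE))   # True
```

With the * rule both bots are kept away from the document; with the Googlebot rule only Googlebot is.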
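If you want to hide all documents within a directory from crawlers, you can write the second line as
Disallow: /directory/
The trailing slash matters. Rules are matched as path prefixes, so Disallow: /directory (without the slash) would also block any other path that begins with /directory, such as /directory-listing.html.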
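Disallow values are matched as path prefixes, which is why a trailing slash changes what a directory rule covers. The sketch below, again using Python's standard urllib.robotparser, compares the two spellings. The helper blocked_paths and the sample paths are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

def blocked_paths(disallow_value, paths):
    """Hypothetical helper: return the paths a single Disallow rule would block for all bots."""
    parser = RobotFileParser()
    parser.parse(["User-Agent: *", f"Disallow: {disallow_value}"])
    return [p for p in paths if not parser.can_fetch("*", p)]

# Assumed sample paths on the site.
PATHS = ["/directory/page.html", "/directory-listing.html", "/index.html"]

print(blocked_paths("/directory", PATHS))
# ['/directory/page.html', '/directory-listing.html']

print(blocked_paths("/directory/", PATHS))
# ['/directory/page.html']
```

Without the slash, the rule also catches /directory-listing.html, which lives outside the directory.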
So how do these instructions take effect?
Crawler bots of major search engines are programmed to check the instructions in the robots.txt file before accessing any document on the site during a crawl session. The file acts as a set of rules for the crawlers, and well-behaved bots carry on the crawl according to the instructions written in it.
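The check-before-fetch behaviour described above can be sketched roughly as follows. This is not how any particular search engine implements it; crawl_session, the bot name ExampleBot, and the www.example.com URLs are all assumptions for illustration, and the fetch step is passed in as a plain function so that nothing is downloaded.

```python
from urllib.robotparser import RobotFileParser

def crawl_session(robots_lines, user_agent, urls, fetch):
    """Hypothetical crawl loop: consult the rules before fetching each URL."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # the rules are read once, at the start of the session
    fetched = []
    for url in urls:
        if parser.can_fetch(user_agent, url):
            fetch(url)          # only allowed documents are requested
            fetched.append(url)
    return fetched

SITE = "https://www.example.com"  # assumed example site
URLS = [SITE + "/index.html", SITE + "/private-data.html"]
RULES = ["User-Agent: *", "Disallow: /private-data.html"]

visited = []
crawl_session(RULES, "ExampleBot", URLS, visited.append)
print(visited)  # ['https://www.example.com/index.html']

# An empty robots.txt places no restriction, so both URLs are fetched.
print(crawl_session([], "ExampleBot", URLS, lambda url: None))
```

Against a live site, the same parser can download the file itself through its set_url() and read() methods instead of being handed the lines directly.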