Advanced Use of Robots.txt - MSNbot, Slurp, Googlebot, IA
(Page 2 of 5)
MSNbot
MSN’s search engine robot is called MSNbot. MSNbot has quite a voracious appetite for spidering websites. Some webmasters love it and try to feed it as much as possible. Other webmasters don't see any reason to use up bandwidth for a search engine that doesn't bring them traffic. Either way, MSNbot does not seem to spider your website unless you have a robots.txt file. Once it finds your robots.txt, it wanders the site, almost timidly at first. Then MSNbot builds up courage and indexes files rapidly, so rapidly that the crawl-delay directive is recommended for this robot. I’ll cover this more later.
Recent events could be the cause of this reliance on the robots.txt. Several months ago, MSN received many complaints that MSNbot was ignoring directives written into robots.txt files, for example by crawling directories it had been instructed to stay out of. Engineers looked into the problem, and I believe they changed a few things to help control this type of behavior.
In the process, they may have instructed MSNbot to follow the robots.txt to the letter. For websites that didn’t have one, it probably got confused and just left, not having a letter of the law to go by. This is speculation on my part, but the spidering behavior of the robot seems to fit this assessment.
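To make the crawl-delay idea concrete, here is a minimal sketch that uses Python's standard urllib.robotparser module to read a small robots.txt and report what it tells MSNbot. The file contents, the 10-second delay, the /private/ directory, and the example.com URLs are all assumptions chosen for illustration, not recommended values. Keep in mind that crawl-delay is an extension to the original robots.txt conventions, so each robot decides for itself whether to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, made up for this illustration.
ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() accepts the file as a list of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())

# The delay (in seconds) requested of MSNbot, or None if no delay is set.
print("msnbot crawl delay:", parser.crawl_delay("msnbot"))

# Robots without their own Crawl-delay line fall back to the "*" record.
print("Googlebot crawl delay:", parser.crawl_delay("Googlebot"))

# Check whether a well-behaved MSNbot may fetch two assumed URLs.
for url in ("http://www.example.com/index.html",
            "http://www.example.com/private/report.html"):
    print(url, "allowed for msnbot:", parser.can_fetch("msnbot", url))
```

A check like this only shows what the file asks for. Whether a given robot actually slows down or stays out of a directory is something you can confirm only in your server logs.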
Yahoo’s Inktomi Slurp
Yahoo adopted Inktomi’s search engine crawler, which is now known as Slurp. Slurp seems to gobble greedily for a couple of days, disappear, come back, gobble more, and disappear again. Without a robots.txt, however, it crawls fairly slowly until it just kind of fades away, unless it finds great, unique content. Even then, without a robots.txt it may not crawl very deeply into your website.
Googlebot
On its website, Google instructs webmasters on the use of the robots.txt file and recommends that you use one. SEOs know that Google’s “guidelines” for webmasters are actually more like step-by-step directions on how to optimize for the search engine. So if Google makes mention of the robots.txt, I would definitely follow those recommendations to a T.
Google will crawl a site sporadically whether or not it has a robots.txt, but it will heed the instructions in the file if it is there. Without a robots.txt file, Googlebot has been known to crawl only one or two levels deep.
IA_Archiver
Alexa’s search engine robot is called ia_archiver. It is an aggressive spider with a big appetite; however, it is also very polite. It tends to limit its crawls to a couple hundred pages at a time, without using excessive bandwidth, and it crawls slowly enough not to overload the server. It continues its crawl over a couple of days and then comes back fairly consistently after that, so consistently that by analyzing your web stats you can almost predict when ia_archiver will perform its next crawl. Alexa’s ia_archiver obeys the robots.txt commands and directives.
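If you would rather not eyeball your stats by hand, the prediction described above can be roughly sketched in Python. This is only an illustration under several assumptions: the log is in the common Apache-style "combined" format, the robot identifies itself with "ia_archiver" in its user-agent string, and hits more than 12 hours apart belong to separate visits. The function names and the 12-hour gap are made up for this example.

```python
import re
from datetime import datetime, timedelta

# Assumes the "combined" log format: timestamp in [brackets], user agent
# as the last quoted field. Adjust the pattern for other log formats.
LOG_PATTERN = re.compile(r'\[(?P<time>[^\]]+)\].*"(?P<agent>[^"]*)"\s*$')


def crawl_start_times(log_lines, bot_name="ia_archiver",
                      gap=timedelta(hours=12)):
    """Return the start time of each visit by the named robot.

    Hits separated by more than `gap` are treated as separate visits.
    """
    hits = []
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if match and bot_name.lower() in match.group("agent").lower():
            hits.append(datetime.strptime(match.group("time"),
                                          "%d/%b/%Y:%H:%M:%S %z"))
    hits.sort()

    starts = []
    previous = None
    for hit in hits:
        if previous is None or hit - previous > gap:
            starts.append(hit)
        previous = hit
    return starts


def predict_next_crawl(starts):
    """Estimate the next visit: last start plus the mean gap between starts."""
    if len(starts) < 2:
        return None  # not enough history to estimate an interval
    mean_interval = (starts[-1] - starts[0]) / (len(starts) - 1)
    return starts[-1] + mean_interval
```

To try it, open your access log, pass the file object to crawl_start_times(), and hand the result to predict_next_crawl(). The estimate is a simple average, so it is only as good as the robot's consistency.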
There are many other spiders and robots that exhibit particular behaviors when crawling your site. The good ones will follow the robots.txt directives, and many of the bad ones will not. Later, I’ll show you a few ways to help prevent some problems you might encounter from search engine robots, and how to utilize your robots.txt to help.
More By Jennifer Sullivan Cassidy