Polite Bots
If you've ever wondered how to get a little better control over what parts of your web site get crawled by the search engines, how they crawl your pages, and how to encourage them to visit, keep reading. This article will explain the various protocols that the search engine robots (particularly Google's) follow. It will also touch upon ways to help you guard against scraper bots.
Plenty of articles already serve as a robots.txt primer, and all of them explain the basics of the robots exclusion protocol. Recently, while working on removing some pages from Google's archives, I browsed Google's Webmaster Central Blog on Blogspot and found posts by Dan Crow and Vanessa Fox that explain in detail how Googlebot works.
Besides explaining the robots exclusion protocol in detail, Google now offers tools in the Webmaster Dashboard that let you remove cached pages. We will cover those only briefly here, since I go into detail about them in a different article. This article looks at the robots.txt rules as they apply to Googlebot, quoting Dan Crow, a Google product manager. Googlebot is very polite when it indexes pages; we will compare its behavior to that of some malicious scraper bots.
Googlebot has its quirks, as all bots do. We will look at a few of them before we discuss the basics of search engine bots. For example, if your web site is down temporarily and you want Googlebot to come back later, you can return an HTTP 503 (Service Unavailable) status code to tell the bot (and your users) that the site is temporarily unavailable. Without that status code, Googlebot will probably index your "this website is down for maintenance" page. You can get more information on the HTTP 503 status code at askapache.com.
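To make the 503 idea concrete, here is a minimal sketch in Python using only the standard library. It is not how any particular site or Google recommends doing it. The handler name, the message text, and the one-hour Retry-After value are all assumptions chosen for illustration. Retry-After is a standard HTTP header, but whether a given crawler honors it is up to that crawler.

```python
from http.server import BaseHTTPRequestHandler

# Assumed values for illustration only; pick ones that match your outage.
RETRY_AFTER_SECONDS = 3600
BODY = b"This web site is down for maintenance. Please try again later.\n"


def maintenance_response():
    """Return (status, headers, body) for a temporary-outage reply."""
    headers = {
        "Retry-After": str(RETRY_AFTER_SECONDS),
        "Content-Type": "text/plain; charset=utf-8",
        "Content-Length": str(len(BODY)),
    }
    # 503 marks the outage as temporary, unlike a 200 maintenance page.
    return 503, headers, BODY


class MaintenanceHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET or HEAD with 503."""

    def _send(self, include_body):
        status, headers, body = maintenance_response()
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        if include_body:
            self.wfile.write(body)

    def do_GET(self):
        self._send(include_body=True)

    def do_HEAD(self):
        self._send(include_body=False)
```

To try it locally, you could pass MaintenanceHandler to http.server.HTTPServer on a spare port and call serve_forever(). On a production site the same status code would more likely be set in the web server's own configuration.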
Also note that if Googlebot is crawling your site too frequently (and hence eating up your bandwidth), you can contact Google Support; they should work with you to ensure that the bot doesn't overload your servers. According to Vanessa Fox, there will probably be a tool that lets you adjust Googlebot's crawl rate on your site.
Googlebot is Google's primary agent for crawling and indexing pages on the web. The web itself is incredibly large, truly living up to the name World Wide Web. As Dan Crow puts it, it's "really, really big." And not everyone on the public web wants particular pages crawled. Some pages contain client information or inflammatory material. Some site owners don't mind the crawling but don't want their pages cached in Google's database for whatever reason.
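As a rough illustration of how a polite bot respects a site owner's wishes, the sketch below uses Python's standard urllib.robotparser module to check a URL against a robots.txt file before fetching it. The robots.txt content, the /clients/ path, and the example.com URLs are invented for this example, and Googlebot's own parser may differ in its details.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, made up for this example: it asks every
# bot to stay out of the /clients/ directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /clients/
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite bot asks before it fetches.
print(is_allowed("Googlebot", "https://www.example.com/index.html"))         # True
print(is_allowed("Googlebot", "https://www.example.com/clients/acme.html"))  # False
```

A scraper bot simply skips this check, which is why robots.txt works only as a request to well-behaved crawlers and not as access control.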
By Akinola Akintomide