Quite a number of articles have been written as robots.txt primers, and all of them explain the basics of the robots exclusion protocol. Recently, while working on removing some pages from Google’s archives, I browsed through Google’s Webmaster Central Blog (hosted on Blogspot) and saw some posts by Dan Crow and Vanessa Fox that explain in detail how Googlebot works.
Apart from explaining the robots exclusion protocol in detail, Google has new tools that allow the removal of cached pages through the Webmaster Dashboard — we will only cover that briefly here, since I go into detail about it in a different article. This article looks at the robots exclusion protocol specifically as it applies to Googlebot, quoting Dan Crow, a Google product manager. Google’s bot is remarkably polite when it indexes pages; we will contrast its behavior with that of some malicious scraper bots.
Googlebot has several quirks, as all bots do, and we will look at a few of them before discussing the basics of search engine bots. For example, if your web site is down temporarily and you want Googlebot to come back later, you can return an HTTP 503 status code to tell the bot (and your users) that the site is temporarily unavailable. Without it, Googlebot will probably index your "this website is down for maintenance" page. You can find more information on the HTTP 503 status code at askapache.com.
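As a rough sketch of that idea — the message and port below are placeholders, and I am using Python’s standard-library http.server rather than any particular setup — a maintenance responder could look like this:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def maintenance_response(retry_after_seconds=3600):
    """Status code and headers for a temporary-outage reply."""
    return 503, {"Retry-After": str(retry_after_seconds)}

class MaintenanceHandler(BaseHTTPRequestHandler):
    """Answers every request with 503 so a polite crawler retries later
    instead of indexing the "down for maintenance" page."""
    def do_GET(self):
        status, headers = maintenance_response()
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<h1>Down for maintenance</h1>")

# To serve it: HTTPServer(("", 8000), MaintenanceHandler).serve_forever()
```

The Retry-After header is the part that matters: it hints to the crawler when it is worth coming back.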
Also note that if Googlebot is crawling your site too frequently (and hence eating all your bandwidth), you can contact Google Support; they should work with you to ensure the bot doesn’t overload your servers. According to Vanessa Fox, there will probably be a tool that lets you adjust Googlebot’s crawl rate on your site.
Googlebot is Google’s primary agent for crawling and indexing pages on the web — which is incredibly large, truly living up to the name World Wide Web. As Dan Crow puts it, it’s "really, really big." And not everyone on the public web wants every page crawled: some pages contain client information or inflammatory material, and some site owners don’t mind the crawling but, for whatever reason, don’t want their pages cached in Google’s database.
You need to be able to control what gets seen and what does not get seen on your web site. Some pages on your site will contain sensitive information or content that visitors must pay for before viewing. There may also be pieces of personal information that you simply don’t want archived by the search engines. The most common way to handle this is the robots exclusion protocol. The basic form of the robots.txt file for Googlebot is this:
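For example — the directory name here is only a placeholder — a minimal robots.txt that keeps Googlebot out of one directory is just two lines:

```
User-agent: Googlebot
Disallow: /private/
```

Using `Disallow: /` instead would block Googlebot from the entire site.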
Apart from this form of the exclusion protocol (save the above lines in a plain-text editor such as Notepad as robots.txt and upload the file to your root directory), you can add a meta tag disallowing a certain bot from indexing a certain page:
<meta name="googlebot" content="noindex">
This covers the basics. Now we will delve into some details of the robots exclusion protocol. Note that we are dealing specifically with Googlebot; for a list of bots from other search engines you can go to http://www.robotstxt.org/, but this article simply deals with the robots exclusion protocol as explained over at the Google Webmaster Blog (hopefully in a much simpler manner).
Why bother with another robots exclusion protocol article? Because this one, focused on Googlebot and drawing on those at the Googleplex, should clear up a few interesting questions — such as the problems webmasters have with "conflicting values" — and explain how exactly search engines (especially Google) handle meta tags.
According to Vanessa Fox, if you stuff your meta tags with conflicting values, such as both INDEX and NOINDEX, Google always follows the most restrictive value.
<META NAME="ROBOTS" CONTENT="NOINDEX">
<META NAME="ROBOTS" CONTENT="INDEX">
In the above, the "noindex" value (the most restrictive) is followed. The same is true if the meta tag is written as
<META NAME="ROBOTS" CONTENT="NOINDEX" CONTENT="INDEX">
Google still follows the most restrictive value. Vanessa Fox also mentioned that Google "recommends" placing all content values in one meta tag; this makes it easier for Googlebot to read the values and reduces the "chances of conflict." However, whether your content values sit in one meta tag or two, Google aggregates and reads them the same way.
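Following that recommendation, two separate tags — say NOINDEX and NOFOLLOW, used here purely as an example — collapse into a single tag with comma-separated values:

```html
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
```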
If the meta tag and the robots.txt clash — as when you don’t exclude a file in robots.txt but then exclude it with a meta tag inside the file — Google still follows the most restrictive value, which in this case is the meta tag. (Note that the reverse clash cannot really arise: once a file is blocked in robots.txt, Googlebot never crawls it, so it never sees the meta tag at all.) Some valid meta tag values are:
- NOINDEX - Prevents a file in a website from being indexed.
- NOFOLLOW - Prevents Googlebot from following any links on the page. (Note that this is different from the link-level rel="nofollow" attribute, which prevents Googlebot from following an individual link.)
- NOARCHIVE - Prevents the web page from being cached.
- NOSNIPPET - Prevents any description from appearing below the page listing in the SERPs; also prevents the page from being cached.
- NOODP - Prevents the Open Directory Project description of the page from being used in the description that appears below the page listing in the search results.
- NONE - Equivalent to "NOINDEX, NOFOLLOW."
Note that the "NONE" content value is shorthand for "NOINDEX, NOFOLLOW": if it is included in your meta tags, your page won’t be indexed and none of its links will be followed.
<META NAME="ROBOTS" CONTENT="NONE">
You pretty much exclude ALL the bots when you put the "none" value in your meta tags. So much for exclusion protocols; now let’s see whether there are ways to get Googlebot to come over.
If you watch the top bots from Google, Yahoo and MSN, and then from the alternative search engines Ask and Snap, you will discover that the ones you see least in your server logs are Ask’s and Snap’s. In fact, Ask’s bot is notoriously hard to attract if you run an obscure site. Googlebot, by contrast, is pretty much "all over the place," and the same is true for MSN’s and Yahoo’s.
ODP listings or using AdSense on your site will bring Googlebot over. Google will almost always index your site; maybe I am a bit relaxed about this because I have never had trouble getting my pages indexed. But if all else fails, put a line of AdSense code on your page, or create a blog on Blogspot and link to your web page; the robot will follow.
Robots are created by humans, so a robot simply does what its programmer wants it to do. Some human beings are less ethical than others, and they write impolite scraper bots: programs that crawl the hypertext structure of the web looking for security flaws in order to reach sensitive files.
I have long been fascinated with protecting web sites against malicious bots, especially sites where access to the content is restricted to members. Someday, when I have perfected a good system for keeping scraper bots away, I will write a piece on it. One good way to protect your files is to put them in a directory that requires a username and password (and sets a cookie on the user’s PC) on every request. Another way to protect super-sensitive information is a directory whose password changes dynamically with each request (apart from your own fixed admin password).
Make sure you track your users’ behavior, and if you notice a malicious bot, you can list it on http://www.robotstxt.org/ or check whether you can file a complaint about a malicious program attacking your site (don’t forget to note the host and the IP address!). If you don’t keep an eye on your server logs, you may never notice that you have been crawled by a malicious bot.
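As a rough illustration of keeping an eye on those logs — the log format and threshold here are my assumptions, not anything from Google’s posts — a small script can flag IPs that request pages far faster than any human reader would:

```python
import re
from collections import Counter

# Matches the leading IP of a common-log-format line, e.g.
# 203.0.113.9 - - [10/Oct/2000:13:55:36] "GET /members/ HTTP/1.0" 200 2326
IP_RE = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3}) ")

def suspicious_ips(log_lines, threshold=100):
    """Return IPs appearing more than `threshold` times -- a crude signal
    that one client is hammering the server rather than browsing it."""
    hits = Counter()
    for line in log_lines:
        m = IP_RE.match(line)
        if m:
            hits[m.group(1)] += 1
    return [ip for ip, count in hits.items() if count > threshold]
```

A real monitor would also weigh the time window and the user-agent string, but even this much is enough to surface the worst offenders for a complaint.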
Note that you don’t need to go to such lengths to protect your files from the search engine bots. They are extremely polite and will back off at the first sign of a restriction.