I, Robots.txt

In this article, we will take a look at the Robots Exclusion Standard. It sounds like something straight out of a science fiction book, but it is really nothing more than a tool to keep web spiders and robots away from any section of your website that you don’t want indexed, or even from your entire website if you so desire. The standard goes by many names, such as the Robots Exclusion Protocol, but you have most likely heard of it as the robots.txt protocol. Whatever you call it, it is a handy tool that, when used properly, can help improve your standing with the various search engines.

The standard was created in June of 1994 to deal with robots that were accessing deep virtual trees, hammering servers with rapid-fire requests, and downloading the same files over and over again. Despite its name, the Robots Exclusion Standard is not backed by any standards body or organization. Nor is it enforced by anyone, and there is no guarantee that any present or future robot will comply with it. There is a movement involving what is known as ACAP, or Automated Content Access Protocol, that seeks to update the standard, and perhaps govern it, but that is beyond the scope of this article.

In order to stop web spiders and web robots (as opposed to the real-world kind, which there is no stopping) from accessing and indexing every inch of your website, you use a file known as robots.txt. As the filename suggests, robots.txt is a plain text file. It contains rules that tell a robot whether or not it may access certain areas of your site. Whether a robot abides by your wishes is another matter, but, as you will see in a bit, most of the big search engines presently do.

You store the file in the top-level directory of your site. If you have sub-domains, then each one will require its own robots.txt file. If you leave it out of a sub-domain, the rules in your main file will apply to yoursite(dot)com but not to, say, sample(dot)yoursite(dot)com.

Some examples of robots.txt files stored at the top level of a site:

  •   www(dot)sample(dot)com/robots.txt
  •   www(dot)devshed(dot)com/robots.txt
  •   www(dot)nerditup(dot)com/robots.txt

Examples of where you would store the robots.txt file for sub-domains:

  • www(dot)your(dot)sample(dot)com/robots.txt
  • www(dot)some(dot)sample(dot)com/robots.txt
  • www(dot)bad(dot)sample(dot)net/robots.txt
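
If you are curious how a compliant crawler actually uses these locations, here is a minimal sketch in Python using the standard library’s urllib.robotparser module. The domain is just the article’s sample domain (written with real dots so the code is valid) and NerdBot is a made-up user-agent; the point is simply that the crawler fetches robots.txt from the top-level directory and consults it before requesting pages.

from urllib import robotparser

# A compliant crawler fetches robots.txt from the top-level
# directory of the host it wants to crawl.
rp = robotparser.RobotFileParser()
rp.set_url("http://www.sample.com/robots.txt")
rp.read()  # downloads and parses the file

# Before requesting any page, it asks whether its user-agent
# may fetch that URL.
print(rp.can_fetch("NerdBot", "http://www.sample.com/images/"))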

Your First Steps

Your first step is to create a new text file. I use Notepad, but Word and OpenOffice work just as well, so long as you save the file as plain text (.txt). The robots.txt file uses two basic lines, User-agent and Disallow. User-agent names the spider or bot that you wish to grant or deny access to. Disallow lists the directory or file that you do not want the bot or spider to crawl.


If you don’t wish for any bots to index your site, you would type the following into your text file:


User-agent: *
Disallow: /

In this example, the “*” is known as a wildcard and says that the rule applies to all bots. A wildcard is a special character that could stand for anything. In typical usage, if you write d*ng, a computer can interpret this as being: “ding”, “dang”, “dong”, “dung”, “dzing” and so forth. Simply put, the “*” could be anything.

The Disallow part says that no directory or file should be scanned. It’s important to note how this works. The pattern in a Disallow line is matched as a prefix: the robot looks at the path of each URL and asks, “Does this path start with the value I was given?” For instance, let’s say our site is www(dot)sample(dot)com. If I have a directory called “images,” it would be listed as www(dot)sample(dot)com/images/.

In this instance, the bot sees that the path /images/ begins with the “/” from the Disallow line, so it will skip it, along with everything else on the site.
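
To make the matching rule concrete, here is a tiny Python sketch. The is_disallowed helper is invented purely for illustration and is not part of any library, but it captures the prefix test described above.

def is_disallowed(path, disallow_value):
    # A URL is blocked when its path starts with the Disallow value.
    return path.startswith(disallow_value)

# Disallow: / blocks everything, because every path starts with "/".
print(is_disallowed("/images/biggorillaonatricycle.jpg", "/"))  # True
print(is_disallowed("/aboutus/", "/"))                          # True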

To allow all bots to visit every file and directory, you would write this in your file:


User-agent: *
Disallow:

Again, the User-agent line uses the wildcard to say that whatever is in the Disallow line applies to all bots. Since the Disallow value is blank, there is nothing to match, and so all files and directories are available.

If we want every bot to ignore one directory, we would write:


User-agent: *
Disallow: /images/

Again, the wildcard says all bots should follow the Disallow line, which asks them to stay away from /images/. If the bots are compliant, they won’t scan this directory or the files inside it. Note again that I wrote “/images/” and not “/images”. You always want to include that final forward slash (/); without it, the rule is a prefix that can match more than you intend.
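
Here is a quick way to see why that slash matters, using Python’s urllib.robotparser to test a rule written without it (the file names are made up for the example). Because matching is done by prefix, “/images” also catches a file that merely begins with the same letters.

from urllib import robotparser

rules = [
    "User-agent: *",
    "Disallow: /images",  # note: no trailing slash
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("NerdBot", "/images/photo.jpg"))  # False: blocked, as intended
print(rp.can_fetch("NerdBot", "/images.html"))       # False: also blocked, probably not intended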

To tell all bots not to scan a specific file, we use this code:


User-agent: *
Disallow: /images/biggorillaonatricycle.jpg

Now all bots should scan everything except the biggorillaonatricycle image. When a bot finds that picture in the “images” directory, it looks away, even though, let’s face it, who wouldn’t want to see that? An important thing to note here is that if we had a second directory (named “imagestwo,” perhaps) that held some photos and included the same picture, the bots would still scan that copy unless you told them otherwise.

Here is how you could make it so that neither of the pictures of our buddy the gorilla riding his tricycle gets scanned:


User-agent: *
Disallow: /images/biggorillaonatricycle.jpg
Disallow: /imagestwo/biggorillaonatricycle.jpg

This rule applies to directories as well:


User-agent: *
Disallow: /images/
Disallow: /imagestwo/
Disallow: /aboutus/

The above tells all bots to ignore the three directories. Note that we can also mix our directories and files together:


User-agent: *
Disallow: /images/
Disallow: /imagestwo/
Disallow: /aboutus/wearereallyevil.html

So far we have focused primarily on limiting what all bots may access. Now we will work on keeping specific bots away from our files.

If we want to tell one specific bot to stay out of all of our directories, we input the following code into our robots.txt file:


User-agent: Googlebot
Disallow: /

Now Googlebot should stay away from all of our directories.
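
If you want to convince yourself that this record only touches that one bot, here is a quick check with Python’s urllib.robotparser (NerdBot again being an invented user-agent with no rules of its own).

from urllib import robotparser

rules = [
    "User-agent: Googlebot",
    "Disallow: /",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Googlebot", "/aboutus/"))  # False: Googlebot is blocked everywhere
print(rp.can_fetch("NerdBot", "/aboutus/"))    # True: no record applies to NerdBot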

We can also tell a specific bot to ignore one or more directories or files, like so:


User-agent: NerdBot
Disallow: /images/
Disallow: /secrets/globaldomination.html

And finally, if we want to specify that several bots are not allowed to access a directory or file, we can do so in this manner:


User-agent: NerdBot
Disallow: /images/
Disallow: /secrets/globaldomination.html

User-agent: FatBot
Disallow: /images/
Disallow: /secrets/globaldomination.html
Disallow: /tmp/

User-agent: HedonismBot
Disallow: /images/
Disallow: /secrets/globaldomination.html

User-agent: Bender
Disallow: /images/
Disallow: /secrets/globaldomination.html
Disallow: /cgi-bin/

Whenever you add another bot, you must put a blank line before its User-agent line; that blank line tells the parser that a new record is beginning.

There are several directives you can use that may or may not be supported by the different search engines. They are listed below:

Crawl-delay

If you want to set the number of seconds a bot should wait between successive requests to the same server, you can do so by using Crawl-delay. Here it is in action:


User-agent: *
Crawl-delay: 60

Or


User-agent: FatBot
Crawl-delay: 120

The first example makes all bots wait 60 seconds between requests. The second makes FatBot wait two minutes before sending another request.
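
If you happen to be on the other side of the fence and are writing a polite crawler, Python’s urllib.robotparser (3.6 and later) can read this value back for you. The sketch below simply parses the FatBot example from above.

from urllib import robotparser

rules = [
    "User-agent: FatBot",
    "Crawl-delay: 120",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

delay = rp.crawl_delay("FatBot")
print(delay)  # 120; crawl_delay() returns None if the directive is absent

# A polite crawler would then wait that long between requests,
# for example with time.sleep(delay).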

Using Sitemaps Auto-Discovery

This handy dandy little directive allows you to tell the bot where your list of URLs is. You can add it anywhere in your file, like so:

Sitemap: http://www(dot)sample(dot)com/sitemap(dot)xml
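
As a side note, if you are writing the crawler rather than the site, Python’s urllib.robotparser (3.8 and later) will hand the discovered sitemap URLs back to you. A small sketch, again using the article’s sample domain with real dots so it parses:

from urllib import robotparser

rules = [
    "User-agent: *",
    "Disallow: /images/",
    "Sitemap: http://www.sample.com/sitemap.xml",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# site_maps() returns every Sitemap URL found in the file, or None.
print(rp.site_maps())  # ['http://www.sample.com/sitemap.xml']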

Allow

Allow is a nifty directive that lets you specify that a bot may look at certain files within a disallowed directory. Let’s say that you have disallowed an image directory, but there is one file in that directory that you later decide you would like to have indexed. Instead of having to disallow every other file in the directory individually, you can simply do this:


User-agent: *
Disallow: /images/
Allow: /images/mefeedingorphans.jpg

Now compliant bots will stay out of your /images/ directory except for the file(s) you explicitly Allow.

Commenting

You can leave comments in your robots.txt file by preceding them with a pound (#) symbol, like so:


# Here is a comment
User-agent: * # all bots should follow the disallow
Disallow: /images/ # no bot should access the images directory

Conclusion

Well, that’s it for this article. There are still more features of robots.txt to discuss, like the robots meta tag, and more issues to cover, like NoFollow and ACAP, all of which we will address in a future article.

Till then…
