Write a Robots.txt File

One of the most fundamental steps in optimizing a website is writing a robots.txt file. It tells spiders which parts of your site are public and useful for the search engine indexes and which are not. Keep in mind that not all search spiders will follow the instructions you leave in the robots.txt file, and that a poorly written robots.txt file can stop the spiders from crawling and indexing your website properly. In this article I will show you how to make sure everything works correctly.

While many other SEOs will tell you that a robots.txt file will not improve your rankings, I disagree: for the robots to index your site properly, they need instructions on which folders or files not to crawl or index, as well as which ones you do want indexed.

Another good reason to use a robots.txt file is that many of the search engines explicitly tell webmasters to use one. Below is a quote taken from Google:

Make use of the robots.txt file on your web server. This file tells crawlers which directories can or cannot be crawled. Make sure it’s current for your site so that you don’t accidentally block the Googlebot crawler.

Even though others feel this is of no use unless you are blocking content, keep this in mind: when a search engine goes out of its way (and this is the tightest-lipped search engine ever) to tell us to use something, it is usually to one's advantage to follow the little clues we are offered.

Also, if you read the stats file on your web hosting server, you will usually find the URL of your robots.txt file being requested. If a search bot asks for robots.txt and does not find it on your server, the spider often just leaves.

I am including a screenshot from my own web hosting stats. As you can see below, the robots.txt file is #14 among the top URLs requested on my site. Keep in mind that no human visitor is looking at that file, yet it ranks higher than a lot of the pages humans actually visit. If the bots want that file that much, it is something everyone should be using.


Side Note:

The #2 requested URL is sitemap.xml. That is not my visitors' sitemap, but rather the XML file for Google's Sitemaps program. As you can see, Google is downloading this file almost daily.

How do you build a robots.txt file for your website? I am glad you asked. One thing you do not want to do is use an HTML editor to build this file. The easiest way to create it is with a plain text editor like Notepad. After opening Notepad (or another text editor), save the blank file as robots.txt. Once it is complete, the file will be placed at the root level of your web server; in other words, in the same folder as your index page.

Now I will cover several different methods of using a robots.txt file efficiently to direct the robots to crawl the correct directories and avoid others.

First we will discuss how the information is formatted. The text file is actually a list of records, and each record consists of two fields, or lines of instruction.

The first line is the User-agent line. This is where you specify which search spider bots the record applies to. The second line is the directive line, or Disallow field. This is the line you will use to block folders or files from spiders.
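In skeleton form, a record looks like this (the bracketed parts are placeholders, not literal text):

User-agent: [spider name, or * for all spiders]
Disallow: [folder or file that spider should skip]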

Of particular note: if the publishers of Perfect 10 magazine (an online porn magazine suing Google for linking to their images) had used the robots.txt file, they could have stopped search spiders from indexing their images. Is it Google's fault the magazine hired incompetent IT staff? I don't think so. To me it's another adult webmaster looking for more free publicity.

To write the robots.txt file, you would start by addressing specific search engines. The User-agent line would start as:

User-agent:

Adding a specific search engine's spider name here gives that spider notice that it is to follow the next line for instruction, e.g.:

User-agent: googlebot

This tells googlebot to follow the next line's directions on how to proceed through your website, or to leave altogether. You may also use an asterisk (*) as a wildcard to address all search spiders.
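For example, this single line addresses every spider that visits your site:

User-agent: *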

The second line, known as the directive, is written as:

Disallow:

If you add a folder after the Disallow statement, the search spider should skip that folder for indexing purposes and move on to others where there is no restriction.

Disallow: /images/

This is a special example, just for Perfect 10. This one-minute bit of instruction could have saved a ton in wasted legal fees on a frivolous lawsuit. As this is a basic step in building websites, it is incumbent on website owners to protect their intellectual property; it is not a third-party search engine's duty.

You can also disallow specific files this way (note that the path starts with a forward slash, relative to the root directory):

Disallow: /cheeseyporn.htm

One way I recommend using this all the time is to keep robots out of your cgi-bin directory:

Disallow: /cgi-bin/

If you leave the Disallow directive line blank, this indicates that ALL files may be retrieved and indexed by the specified robot(s). The following would let all robots index all files:

User-agent: *
Disallow:

And vice versa, you can just as easily keep all robots out:

User-agent: *
Disallow: /

In the example above, the single forward slash (/) stands for your root directory. Since the root directory is blocked, none of your other folders or files can be crawled or indexed. Your site will be removed from the search engines once they read your robots.txt and update their indexes.

You can provide multiple Disallow lines under one User-agent. In the following example, all spiders are told not to index the cgi-bin and images directories.

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/

We can also use the robots.txt file to help protect search engine rankings on sites built with dynamic pages, such as PHP. Googlebot may have problems crawling such pages if the URLs carry too many variables and session IDs.

A URL with session IDs will look similar to the below:

http://www.yourcoolsite.com/cat.php?par=887&show=subcats&sessid=0431Tr

If your website is written in PHP and you convert the pages into HTML for googlebot to index, the robot will still try to index the PHP pages. After copying the pages from PHP to HTML, place each set of pages in its own folder, titled something easy for you to remember. For example, place all the PHP pages into a folder named "php." This lets you leave the HTML pages under the root directory, which is easily indexed by the spiders.
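As a sketch, the reorganized site might look something like this (the file names here are hypothetical; only the "php" folder name matters for the next step):

/index.html        <- HTML copy at the root, open to spiders
/about.html
/php/index.php     <- original PHP version, blocked in the next step
/php/about.php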

Then using what you have learned so far, implement the following in your robots.txt file:

User-agent: googlebot
Disallow: /php/

Now we have kept googlebot out of the PHP pages, which the bot usually has problems crawling. This leaves the spider free to crawl the friendlier HTML pages, and it will not see your content duplicated between the PHP and HTML versions of the site. If the pages are cleanly coded, this will often result in improved rankings in all three of the major search engines.
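Putting the pieces together, a complete robots.txt for the site in these examples might look like the following. Note that a spider obeys only the most specific User-agent record that matches it, so googlebot's record must repeat the shared rules alongside its own:

User-agent: googlebot
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /php/

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/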

You can also use comments in your robots.txt file, but you need to be careful about where you place them.

Disallow: /images/ #comment send googlebot away

We could run into a problem if a search spider bot attempts to parse the line as disallowing /images/#comment, which is not a folder on the server; this would more than likely cause the bot to just leave the website altogether.

It is better to put your comments on their own separate lines. See the example below.

#keeps googlebot out of my porn
User-agent: googlebot
Disallow: /images/

So, as we can see, there are valid and legitimate reasons to use the robots.txt file. In some cases it can keep a large company from looking foolish for not protecting its intellectual property; in others it can stop sensitive data from being crawled and indexed across the internet; and it can also help a site improve its positions in the natural, organic search results.

After you have written your robots.txt file and placed it on your server, you should validate it with one of the robots.txt validation tools online.
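If you want to double-check the file yourself, Python's standard library includes a robots.txt parser you can run locally. Below is a minimal sketch; it assumes your file follows the examples in this article and that your site lives at www.yourcoolsite.com, so substitute your own domain and test URLs:

from urllib.robotparser import RobotFileParser

# point the parser at the live robots.txt file and fetch it
rp = RobotFileParser()
rp.set_url("http://www.yourcoolsite.com/robots.txt")
rp.read()

# confirm the rules behave as intended (the test paths are hypothetical)
print(rp.can_fetch("googlebot", "http://www.yourcoolsite.com/php/index.php"))  # expect: False
print(rp.can_fetch("googlebot", "http://www.yourcoolsite.com/index.html"))     # expect: True
print(rp.can_fetch("*", "http://www.yourcoolsite.com/cgi-bin/form.cgi"))       # expect: False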
