Blocking Complicated URLs with Robots.txt

If you have a large web site, you might have some content that you do not want the search engines to index, perhaps for duplicate content reasons or simply because you don't want someone casually stumbling across it. You know you can use robots.txt, but what if you need to block thousands of pages, or block only certain files within a folder? This article explains some of the more advanced uses of robots.txt. You will even learn how to block dynamic pages!

An Overview of the Robots.txt File

Robots.txt is one of the most important files you can place on a web server. The main uses of the robots.txt file are:

  1. It tells bots which URLs should not be crawled.

  2. When bots receive these restrictions, they focus their crawling on the parts of your web site that are not restricted.

The main uses are very simple, but actually using robots.txt can be complex, and a mistake can remove your site from the search engine index. The objective of this article is to provide advanced robots.txt techniques for blocking complicated URLs.

To use a robots.txt file you need server access or FTP access. This tutorial assumes you satisfy the following requirements:

  1. You have a web site over which you have full control, with FTP access to the root directory.

  2. You have registered your web site with Google Webmaster Tools.

Any webmaster can control which parts of their website can be crawled; the problem is the syntax of the robots.txt file. It is sometimes difficult to create the robots.txt syntax correctly without proper training and tools. After you finish reading this tutorial, you should have full knowledge of how to handle robots.txt.

The basic syntax of robots.txt is:


User-agent: *

Disallow: /file or folder to be blocked

Allow: /file or folder to be allowed


This file should be uploaded to the root directory of your website to function properly and avoid conflicts. The first line, User-agent: *, means that the rules apply to all bots. To avoid serious problems, it is highly recommended that you use "*" in the user-agent line; writing separate rules for individual bots (instead of covering all of them at once) increases the risk of mistakes and can cause different search engines to treat your content inconsistently.
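If you want a quick sanity check before involving any webmaster tool, Python's standard urllib.robotparser module can apply a simple rule set locally. The sketch below uses a made-up domain and path; note that this parser implements the original robots exclusion standard with plain prefix matching and first-match semantics, so it will not reproduce every behavior Google supports.


import urllib.robotparser

# A simple rule set: block everything under /private for all bots.
rules = """
User-agent: *
Disallow: /private
""".splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

# The host name is only illustrative; the parser matches on the path.
print(parser.can_fetch("Googlebot", "http://www.example.com/private/photo.jpg"))  # False
print(parser.can_fetch("Googlebot", "http://www.example.com/index.html"))         # True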

This discussion focuses on using the robots.txt file with Google. The principles in this article can still be applied to other search engines, such as Yahoo; however, the practical examples illustrated here were tested with Google.

As discussed, robots.txt is one of the most powerful web server files, for the following reasons:


  1. Since it provides instructions to search engine robots and other bots that tell them which directories and files on your web server are not to be crawled, you save a lot of bandwidth for the web site. You can then divert that saved bandwidth to other purposes, such as improving the experience of your visitors, multimedia applications and other uses. Bandwidth is expensive for a web site, especially if you have a lot of visitors.

  2. Also, since search engine robots know which parts of your website are to be crawled, they will index your site very efficiently. Crawling efficiency is very important for big sites with frequently updated content. E-commerce websites that add new products on a daily basis can benefit from frequent crawling. New pages appear in the search engine index sooner, which helps increase the number of visitors to your web site.

  3. No one will steal your protected content in the search engine results. For example, let us suppose you are a professional photographer with a lot of photos saved on your web server. If you are not using robots.txt, search engine bots can crawl every part of your web site, and it is highly possible that they might crawl and index your protected pictures. Then, when someone searches for images using a search engine (like Google or Yahoo), they might see your pictures and use them elsewhere — such as their own web site — or even alter and/or sell them without your permission!

Google Webmaster Tools includes the very important robots.txt analysis tool, which helps webmasters test their robots.txt file before uploading it to the server. The objective is to check whether URLs are blocked as intended and to check for syntax errors.

The advantage is that you can test any web site’s robots.txt in the tool, even if the site is not verified in your Webmaster Tools account. You just need a Google Webmaster Tools account.

To use the robots.txt tool, first add the website URL (only the homepage URL) on the dashboard, then click “add site.” After the site is added, click the website in the dashboard, then “Tools,” and finally, “Analyze Robots.txt.”

Below is what the robots.txt analysis tool looks like in Google Webmaster Tools:



It is highly important to know that Google is case sensitive when blocking URLs, so if you have blocked /Folder, /folder can still be indexed by Google, because its "f" is lower case and you have only blocked the version with the upper case "F."

Also, since the homepage URL is the most important part of any web site, it is highly recommended that you always include it in the “Test URLs against this robots.txt file” analysis.
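You can reproduce this kind of check locally as well. The sketch below, again using Python's urllib.robotparser with a made-up rule and made-up URLs, shows that a Disallow for /Folder does not cover /folder, and that the homepage stays crawlable.


import urllib.robotparser

rules = """
User-agent: *
Disallow: /Folder
""".splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

# Prefix matching is case sensitive: only the upper-case path is blocked.
print(parser.can_fetch("Googlebot", "http://www.example.com/Folder/page.html"))  # False
print(parser.can_fetch("Googlebot", "http://www.example.com/folder/page.html"))  # True

# Always confirm the homepage itself is still allowed.
print(parser.can_fetch("Googlebot", "http://www.example.com/"))  # True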

What are complicated URLs? What are the rules for blocking items with robots.txt?

Complicated URLs are often dynamic URLs, and therefore the type of URL that cannot be blocked by ordinary robots.txt syntax. Below I’ve listed difficult URLs commonly found in e-commerce sites and blogs:

1. Blocking /folder/ to avoid duplicate content with /folder/default.asp when there are other files under /folder. This is tricky, though it looks uncomplicated, because it creates a conflict with /folder. In the Microsoft IIS structure, /folder and /folder/default.asp are the same page. Assume you have other files in /folder, such as:

/folder/fileone.asp

/folder/filetwo.asp


If you use the syntax below, it blocks the entire contents of /folder; none of the files will be indexed, which is not what you want.


User-agent: *

Disallow: /folder


To block it properly, you need to use the Allow directive.


User-agent: *

Disallow: /folder/

Allow: /folder/default.asp

Allow: /folder/fileone.asp

Allow: /folder/filetwo.asp


The above syntax should block only /folder/ and not affect the files under it. Please note, however, that Google may not find these files on its own. Therefore, your homepage should always include consistent navigation links pointing to those files so that they can be crawled.

The only disadvantage with this technique is that if you add new files under /folder , you will need to update your robots.txt file so that they will be indexed by Google.
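If you are wondering how a crawler resolves the conflict between Disallow: /folder/ and Allow: /folder/default.asp, Google documents that the most specific (longest) matching rule wins, with Allow winning ties. The sketch below is a simplified, hypothetical illustration of that precedence rather than Google's actual implementation; the paths are the ones from the example above.


def is_blocked(path, rules):
    # rules is a list of (directive, pattern) pairs taken from robots.txt.
    # The longest matching pattern wins; Allow wins if lengths are equal.
    best_length, allowed = -1, True
    for directive, pattern in rules:
        if path.startswith(pattern) and (
            len(pattern) > best_length
            or (len(pattern) == best_length and directive == "Allow")
        ):
            best_length, allowed = len(pattern), directive == "Allow"
    return not allowed

rules = [
    ("Disallow", "/folder/"),
    ("Allow", "/folder/default.asp"),
    ("Allow", "/folder/fileone.asp"),
    ("Allow", "/folder/filetwo.asp"),
]

print(is_blocked("/folder/", rules))             # True  -> blocked
print(is_blocked("/folder/default.asp", rules))  # False -> still crawlable
print(is_blocked("/folder/fileone.asp", rules))  # False -> still crawlable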

2. Blocking all of the Dynamic URLs of a single file containing different query strings

Suppose you need to block all of the /product.asp pages and you find that the URLs are all dynamic, with query strings such as:

/product.asp?idproduct=1

/product.asp?idproduct=5

/product.asp?idproduct=3

/product.asp?idproduct=4

/product.asp?idproduct=8


This list is small, but in actual dynamic web sites it could grow to thousands of URLs. It would be impossible to list all of the URLs you want to block in the robots.txt file. The following approach, therefore, is NOT recommended:

User-agent: *

Disallow: /product.asp?idproduct=1

Disallow: /product.asp?idproduct=5

Disallow: /product.asp?idproduct=3

Disallow: /product.asp?idproduct=4

Disallow: /product.asp?idproduct=8


The advanced technique lets you block them all in one line by blocking the file itself, without listing the query strings. The correct robots.txt syntax for this is simply:


User-agent: *

Disallow: /product.asp


The above syntax will block all of the /product.asp pages along with their query-string variations.
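Because Disallow rules are prefix matches, the single line above covers every query-string variation. Here is a quick local check of that behavior, again with Python's urllib.robotparser and a made-up host name.


import urllib.robotparser

rules = """
User-agent: *
Disallow: /product.asp
""".splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

# Every query-string variation shares the /product.asp prefix, so all are blocked.
for query in ("?idproduct=1", "?idproduct=5", "?idproduct=8", ""):
    url = "http://www.example.com/product.asp" + query
    print(url, parser.can_fetch("Googlebot", url))  # False for each

# Other pages are unaffected.
print(parser.can_fetch("Googlebot", "http://www.example.com/contact.asp"))  # True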


3. Blocking a specific folder name that may occur at different directory levels, associated with different categories and different dates in a blog structure.


The best example of this issue is WordPress feed URLs. Trackback URLs also follow this type of structure. Consider the examples below:


http://www.thisisasampledomain.com/blog/2007/10/20/post1/feed/

http://www.thisisasampledomain.com/blog/2007/10/20/post2/feed/

http://www.thisisasampledomain.com/blog/2007/10/20/feed/

http://www.thisisasampledomain.com/blog/feed

http://www.thisisasampledomain.com/feed


These cannot be blocked properly using the syntax below:


User-agent: *

Disallow: /blog/2007/10/20/post1/feed

Disallow: /blog/2007/10/20/post2/feed

Disallow: /blog/2007/10/20/feed

Disallow: /blog/feed

Disallow: /feed


This is a more challenging problem, as /feed is associated with different posts, different dates and different categories. The above syntax blocks only the feed URLs that are explicitly listed. But what if you add another post? You would have to keep changing the robots.txt file, which is not advisable.

The correct approach involves applying wildcard pattern matching in the robots.txt file. All of the /feed URLs can be blocked using the syntax below:

User-agent: *

Disallow: */feed


A similar scenario can be applied to WordPress trackback URLs:


User-agent: *

Disallow: */trackback


Combining the two rules into one robots.txt file looks like this:


User-agent: *

Disallow: */feed

Disallow: */trackback


This will block all feed and trackback URLs, regardless of the post title or directory level in the WordPress blog.
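The * wildcard is not part of the original robots exclusion standard (Python's urllib.robotparser ignores it, for instance), but the matching behavior Google documents is easy to sketch: translate * into "anything" and anchor the pattern at the start of the path. The code below is a simplified illustration, and the paths are the hypothetical feed URLs from the example above.


import re

def matches_rule(path, pattern):
    # Translate the robots.txt * wildcard into a regular expression and
    # anchor the match at the beginning of the path (a simplified sketch).
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.match(regex, path) is not None

for path in (
    "/blog/2007/10/20/post1/feed/",
    "/blog/2007/10/20/feed/",
    "/blog/feed",
    "/feed",
    "/blog/2007/10/20/post1/",  # an ordinary post page, should stay crawlable
):
    blocked = matches_rule(path, "*/feed") or matches_rule(path, "*/trackback")
    print(path, "blocked" if blocked else "allowed")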


4. Blocking a particular part of the overall folder name

Examples of this include the following:

http://www.thisisasampledomain.com/(X(zjaksjjwsdjwjehrhejjdjhfhrhe))/folder/productinfo.aspx?id=201

http://www.thisisasampledomain.com/(X(tyntnrnendnfngnrnennwnswme))/folder/productinfo.aspx?id=205

http://www.thisisasampledomain.com/(X(yturnjfhdjwhdgdbvfvgcbdbsbae))/folder/productinfo.aspx?id=306


And depending on the site’s purpose, there may be thousands of them. That would make it impossible to list them one by one in the robots.txt file. The correct approach is to identify a unique pattern.

Based on the above URLs, there is a particular part of the URL that is repetitive. This is /(X

However, since /(X is associated with different URLs and different query strings, it cannot be blocked using ordinary robots.txt syntax. This means we must once again make use of wildcards.

Since we are only interested in blocking all of the URLs that contain /(X, we can use a pattern like this:

User-agent: *

Disallow: /(X(*/

The above syntax will block all dynamic URLs whose paths begin with /(X(, whatever appears inside the parentheses. This is a very useful approach for big dynamic websites plagued by massive duplicate content.
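The same simplified wildcard-to-regex translation used in the feed example can confirm the pattern before you rely on it. The session-style URLs below are the hypothetical examples from above.


import re

def matches_rule(path, pattern):
    # Same simplified wildcard-to-regex translation as in the feed example.
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.match(regex, path) is not None

print(matches_rule("/(X(zjaksjjwsdjwjehrhejjdjhfhrhe))/folder/productinfo.aspx?id=201", "/(X(*/"))  # True
print(matches_rule("/(X(tyntnrnendnfngnrnennwnswme))/folder/productinfo.aspx?id=205", "/(X(*/"))    # True
print(matches_rule("/folder/productinfo.aspx?id=201", "/(X(*/"))                                    # False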

Important: always test your robots.txt file with Google Webmaster Tools before uploading it to your root directory, to confirm that it blocks the URLs you intend to block and does not affect other URLs.
