Blocking Complicated URLs with Robots.txt - Google Webmaster tools Robots.txt Analysis Tool
(Page 3 of 5 )
Google's webmaster tools include a robots.txt analysis tool that lets webmasters test a robots.txt file before uploading it to the server. The objective is to confirm that the URLs you intend to block are actually blocked and to catch syntax errors.
The advantage is that you can test any web site's robots.txt in the tool, even if the site is not verified in your webmaster tools account. You just need a Google webmaster tools account.
To use the robots.txt tool, first add the website URL (only the homepage URL) on the dashboard, then click "add site." After the site is added, click the website in the dashboard, then "Tools," and finally, "Analyze Robots.txt."
[Screenshot: the robots.txt analysis tool in Google Webmaster Tools]
It is important to know that Google is case sensitive when blocking URLs, so if you have blocked /Folder, /folder can still be indexed by Google because its "f" is lower case and you have only blocked the version with the upper case "F."
Also, since the homepage URL is the most important part of any web site, it is highly recommended that you always include it in the "Test URLs against this robots.txt file" analysis.
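The case-sensitivity point can be checked locally with a minimal sketch. It assumes Python's standard-library urllib.robotparser, which is not Google's parser but also compares paths case-sensitively; the test paths /Folder/page.html and /folder/page.html are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules that block only the capitalised folder name.
rules = [
    "User-agent: *",
    "Disallow: /Folder",
]

parser = RobotFileParser()
parser.parse(rules)

# Paths are compared case-sensitively, so only the exact-case prefix is blocked.
upper_allowed = parser.can_fetch("*", "/Folder/page.html")
lower_allowed = parser.can_fetch("*", "/folder/page.html")
print("Can fetch /Folder/page.html:", upper_allowed)
print("Can fetch /folder/page.html:", lower_allowed)
```

If both spellings exist on your server, each one needs its own Disallow line.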
What are complicated URLs? What are the rules for blocking items with robots.txt?
Complicated URLs are often dynamic URLs, the type of URL that cannot be blocked with ordinary robots.txt syntax. Below I've listed difficult URLs commonly found in e-commerce sites and blogs:
1. Blocking /folder/ to avoid duplicate content with /folder/default.asp when there are other files under /folder. This is tricky, though it looks uncomplicated, because it creates a conflict with /folder. In the Microsoft IIS structure, /folder and /folder/default.asp are one and the same. Assuming you have other files in /folder such as:
/folder/fileone.asp
/folder/filetwo.asp
If you use the syntax below, it blocks the entire contents of /folder; none of the files will be indexed, which is not what you want.
User-agent: *
Disallow: /folder
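To see why this rule is too broad, here is a small sketch that runs the file names from above through Python's standard-library urllib.robotparser. This is an assumption for illustration: it is not Google's parser, but it treats a Disallow value as a path prefix in the same way. The path /other.asp is hypothetical and only serves as a control.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /folder"])

paths = [
    "/folder",
    "/folder/default.asp",
    "/folder/fileone.asp",
    "/folder/filetwo.asp",
    "/other.asp",  # hypothetical control path outside /folder
]

# A Disallow value is a prefix, so every path starting with /folder is blocked.
blocked = [p for p in paths if not parser.can_fetch("*", p)]
print(blocked)
```

Everything that starts with /folder ends up in the blocked list, including the files you still want indexed.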
To block properly, you need to use the Allow directive.
User-agent: *
Disallow: /folder/
Allow: /folder/default.asp
Allow: /folder/fileone.asp
Allow: /folder/filetwo.asp
The above syntax should block /folder/ itself without affecting the files listed in the Allow lines. But please note that Google may not find these files. Therefore, your homepage should always include a consistent navigation link pointing to those files so that they can be crawled.
The only disadvantage of this technique is that if you add new files under /folder, you will need to update your robots.txt file so that they will be indexed by Google.
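The interaction between Disallow and Allow can be sketched as follows. This is a simplified illustration, not Google's implementation: it assumes the precedence Google documents (the longest matching path wins, and Allow wins a tie) and ignores wildcards. The function name is_allowed and the file /folder/filethree.asp are hypothetical.

```python
def is_allowed(path, rules):
    """Return True if path may be crawled under the (directive, prefix) rules."""
    best_len = -1
    allowed = True  # no matching rule means crawling is allowed
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # The longer prefix wins; on equal length, Allow beats Disallow.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [
    ("Disallow", "/folder/"),
    ("Allow", "/folder/default.asp"),
    ("Allow", "/folder/fileone.asp"),
    ("Allow", "/folder/filetwo.asp"),
]

test_paths = [
    "/folder/",
    "/folder/default.asp",
    "/folder/fileone.asp",
    "/folder/filetwo.asp",
    "/folder/filethree.asp",  # hypothetical new file, not yet in robots.txt
]
results = {p: is_allowed(p, rules) for p in test_paths}
for p, ok in results.items():
    print(p, "allowed" if ok else "blocked")
```

The hypothetical /folder/filethree.asp comes out blocked because only the Disallow line matches it, which is exactly why robots.txt has to be updated whenever a new file is added.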