Blocking Complicated URLs with Robots.txt - More Rules
(Page 4 of 5 )
2. Blocking all of the dynamic URLs of a single file containing different query strings
Suppose you need to block all /product.asp URLs and you find out that they are all dynamic, with query strings such as:
/product.asp?idproduct=1
/product.asp?idproduct=5
/product.asp?idproduct=3
/product.asp?idproduct=4
/product.asp?idproduct=8
This list is small, but on a real dynamic web site it could grow to thousands of URLs. Listing every URL you want to block in the robots.txt file would be impractical, and the file would need updating whenever a product is added. The following approach, therefore, is NOT recommended:
User-agent: *
Disallow: /product.asp?idproduct=1
Disallow: /product.asp?idproduct=5
Disallow: /product.asp?idproduct=3
Disallow: /product.asp?idproduct=4
Disallow: /product.asp?idproduct=8
A Disallow rule matches any URL path that begins with the given value, so you can block them all in one line by naming the file itself and leaving out the query strings. The correct robots.txt syntax for this is simply:
User-agent: *
Disallow: /product.asp
The above syntax will block all of the /product.asp pages and their query-related URLs.
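The prefix-matching behavior can be checked with Python's standard urllib.robotparser module. This is a minimal sketch rather than part of the original technique; the paths /products/list.html and /product.aspx?id=1 are hypothetical test URLs added for illustration.

```python
from urllib.robotparser import RobotFileParser

# The one-line rule from the article.
ROBOTS_TXT = """\
User-agent: *
Disallow: /product.asp
"""


def build_parser(text: str) -> RobotFileParser:
    """Load robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

paths = [
    "/product.asp?idproduct=1",
    "/product.asp?idproduct=8",
    "/products/list.html",   # hypothetical: different path, not blocked
    "/product.aspx?id=1",    # hypothetical: shares the prefix, so blocked
]

for path in paths:
    verdict = "allowed" if parser.can_fetch("*", path) else "blocked"
    print(f"{path} -> {verdict}")
```

Because the rule is a prefix match, it also catches any other path that starts with /product.asp, such as the hypothetical /product.aspx above. Keep that in mind if similarly named files should stay crawlable.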
3. Blocking a specific folder name that may occur at different directory levels, associated with different categories and different dates in a blog structure.
The best example of this issue is WordPress feed URLs. Trackback URLs follow the same type of structure. Consider the examples below:
http://www.thisisasampledomain.com/blog/2007/10/20/post1/feed/
http://www.thisisasampledomain.com/blog/2007/10/20/post2/feed/
http://www.thisisasampledomain.com/blog/2007/10/20/feed/
http://www.thisisasampledomain.com/blog/feed
http://www.thisisasampledomain.com/feed
Listing each URL individually, as in the syntax below, does not block these properly:
User-agent: *
Disallow: /blog/2007/10/20/post1/feed
Disallow: /blog/2007/10/20/post2/feed
Disallow: /blog/2007/10/20/feed
Disallow: /blog/feed
Disallow: /feed
This is a more challenging problem, as /feed is associated with different posts, different dates and different categories. The above syntax blocks only the feed URLs that are explicitly listed, such as those for post 1 and post 2. But what if you add another post? You would need to change the robots.txt file every time, which is not advisable.
The correct approach uses wildcard pattern matching in the robots.txt file. This is not full regular expression support; it is the * wildcard, an extension to the original robots.txt convention that major crawlers such as Googlebot honor. All /feed URLs can be blocked using the syntax below:
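The maintenance problem can be illustrated with a few lines of Python. This is only a sketch that assumes plain prefix matching of each Disallow value against the URL path; the post3 URL is a hypothetical new post, not one from the list above.

```python
# The enumerated Disallow values from the robots.txt above.
LISTED_RULES = [
    "/blog/2007/10/20/post1/feed",
    "/blog/2007/10/20/post2/feed",
    "/blog/2007/10/20/feed",
    "/blog/feed",
    "/feed",
]


def is_blocked_by_prefix(path: str, rules: list[str]) -> bool:
    """Return True if the path starts with any of the listed rule values."""
    return any(path.startswith(rule) for rule in rules)


# A feed that was listed explicitly is blocked...
print(is_blocked_by_prefix("/blog/2007/10/20/post1/feed/", LISTED_RULES))
# ...but the feed of a newly added (hypothetical) post is not.
print(is_blocked_by_prefix("/blog/2007/10/20/post3/feed/", LISTED_RULES))
```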
User-agent: *
Disallow: */feed
The same approach works for WordPress trackback URLs:
User-agent: *
Disallow: */trackback
Combining the two rules into a single robots.txt record looks like this:
User-agent: *
Disallow: */feed
Disallow: */trackback
These rules block all feed and trackback URLs in the WordPress blog, regardless of the post title or the directory level at which they appear.
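To test wildcard rules before deploying them, a small matcher can approximate how a wildcard-aware crawler reads them. The sketch below is a hand-written approximation, not any crawler's actual implementation: it assumes * matches any run of characters, a trailing $ anchors the end of the URL, and everything else is matched literally from the start of the path. The function names and the /blog/feedback/ path are hypothetical.

```python
import re

# The combined rules from the article.
RULES = ["*/feed", "*/trackback"]


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    Assumed semantics: '*' matches any sequence of characters and a
    trailing '$' anchors the end of the URL; all else is literal.
    """
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape the literal pieces and join them with '.*' for each wildcard.
    pattern = ".*".join(re.escape(piece) for piece in body.split("*"))
    if anchored:
        pattern += "$"
    return re.compile(pattern)


def is_blocked(path: str, rules: list[str]) -> bool:
    """Return True if any rule matches from the start of the path."""
    return any(rule_to_regex(rule).match(path) for rule in rules)


for path in [
    "/blog/2007/10/20/post1/feed/",
    "/blog/feed",
    "/feed",
    "/blog/2007/10/20/post2/trackback/",
    "/blog/2007/10/20/post1/",
    "/blog/feedback/",  # hypothetical: also matches, since the rule is a prefix
]:
    print(path, "->", "blocked" if is_blocked(path, RULES) else "allowed")
```

Under these assumptions the unanchored rule also matches any path that merely contains /feed, such as the hypothetical /blog/feedback/ page, so it is worth running a site's real URLs through a check like this before publishing the rules.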
More By Codex-M