Page 1 (home page) --links to--> Page 2 (blocked by robots.txt) --links to--> Page 3 (crawlable and indexable page)
Crawling results:
As expected, no search engine bot crawled the blocked URL, http://www.php-developer.org/blockedbyrobots.php, so all of them obey robots.txt.
Because http://www.php-developer.org/blockedrobotslink.php is referenced by the blocked URL, which was not crawled, none of the three main search engine bots crawled it either.
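The crawling behavior above can be reproduced offline with Python's standard urllib.robotparser module. This is only a minimal sketch: the robots.txt rule below is an assumption (the actual file used in the experiment is not shown here), and is_crawl_allowed is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: the experiment's real robots.txt is not shown in the article;
# this rule is a plausible stand-in that blocks only Page 2.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /blockedbyrobots.php
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a bot honoring robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked_url = "http://www.php-developer.org/blockedbyrobots.php"
linked_url = "http://www.php-developer.org/blockedrobotslink.php"

for url in (blocked_url, linked_url):
    print(url, is_crawl_allowed(ASSUMED_ROBOTS_TXT, "Googlebot", url))
```

Under this assumed rule, the linked page itself is not disallowed. It goes uncrawled because the page that references it is never fetched, so bots never discover the link by crawling.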
Indexing results:
Surprisingly, Google indexes both URLs:
Neither Yahoo nor MSN indexes either of the above URLs. This means Google treats the blocked URL differently. Even though the blocked URL does not show up as crawled in the logs (see the crawling results above), Google alone can index these URLs purely because of the links that point to them.
Conclusions and Recommendations
Many applications follow from the results of this experiment, but the most important are as follows:
Preventing duplicate content issues in osCommerce and similar CMS-based templates/websites. These templates use many product/content categories and paginated listings that are highly similar to each other and do not need to be indexed, so an SEO professional can simply suggest placing <meta name="robots" content="noindex"> on the category/pagination URLs. This lets search engines ignore the duplicate content URLs (categories/pagination) while still allowing them to index the product URLs, important inner content, or posts (with the exception of the Bing search engine; see results above).
Completely preventing search engines from indexing a particular page with sensitive content. Since URLs blocked by robots.txt can still appear in search engine results, the best method is to place <meta name="robots" content="noindex, nofollow"> on any URL that should never be indexed by search engines. Note that search engines will also not follow the links on that page, so important/indexable URLs deeper in the site's structure may never be crawled and indexed.
Saving bandwidth consumed by search engine bots on unimportant URLs. The best approach for this is robots.txt, because the three main search engine bots (Google, Yahoo and MSN) do not crawl URLs blocked by robots.txt, as the experiment shows. Bear in mind that Google can still find and index URLs found on robots.txt-blocked pages, so this may be a concern if you have sensitive data.
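To check that a template actually emits the robots meta tags recommended above, something like the following could be used. It is a rough sketch built on Python's standard html.parser module; RobotsMetaParser and robots_directives are hypothetical names, and the two HTML snippets are made-up samples, not pages from the experiment.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma-separated, e.g. "noindex, nofollow".
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


# Made-up sample pages for illustration only.
category_page = '<html><head><meta name="robots" content="noindex"></head></html>'
sensitive_page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'

print(robots_directives(category_page))
print(robots_directives(sensitive_page))
```

A category or pagination page should report only noindex, while a sensitive page should report both noindex and nofollow.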