Data collection, as mentioned in the first part, was done with Crawl Track, a web analytics script used to detect search engine bot visits to the test pages. The experiment ran from October 17, 2009 through November 20, 2009, just over a month of data collection and observation.
After logging in to the Crawl Track webmaster dashboard and visiting the “Crawler” section, you can see the complete list of bots that visit the website:
For example, to check the pages crawled by Slurp Inktomi (the Yahoo crawler), all I have to do is click its entry; the detailed list of visited URLs is then revealed. Since the test pages have Crawl Track code embedded in them using PHP, visits to them are traced the same way.
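Crawl Track keeps its own records, but the same kind of check can be cross-checked against a plain web server access log. The following is a minimal sketch of that separate technique, not Crawl Track's method. It assumes a combined-format log, and the user-agent substrings, the function name bot_visits and the sample log lines are all illustrative inventions, not data from this experiment.

```python
import re
from collections import defaultdict

# Assumed user-agent substrings for the three crawlers discussed in this article.
BOT_SIGNATURES = {
    "googlebot": "Google",
    "slurp": "Yahoo",
    "msnbot": "MSN",
}

# Captures the request path and the user agent from a combined-format log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def bot_visits(lines):
    """Return {engine name: set of paths requested by that engine's bot}."""
    visits = defaultdict(set)
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        agent = match.group("agent").lower()
        for signature, engine in BOT_SIGNATURES.items():
            if signature in agent:
                visits[engine].add(match.group("path"))
    return dict(visits)


# Invented sample lines (hypothetical path and addresses), for illustration only.
SAMPLE_LOG = [
    '192.0.2.1 - - [17/Oct/2009:10:00:00 +0000] "GET /some-test-page.php HTTP/1.1" '
    '200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.7 - - [17/Oct/2009:10:01:00 +0000] "GET /index.php HTTP/1.1" '
    '200 1024 "-" "Mozilla/5.0 (Windows; U)"',
]

print(bot_visits(SAMPLE_LOG))
```

A test URL that never appears in any engine's set over the observation window corresponds to a “not crawled” result in the sections that follow.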
All tests and experiments are hosted at www.php-developer.org . If you are ready to see the results, keep reading.
The first question we asked was whether the three main search engine bots would refuse to crawl and index a page (“Page 2” in the illustration below) that is referenced by a hyperlink with the “rel=nofollow” attribute.
Affected URL: http://www.php-developer.org/Linkrelnofollow.php (page 2 in legend)
Illustration of the question:
Page 1 (Home page) link with rel="nofollow" attribute --> Page 2 (Crawlable and indexable page)
The URL that is referenced by the rel=nofollow is http://www.php-developer.org/Linkrelnofollow.php. Checking the crawl track logs reveals that no search engine bots ever visited this URL.
This indicates that, during the test period, all three main search engines (Yahoo, MSN and Google) obeyed the “rel=nofollow” link attribute.
I found that the same thing is true when checking to see if pages are indexed. None of the three main search engines indexed the URL: http://www.php-developer.org/Linkrelnofollow.php
So it is reasonable to say that the rel=nofollow attribute, which is applied at the link level, is one effective way to keep search engine bots from reaching the target page through that link. Of course, this technique alone cannot guarantee that the target page stays uncrawled and unindexed, especially if other links point to it, for example from other domains.
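To audit which links on a page carry the attribute, a short script is enough. The sketch below uses Python's standard html.parser module; the class name LinkCollector, the function classify_links and the sample markup are assumptions made for illustration, since the actual home page markup is not shown in this article.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Sorts <a href> links by whether their rel attribute contains nofollow."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollowed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # rel may hold several space-separated tokens, e.g. "external nofollow".
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollowed.append(href)
        else:
            self.followed.append(href)


def classify_links(html):
    """Return (followed links, nofollowed links) found in the given HTML."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.followed, collector.nofollowed


# Hypothetical markup resembling the Page 1 links in this experiment.
SAMPLE_PAGE = """
<a href="/Linkrelnofollow.php" rel="nofollow">Page 2</a>
<a href="/noindexexperiment.php">Another page</a>
"""

followed, nofollowed = classify_links(SAMPLE_PAGE)
print("followed:", followed)
print("nofollowed:", nofollowed)
```

This only reports what the markup asks of crawlers. Whether a given bot honors the request is what the crawl logs measure.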
The second question we asked was whether the three main search engine bots crawl the links on a page that carries the <meta name="robots" content="noindex"> tag and end up indexing “Page 3” (see illustration below).
The third question we asked was whether, using the illustration below, all main search engines would crawl and index Page 2.
http://www.php-developer.org/noindexexperiment.php (Target URL, page 2 in legend)
http://www.php-developer.org/noindexlinktarget.php (Target URL, inner second hyperlink, page 3 in legend)
Illustration of the question:
Page 1 (Home page) link to --> Page 2 (Tag with <meta name="robots" content="noindex">) link to --> Page 3 (Crawlable and indexable page)
Surprisingly, only Googlebot ended up crawling this URL: http://www.php-developer.org/noindexlinktarget.php, but all three bots crawled the referencing URL: http://www.php-developer.org/noindexexperiment.php, which has <meta name="robots" content="noindex"> on it. That crawl is expected, since a bot has to fetch a page in order to read its meta tag.
http://www.php-developer.org/noindexexperiment.php is not indexed by any of the three main search engines. They are all consistent in obeying the <meta name="robots" content="noindex"> tag.
Only Google and Yahoo indexed this URL: http://www.php-developer.org/noindexlinktarget.php , while Bing ignored it. Oddly, Yahoo was not detected crawling this URL, yet ended up indexing it. Although the linking URL (noindexexperiment.php) was crawled, no inbound links from other domains were detected pointing to http://www.php-developer.org/noindexlinktarget.php at the time this data was gathered.
Our fourth question concerned whether the main search engine bots crawl links on “Page 2” (see illustration below), which includes a <meta name="robots" content="noindex, nofollow"> tag. This is similar to the above question, but includes a “nofollow” in the meta robots tag.
Our fifth question: using the illustration below, do all main search engines crawl and index “Page 3”?
http://www.php-developer.org/noindexnofollow.php (Target URL, “page 2” in the legend)
http://www.php-developer.org/noindexnofollowtarget.php (Target URL, inner second hyperlink, “page 3” in the legend)
Illustration of the question:
Page 1 (Home page) link to --> Page 2 (Tag with <meta name="robots" content="noindex, nofollow">) link to --> Page 3 (Crawlable and indexable page)
Only Googlebot and MSN bot crawled the linking/reference URL: http://www.php-developer.org/noindexnofollow.php which has <meta name="robots" content="noindex, nofollow"> on it.
No search engine bots crawled the inner target URL: http://www.php-developer.org/noindexnofollowtarget.php, even though it is completely indexable/crawlable. This is presumably because it is referenced only by a URL whose meta tag tells bots not to follow its links: http://www.php-developer.org/noindexnofollow.php .
Many SEO practitioners assume that <meta name="robots" content="noindex, nofollow"> is the same as <meta name="robots" content="noindex">, but the two behave differently in Google.
Even if a page has <meta name="robots" content="noindex"> , Google still follows the links on that page and will probably index the target URLs it references, as shown in the answers to Questions #2 and #3.
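The difference between the two tags can be made concrete by reading the directives out of a page. The sketch below, again using Python's standard html.parser, turns a page's meta robots tag into separate “index” and “follow” flags. The names MetaRobotsParser and robots_policy are illustrative assumptions. The flags describe what the tag requests, not what any particular crawler was observed to do.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for token in content.split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_policy(html):
    """Return what the page requests: may it be indexed, may its links be followed."""
    parser = MetaRobotsParser()
    parser.feed(html)
    found = parser.directives
    # "none" is shorthand for "noindex, nofollow".
    return {
        "index": "noindex" not in found and "none" not in found,
        "follow": "nofollow" not in found and "none" not in found,
    }


print(robots_policy('<meta name="robots" content="noindex">'))
print(robots_policy('<meta name="robots" content="noindex, nofollow">'))
```

With plain noindex the “follow” flag stays on, which matches Google reaching Page 3 in the earlier test. Adding nofollow switches it off, which matches no bot reaching Page 3 here.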
As expected, no search engines indexed the URL with the noindex/nofollow tag: http://www.php-developer.org/noindexnofollow.php , although Googlebot and MSN bot crawled that URL. Crawling and indexing are two entirely different search engine processes.
And since http://www.php-developer.org/noindexnofollowtarget.php is referenced only by it, this URL was also not indexed by any of the three major search engines. Bear in mind that http://www.php-developer.org/noindexnofollowtarget.php can still be indexed if there are links pointing to this page from other domains.
For our sixth question, we wanted to know whether search engine bots crawl and index “Page 2,” which is blocked in robots.txt (see illustration below).
Likewise, for our seventh question, using the illustration below, we asked whether some search engine bots crawl and index “Page 3.”
Illustration of the question:
Page 1 (Home page) link to --> Page 2 (This page blocked by robots.txt) then link to --> Page 3 (Crawlable and indexable page)
As expected, no search engine bots crawled the blocked URL: http://www.php-developer.org/blockedbyrobots.php , so all of them obeyed robots.txt.
Since http://www.php-developer.org/blockedrobotslink.php is referenced only by the blocked URL, which was not crawled, http://www.php-developer.org/blockedrobotslink.php was also not crawled by any of the three main search engine bots.
Surprisingly, Google indexed both URLs, blockedbyrobots.php and blockedrobotslink.php.
Neither Yahoo nor MSN indexed either of these URLs. This means Google treats the blocked URL differently. Even though the blocked URL does not show up as crawled in the logs (see the result above), Google alone can index such URLs purely on the basis of the links that point to them.
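Whether a given URL is blocked for a given bot can be checked offline with Python's standard urllib.robotparser module. The sketch below assumes a robots.txt rule, because the actual file used in the experiment is not shown in this article; the helper name is_crawl_allowed is also just illustrative.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule: the real robots.txt of the test site is not reproduced here.
ROBOTS_TXT = """\
User-agent: *
Disallow: /blockedbyrobots.php
"""


def is_crawl_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for page in ("blockedbyrobots.php", "blockedrobotslink.php"):
    url = "http://www.php-developer.org/" + page
    print(page, is_crawl_allowed(ROBOTS_TXT, "Googlebot", url))
```

Note that this answers only the crawling question. As the Google result shows, a URL that may not be fetched can still end up in the index.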
Conclusions and Recommendations
Several practical applications follow from the results of this experiment; the most important are as follows:
Preventing duplicate content issues in osCommerce and similar CMS-based templates or websites. These templates generate many product/content category and pagination pages that are highly similar to each other and do not need to be indexed, so an SEO professional can simply suggest placing <meta name="robots" content="noindex"> on the category and pagination URLs. Search engines will then leave the duplicate content URLs (categories/pagination) out of the index but can still follow their links and index the product URLs or other important inner content and posts (with the exception of the Bing search engine; see results above).
Completely preventing search engines from indexing a particular page with sensitive content. Now that we know that URLs blocked with robots.txt can still appear in search engine results, the better method is to place <meta name="robots" content="noindex, nofollow"> on any URL you never want indexed by search engines. For the tag to be seen, the page must not also be blocked in robots.txt, because a bot has to crawl the page to read it. Note too that search engines will not follow the links on that page, so if important, indexable URLs deeper in the site’s structure are reachable only through it, search engines may never crawl and index them.
Saving bandwidth consumed by search engine bots on unimportant URLs. The best approach for this is to use robots.txt, because in this experiment none of the three main search engine bots (Google, Yahoo and MSN) crawled the URL blocked by robots.txt. Bear in mind that Google can still index the blocked URLs themselves, as well as URLs linked from the blocked pages, so if you have sensitive data, this may concern you.