These are no simple questions. For example, do other search engines behave the same way as Google when they crawl a page tagged with <meta name="robots" content="noindex, nofollow">? In other words, do they not index it? And does Googlebot follow a hyperlink on a page blocked by robots.txt?
These are just some of the teaser questions that seem to need only “common sense” to answer, yet the common-sense answer may not hold in reality. Expert opinions push professionals to think like philosophers, so if someone asked “Is there any chance of indexing a page blocked by robots.txt?” the common-sense answer is no. But is this a fact? Is it true in all scenarios? Sometimes a fact is hard to believe, yet it remains a fact and not just an opinion. Facts come from actual testing.
Another difficult question that common sense can answer, though perhaps not correctly, is this: “Does Googlebot follow a link on a <META NAME="ROBOTS" CONTENT="NOINDEX"> page?” The common-sense answer is no, since the page will not be indexed by Google. Again, this has not been thoroughly tested to determine whether it is true in all cases, especially with different search engine bots.
The objective of this two-part tutorial is to show you how search engine bots currently behave under different meta robots tag and robots.txt conditions. These two are powerful tools for SEO and for web content security and privacy. They are used to prevent duplicate content, to hint at how search engine bots should crawl pages, and to keep bots from crawling sensitive pages that are not suitable for indexing or for display in the search engine results pages.
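To make the meta robots tag concrete, here is a minimal Python sketch that extracts the directives from a page's HTML. It is only an illustration of how the tag is structured, not a model of how any search engine actually processes it; the class and function names (MetaRobotsParser, robots_directives) are made up for this example.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        # The content attribute is a comma-separated, case-insensitive list.
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of meta robots directives found in an HTML string."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head></html>'
directives = robots_directives(page)
print("noindex requested:", "noindex" in directives)
print("nofollow requested:", "nofollow" in directives)
```

A check like this only confirms what a page asks for. Whether each bot honors the request is exactly what the experiment is meant to find out.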
The approach to finding the answers is to conduct a controlled experiment to test the behavior of search engine bots.
The study aims to find factual answers to the following questions for the three main search engines, namely Google, Yahoo and Bing:
Illustration 1: Page 1 (Home page) links, using a rel="nofollow" attribute, to Page 2 (Crawlable and indexable page)
Question 1: Do the three main search engine bots refrain from crawling and indexing a page (“Page 2” as illustrated above) referenced by a “rel=nofollow” attribute hyperlink?
Illustration 2: Page 1 (Home page) -> link to Page 2 (Tagged with <meta name="robots" content="noindex">) -> link to Page 3 (Crawlable and indexable page)
Question 2: Do the main three search engine bots crawl links on a page which includes a <meta name="robots" content="noindex"> tag and end up indexing “Page 3”?
Question 3: Using illustration 2 above, do all main search engines crawl and index Page 2?
Illustration 3: Page 1 (Home page) -> link to Page 2 (Tagged with <meta name="robots" content="noindex, nofollow">) -> link to Page 3 (Crawlable and indexable page)
Question 4: Do the main search engine bots crawl links on “Page 2,” which includes a <meta name="robots" content="noindex, nofollow"> tag? This is similar to the above question, but includes a “nofollow” in the tag.
Question 5: Using illustration 3 above, do all main search engines crawl and index “Page 3”?
Illustration 4: Page 1 (Home page) -> link to Page 2 (This page is blocked by robots.txt) -> link to Page 3 (Crawlable and indexable page)
Question 6: Using illustration 4 above, do any search engine bots crawl and index “Page 2”?
Question 7: Using illustration 4 above, do any search engine bots crawl and index “Page 3”?
The host domain where the experiment is conducted is http://www.php-developer.org/ ; this domain is crawled frequently by Google, at an average of 16 pages per day. A high crawl rate is desirable since results can be obtained in a shorter time frame.
To maximize the chances that search engine bots crawl and index the test pages, it is highly recommended to place the links in the site's consistent navigation menu, starting at the home page.
To answer the seven questions above, test pages (using the .php extension for the reasons discussed below) need to be created with the following setup:
Implementing the above test pages requires a special tracking system. The script used to detect the search engine bots' visits is CrawlTrack, an open source web analytics script. The tracking code must be embedded in every test page, and it requires that the pages use the .php extension for easy integration.
Also, to avoid being falsely flagged as spam by search engines, the test pages should be filled with useful content that educates readers and gives a short background on the experiment. The content is unique, and the test pages also use unique and accurate titles. This should help search engine bots treat them as authentic and important URLs worth crawling and indexing.
We also must not allow links to these pages from other domains, nor include the URLs in the sitemap (both the text/HTML and XML versions). This ensures that the pages are completely independent of other crawling and indexing factors except the single navigational link (see the screenshot below; it starts at the home page and appears in the navigation menu across the entire website). As designed in this experiment, that link is the only thing prompting search engine bots to visit the test pages, which eliminates bias.
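One way to double-check the sitemap condition is to scan the XML sitemap for any of the test URLs. The sketch below assumes a standard sitemaps.org XML file. The sample sitemap content is invented for illustration, and the test URLs are the ones listed later in this article.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemaps (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Test page URLs taken from this article.
TEST_URLS = {
    "http://www.php-developer.org/Linkrelnofollow.php",
    "http://www.php-developer.org/noindexexperiment.php",
    "http://www.php-developer.org/noindexnofollow.php",
    "http://www.php-developer.org/blockedbyrobots.php",
}

# Hypothetical sitemap content, NOT the site's real sitemap.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.php-developer.org/</loc></url>
  <url><loc>http://www.php-developer.org/noindexexperiment.php</loc></url>
</urlset>
"""


def sitemap_urls(xml_text):
    """Return every <loc> URL found in an XML sitemap string."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def leaked_test_urls(xml_text, test_urls):
    """Return the test URLs that appear in the sitemap (should be empty)."""
    return sitemap_urls(xml_text) & set(test_urls)


print("Test URLs found in sitemap:", leaked_test_urls(SAMPLE_SITEMAP, TEST_URLS))
```

An empty result means the sitemap does not give bots a second route to the test pages. The invented sample deliberately contains one test URL to show what a failure looks like.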
Within the red box below are the navigational links found on all pages of http://www.php-developer.org/ pointing to the test pages used in this experiment. There are four hyperlinks, using the following URLs and anchor text:
The first hyperlink’s scope is to answer our first question.
Anchor text used: Link rel nofollow Experiment
Target URL: http://www.php-developer.org/Linkrelnofollow.php
The second hyperlink’s scope is to answer our second and third questions.
Anchor text used: Meta robots noindex tag
Target URL: http://www.php-developer.org/noindexexperiment.php
Inner hyperlink anchor text: another page
Target URL (inner second hyperlink): http://www.php-developer.org/noindexlinktarget.php
The third hyperlink’s scope is to answer our fourth and fifth questions.
Anchor text used: Noindex Nofollow tag
Target URL: http://www.php-developer.org/noindexnofollow.php
Inner hyperlink anchor text: page
Target URL (inner second hyperlink): http://www.php-developer.org/noindexnofollowtarget.php
The fourth hyperlink’s scope is to answer our sixth and seventh questions.
Anchor text used: Blocked by robots
Target URL: http://www.php-developer.org/blockedbyrobots.php
Inner hyperlink anchor text: page
Target URL (inner second hyperlink): http://www.php-developer.org/blockedrobotslink.php
For the fourth hyperlink, before the test pages were uploaded to the test server, the robots.txt file was written and thoroughly tested using the robots.txt analysis tool in Google Webmaster Tools. Only once the page was confirmed to be blocked was it uploaded; this prevents search engine bots from accidentally crawling or indexing the fourth test page because of incorrect robots.txt syntax.
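A similar syntax check can be run locally with Python's standard urllib.robotparser module. This is a sketch only: the article does not show the actual robots.txt file, so the rule below is an assumed example of what blocking the fourth test page might look like, and the helper name is_blocked is made up.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: hypothetical robots.txt content; the real file is not shown
# in the article.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /blockedbyrobots.php
"""

BLOCKED_PAGE = "http://www.php-developer.org/blockedbyrobots.php"
LINKED_PAGE = "http://www.php-developer.org/blockedrobotslink.php"


def is_blocked(robots_txt, user_agent, url):
    """Return True if the given robots.txt text disallows url for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# The user agent tokens here are illustrative; the rule applies to all bots.
for agent in ("Googlebot", "Slurp", "bingbot"):
    print(
        agent,
        "| test page blocked:", is_blocked(ASSUMED_ROBOTS_TXT, agent, BLOCKED_PAGE),
        "| linked page blocked:", is_blocked(ASSUMED_ROBOTS_TXT, agent, LINKED_PAGE),
    )
```

This confirms only that the rule parses as intended and covers the right URL. It says nothing about how real crawlers behave, which is what the tracking data is for.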
Now that the test pages are completely set up, the experiment needs to run for more than a month to fully capture the results. Even though the site is crawled frequently, we need to allow extra time for slower search engine crawlers.
To obtain the correct data, we need to define the difference between crawling and indexing up front. Crawling is when search engines actually visit the page to fetch content; this will be detected by the CrawlTrack tracking script embedded in the pages.
Different search engine crawlers use different user agent names, which can be differentiated easily in the report. It is important to note the correct user agent in order to gather the correct crawling information. Here is a list of actual user agent names of the main search engine bots:
Google search engine: http://www.google.com
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Yahoo search engine: http://www.yahoo.com
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)
Bing search engine: http://www.bing.com/
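Telling the bots apart in a visit log comes down to matching substrings of the user agent. The following Python sketch illustrates the idea; it is not how CrawlTrack itself works. The Google and Yahoo tokens come from the strings listed above. The Bing tokens ("bingbot" and the older "msnbot") are not given in this article and are added here as an assumption.

```python
def classify_bot(user_agent):
    """Map a raw User-Agent string to a search engine name, or None."""
    ua = user_agent.lower()
    if "googlebot" in ua:
        return "Google"
    if "yahoo! slurp" in ua:
        return "Yahoo"
    # ASSUMPTION: Bing tokens are not listed in the article.
    if "bingbot" in ua or "msnbot" in ua:
        return "Bing"
    return None


sample_agents = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)",
    "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)",
]

for agent in sample_agents:
    print(classify_bot(agent), "<-", agent)
```

User agent strings can be spoofed, so a match is a strong hint rather than proof that the visit came from the named search engine.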
Indexing is when the crawled pages (pages fetched during the crawling process) are actually placed in the search engine index, ready to be shown any time on the search engine result pages when a relevant query matches that document.
In part two, we will present the results of this test.