Using the X-Robots-Tag HTTP Header Specifications in SEO: Tips and Tricks

The X-Robots-Tag is a powerful tool for controlling search engine bots. It differs from the more popular solutions, robots.txt and the meta robots tag: the X-Robots-Tag offers greater flexibility and more complete control over how search engine bots crawl and index your website's content.

Surprisingly, it is not widely used by SEOs and webmasters; many, especially beginners, are still unfamiliar with it. This is a beginner's tutorial on the X-Robots-Tag. By the end, you will understand how it works and know how to use it.

Comparing the X-robots-tag with robots.txt and the meta robots tag

You might already have observed that using robots.txt and the meta robots tag means dealing with some limitations. The table in the screen shot below summarizes the limitations of robots.txt and the meta robots tag.

In the first comparison, it is true that you can use a robots.txt file to block the search engine bots (such as Googlebot) from crawling and indexing any type of content on your website. This content can be web pages (.html), images, audio/video, or documents.

However, the meta robots tag cannot implement this blocking for all content types, since images, audio/video, and documents (a Word file, for example) are not HTML-based and have no <head> section in which to place the tag.
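For reference, the meta robots tag only works when placed inside the <head> of an HTML page, for example:

<meta name="robots" content="noindex, nofollow">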

In the second comparison, you might find it necessary to have some of your content not be cached by search engines. If you successfully block the pages or content using robots.txt, then a “cache” link will not be found in search results for that content.

Bear in mind, though, that it will still appear in the search engine results in the form of a “reference link.” Here is a screen shot illustrating that point:

The above content is blocked with robots.txt using the following syntax:

User-agent: *
Disallow: /wp-
Disallow: /2009/

However, for HTML pages, you can use the meta robots tag to prevent the search engines from caching the page in the search results. This can be done by using the noarchive tag in the <head> section, for example:

<meta name="googlebot" content="noarchive">

In the last comparison, neither robots.txt nor the meta robots tag can be used to prevent your website's sitemap.xml and robots.txt files themselves from being indexed and cached by search engines.

This is because robots.txt cannot block itself. If you deny access to it using .htaccess directives, then search engines will also be blocked from accessing your entire website. The same is true of sitemap.xml: if you block sitemap.xml in robots.txt, Googlebot will not be able to fetch the contents of your sitemap, which defeats the purpose of having one.

An error will then appear in Google Webmaster Tools, telling you that sitemap.xml cannot be accessed by Googlebot because it is blocked by robots.txt.

The X-Robots-Tag is the answer to the limitations of robots.txt and the meta robots tag. With it, you can:

  • Prevent search engine bots from indexing any type of content, whether it is HTML, images, audio/video, or documents.

  • Prevent search engines from caching any type of content, whether it is HTML, images, audio/video, or documents.

  • Prevent search engines from indexing or caching robots.txt and sitemap.xml.

There are many other applications for the X-Robots-Tag. Some of them will be covered in this tutorial.

If you are an SEO newbie or someone not familiar with this protocol, then you might find it hard to understand how the X-Robots-Tag works.

The X-Robots-Tag can be found in the HTTP response headers of any of your website's content. Okay, so you might ask: what is an "HTTP response header"?

To make this easy to understand, you need to be familiar with how HTTP requests and response headers work:

Step 1. Suppose you are browsing the web and visit a particular website (for example, you plan to visit the home page content of website X).

Step 2. You then type the home page URL for website X into the browser address bar.

Step 3. After you press enter, your browser makes an “HTTP Request” to website X’s server.

Step 4. Website X's server returns "response headers" that contain information pertaining to the content being requested. The X-Robots-Tag can be included in these response headers.
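As a simplified illustration, the request and response might look like this (the URL and most header values here are placeholders):

GET / HTTP/1.1
Host: www.example.com

HTTP/1.1 200 OK
Content-Type: text/html
X-Robots-Tag: noarchive

(...page content follows...)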

To see the HTTP request and response headers, you can use Firebug: http://getfirebug.com/
Install it in your Firefox browser. Now suppose that you want to visit the home page URL of www.webmasterworld.com. Enter this URL in the browser address bar and hit enter.

Now go to View -> Firebug. You should see the Firebug console appear in the lower portion of your browser. Click the "Net" tab in the Firebug console. You will see that it's still empty, because no requests or responses have been captured yet.

Now reload the page. You should see a lot of activity under the "Net" console. Expand "GET www.webmasterworld.com," and you should see this result (screen shot):

In the response headers above, you can see that the home page of webmasterworld.com is using:

X-Robots-Tag: noarchive

This tells search engines not to cache the page in the search engine results. If you want to confirm this, note that Google does not show a cache link for Webmaster World's home page: http://bit.ly/fRnQAV

A Basic Introduction to X-Robots-Tag Syntax

Just like robots.txt, the X-Robots-Tag has its own syntax that you need to observe to implement it correctly.

One of the best resources for the complete technical syntax and directives of the X-Robots-Tag can be found here: http://code.google.com/web/controlcrawlindex/docs/robots_meta_tag.html

Look under the "Using the X-Robots-Tag HTTP header" section. Below is a short summary of the most important directives from the examples on that page:

1. Preventing search engines from indexing the page

X-Robots-Tag: noindex

2. Preventing search engines from caching the page

X-Robots-Tag: noarchive

3. Preventing ONLY Googlebot from following the page's links

X-Robots-Tag: googlebot: nofollow

4. Instructing Googlebot not to index the page or follow its links:

X-Robots-Tag: googlebot: noindex, nofollow
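The Google documentation linked above also covers additional directives. One example is unavailable_after, which tells Google to stop showing the page in search results after a specified date and time:

X-Robots-Tag: unavailable_after: 25 Jun 2010 15:00:00 PST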

X-Robots-Tag Implementation Tips

The X-Robots-Tag can be implemented using a server-side scripting language such as PHP, or via .htaccess on an Apache server.

First example: declaring the X-Robots-Tag using PHP to tell search engines not to index the page or follow its links.

<?php
// The header() call must appear before any HTML or other output is sent;
// otherwise PHP cannot modify the response headers and will raise a
// "headers already sent" warning.
header("X-Robots-Tag: noindex, nofollow", true);

// Your other PHP code here
?>

Suppose you have a PHP website, your search result pages use the search.php template, and you are concerned that search engines might index those pages.

You will need to edit your search.php template and add the X-Robots-Tag PHP code above; a conditional variant is sketched below. You can verify the result by checking the response headers shown in Firebug.
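A minimal conditional sketch, assuming your template receives the search term in a hypothetical s query parameter (adjust the name to match your site), so that only actual search requests get the header:

<?php
// Hypothetical sketch for search.php: send the header only when a search
// query is present; the parameter name 's' is an assumption.
if (isset($_GET['s']) && $_GET['s'] !== '') {
    header("X-Robots-Tag: noindex, nofollow", true);
}
?>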

Be careful, because a single mistake in the coding (just as with robots.txt) can cause your website not to be indexed by search engines.

Second example: using the X-Robots-Tag in .htaccess. If you use an Apache server, then you can declare the X-Robots-Tag in .htaccess.

Suppose you would like your website not to be cached by search engines (just like the webmasterworld.com example). Download your main .htaccess file (the one that resides in the root directory of your website). Now add the following line:

Header set X-Robots-Tag "noarchive"
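After uploading the edited .htaccess file, reload any page of your site and confirm in Firebug's Net panel that the X-Robots-Tag: noarchive response header now appears.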

Another useful implementation is preventing the caching of non-HTML content, such as PDF or MP3 files. You can do that in .htaccess. The example below prevents search engines from caching a PDF document:

<Files *.pdf>
  Header set X-Robots-Tag "noarchive"
</Files>
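If you want to cover several media types at once, Apache's FilesMatch directive accepts a regular expression; the extension list below is just an example:

<FilesMatch "\.(pdf|mp3|mp4)$">
  Header set X-Robots-Tag "noarchive"
</FilesMatch>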

To prevent robots.txt and your XML sitemap from appearing in search engine results, while still allowing Googlebot to fetch their contents:

<Files sitemap.xml>
Header set X-Robots-Tag "noindex"
</Files>
<Files robots.txt>
Header set X-Robots-Tag "noindex"
</Files>

The X-Robots-Tag can also be abused by spammers. For example, a spammer can trick a link exchange partner into believing that a link page is indexed and followed in Google when its links are actually nofollowed using the X-Robots-Tag directive:

<Files linkpage.php>
Header set X-Robots-Tag "nofollow"
</Files>

If you are still exchanging links and expect to get some link juice from it, always check the response headers of the link page to make sure it is not tagged with X-Robots-Tag "nofollow."

Of course it is easy to check that the link page is not nofollowed by:

  • Checking the head tag (for meta robots).

  • Checking the hyperlink for the presence of a rel="nofollow" attribute.

  • Finally, seeing that it is not blocked with robots.txt.

However, a page whose links are nofollowed using the X-Robots-Tag is almost impossible to detect unless you check the page's HTTP response headers.
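A small PHP sketch using get_headers() can automate that check; the URL below is a placeholder, and header names are lowercased because servers vary their casing:

<?php
// Fetch the response headers for the link page (placeholder URL), then
// normalize header names to lowercase before looking for the tag.
$headers = array_change_key_case(
    get_headers("http://www.example.com/linkpage.php", 1),
    CASE_LOWER
);

if (isset($headers['x-robots-tag'])) {
    // The value may be a string, or an array if the header was sent twice.
    echo "X-Robots-Tag found: " . print_r($headers['x-robots-tag'], true);
} else {
    echo "No X-Robots-Tag header present.";
}
?>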

Implementation Recommendations

First, keep in mind that robots.txt is enough for most bot-blocking applications. In fact, if you are not technically familiar with PHP and Apache directives such as those used in .htaccess, you might want to stick to robots.txt and the meta robots tag. Only use the X-Robots-Tag when you need it, for those things that cannot be done with robots.txt or other methods.

You can check the accuracy and correctness of robots.txt using Google Webmaster Tools, but you cannot use GWT to check the X-Robots-Tag. You need to be careful when implementing it on your website, especially sitewide, where it affects all content.

By limiting the use of the X-Robots-Tag to only what is really necessary, you decrease the chances of implementation errors that can be costly for your website.

Finally, bear in mind that using the X-Robots-Tag in .htaccess requires the Apache mod_headers module (which provides the Header directive) to be enabled on your server. You need to confirm this with your hosting provider, as not all hosts enable it.
