Advanced Use of Robots.txt - Meta Tag Instructions and Bandwidth
Meta Tag Instructions
Thousands of search engine robots crawl the web, and there is no practical way to list them all along with their capabilities and shortcomings. Many of the lesser-known robots don’t even attempt to read your robots.txt. So what do you do then? Many webmasters find it handy to place a few instructions for robots directly in their meta tags. These tags go in the <head> section like any other meta tags.
<meta name="robots" content="noindex">
This meta tag tells the robot not to index this page.
<meta name="robots" content="noindex,nofollow">
This tag tells a robot that it should neither index this document nor follow the links it contains.
Other combinations you might find useful are listed here; the last one, all, is shorthand for index,follow:
<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="all">
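To make these combinations concrete, here is a rough sketch of how a well-behaved robot might read them, written in Python with the standard library's html.parser. The class and function names (RobotsMetaParser, robots_meta_policy) are hypothetical and chosen only for illustration; real robots differ in the details. The sketch assumes that indexing and following are the defaults and that only the "no" forms restrict them.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # The content attribute is a comma-separated list of directives.
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_meta_policy(html):
    """Return (may_index, may_follow) for a page's HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    # Indexing and following are the defaults; only the "no" forms restrict.
    may_index = "noindex" not in parser.directives
    may_follow = "nofollow" not in parser.directives
    return may_index, may_follow


page = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
print(robots_meta_policy(page))  # (False, True)
```

A page with no robots meta tag at all, or with content="all", comes back as (True, True) under these assumptions.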
Unfortunately, there is no way to guarantee that these less-than-polite robots will follow the instructions in your meta tags any more than they will follow your robots.txt. In these extreme cases, it is worth checking your server logs, finding the IP address of the offending robot, and simply banning it.
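Finding the offending robot in your logs can be partly automated. The following Python sketch counts, per IP address, the requests made to paths that your robots.txt disallows. It assumes your server writes the Common or Combined Log Format; the function name offending_ips, the /private/ path, and the sample log lines are all made up for illustration, so adjust the pattern to whatever your server actually logs.

```python
import re
from collections import Counter

# Matches the start of a Common/Combined Log Format line:
# client IP, identity, user, [timestamp], "METHOD path ..."
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD|POST) (\S+)')


def offending_ips(log_lines, disallowed_prefixes):
    """Count requests per IP to paths that robots.txt disallows."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        ip, path = match.groups()
        if any(path.startswith(prefix) for prefix in disallowed_prefixes):
            hits[ip] += 1
    return hits


# Hypothetical log excerpt, using documentation-only IP addresses.
sample_log = [
    '203.0.113.7 - - [10/Oct/2008:13:55:36 -0700] "GET /private/a.html HTTP/1.1" 200 2326 "-" "BadBot/1.0"',
    '203.0.113.7 - - [10/Oct/2008:13:55:37 -0700] "GET /private/b.html HTTP/1.1" 200 1834 "-" "BadBot/1.0"',
    '198.51.100.2 - - [10/Oct/2008:13:56:01 -0700] "GET /index.html HTTP/1.1" 200 5120 "-" "Googlebot/2.1"',
]

print(offending_ips(sample_log, ["/private/"]).most_common())
# [('203.0.113.7', 2)]
```

An address that shows up here repeatedly is a candidate for a ban, though it is worth confirming that it belongs to a robot and not to a human visitor following a link.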
Bandwidth Limitations
Another complaint about letting search engine spiders crawl uninstructed concerns bandwidth. A search engine spider could easily eat up a gigabyte of bandwidth in a single crawl. For those of you paying for a fixed amount of bandwidth, this could be a big, and expensive, problem.
Even if you have no robots.txt file, search engine spiders will request it anyway, and each request results in a 404 error. If you have a custom 404 Page Not Found error page, the server sends that page every time, which wastes bandwidth. A robots.txt file is small, so having one typically uses less bandwidth than not having one. The Crawl-delay directive can also help with bandwidth, but it is a non-standard extension that only some robots honor; Googlebot, for one, ignores it.
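As an illustration of the crawl-delay directive, the sketch below embeds a small robots.txt and reads it the way a cooperative robot might, using Python's standard urllib.robotparser module. The 10-second delay, the /cgi-bin/ path, and the robot name ExampleBot are assumptions made up for this example, and the directive only has an effect on robots that choose to support it.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt asking every robot to wait 10 seconds
# between requests and to stay out of /cgi-bin/.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A cooperative robot would sleep this many seconds between requests.
delay = parser.crawl_delay("ExampleBot")
print(delay)  # 10

print(parser.can_fetch("ExampleBot", "/cgi-bin/search"))  # False
print(parser.can_fetch("ExampleBot", "/index.html"))      # True
```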
Some webmasters believe that another good way to keep a search engine spider from using too much bandwidth is the revisit-after tag. However, many consider this a myth.
<meta name="revisit-after" content="15 days">
Most search engine robots, including Googlebot, do not honor this tag. If you feel that Googlebot is crawling too frequently and using too much bandwidth, you can visit Google’s help pages and fill out a form requesting that Googlebot crawl your site less often.
You can also block all robots except the ones you specify, as well as provide different sets of instructions for different robots. The robots.txt file is very flexible in this way.
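A minimal sketch of that flexibility follows. The embedded robots.txt gives Googlebot one set of rules, gives a second, hypothetical robot called ExampleBot a stricter set, and shuts out everyone else. The paths /private/ and /images/ are invented for the example, and Python's standard urllib.robotparser is used only to check how a compliant robot would interpret the file.

```python
from urllib.robotparser import RobotFileParser

# Each named robot gets its own record; the final "*" record
# blocks every robot that was not named above.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: ExampleBot
Disallow: /private/
Disallow: /images/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("Googlebot", "/index.html"))       # True
print(parser.can_fetch("Googlebot", "/private/a.html"))   # False
print(parser.can_fetch("ExampleBot", "/images/logo.png")) # False
print(parser.can_fetch("OtherBot", "/index.html"))        # False
```

As with everything in robots.txt, this only restrains robots that read the file and choose to obey it.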
By Jennifer Sullivan Cassidy