Advanced Use of Robots.txt - Using Robots.txt for Corporate Security
Using Robots.txt for Corporate Security
Some of you are familiar with a company called Perfect 10 and its security issues; others are not. Perfect 10 is an adult entertainment company that publishes copyrighted pictures of models. It filed a motion for a preliminary injunction against Google in August of 2005. According to BusinessWire.com, “The motion for preliminary injunction seeks to enjoin Google from copying, displaying, and distributing Perfect 10 copyrighted images. Perfect 10 filed a complaint against Google, Inc. for copyright infringement and other claims in November of 2004. It is Perfect 10's contention that Google is displaying hundreds of thousands of adult images, from the most tame to the most exceedingly explicit, to draw massive traffic to its web site, which it is converting into hundreds of millions of dollars of advertising revenue. Perfect 10 claims that under the guise of being a "search engine," Google is displaying, free of charge, thousands of copies of the best images from Perfect 10, Playboy, nude scenes from major movies, nude images of supermodels, as well as extremely explicit images of all kinds. Perfect 10 contends that it has sent 35 notices of infringement to Google covering over 6,500 infringing URLs, but that Google continues to display over 3,000 Perfect 10 copyrighted images without authorization.”
What is interesting in this situation is that the blame actually lies with Perfect 10, Inc. The company failed to direct the search engine to stay out of its image directory. Two simple lines in a robots.txt file on its web server would have told Google's image crawler to stay out of that directory in the first place, a practice that Google itself mentions in its guidelines for webmasters:
User-agent: Googlebot-Image
Disallow: /images
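One way to sanity-check rules like these before relying on them is to run them through a robots.txt parser. The following is a minimal sketch using Python's standard urllib.robotparser module. The www.example.com URLs and the is_allowed helper are assumptions made for illustration, and the results show how this parser reads the file, which is not necessarily how Google's crawlers behave.

```python
from urllib.robotparser import RobotFileParser

# The two rules from the article, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /images
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if this parser thinks user_agent may fetch url.

    Hypothetical helper for illustration; it parses the rules from a
    string instead of downloading them from a live server.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder host, not a real site from the article.
    checks = [
        ("Googlebot-Image", "http://www.example.com/images/photo.jpg"),
        ("Googlebot-Image", "http://www.example.com/index.html"),
        ("Googlebot", "http://www.example.com/images/photo.jpg"),
    ]
    for agent, url in checks:
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, url) else "blocked"
        print(f"{agent:16} {url} -> {verdict}")
```

Note that the rule is a path prefix: this parser also treats a path such as /images-archive/ as blocked for Googlebot-Image, because it begins with /images. The third check also shows that these two lines address only the image crawler; a robot that identifies itself differently is not covered by them.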
One good piece of advice given in an SEO forum is this: “If you want to keep something private on the web, .htaccess and passwords are your friends. If you want to keep something out of Google (or any other search engine), robots.txt and meta tags are your friends. If someone can type a URL into a browser and find your page, don't count on a secret URL remaining secret. Use passwords or robots.txt to protect data.”
Using robots.txt to keep search engines out of sensitive areas is a simple task, and a step that every webmaster should take. Search engines have been known to index members-only areas, development documents, and even employee personnel records. It is the responsibility of the webmaster to ensure the protection of their sensitive data and copyrighted material. A search engine spider cannot be expected to know the difference between copyrighted material and other data, especially when the search engine itself documents an easy way to deter this type of behavior. An unwanted listing in the search results is one of the many consequences a webmaster will face if they do not make use of their robots.txt file.
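As a rough illustration of the step described above, the sketch below builds a robots.txt that asks all robots to stay out of a few sensitive directories and then checks the result with Python's standard urllib.robotparser module. The directory names (/members/, /dev-docs/, /personnel/), the build_robots_txt helper, and the "AnyBot" agent name are hypothetical and chosen only to mirror the examples in the text. In line with the forum advice quoted earlier, material that must stay private still needs passwords, since robots.txt only asks well-behaved robots to stay away.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(disallowed_paths, user_agent="*"):
    """Build robots.txt text with one Disallow line per path.

    Hypothetical helper for illustration; user_agent "*" addresses
    every robot that honors robots.txt.
    """
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in disallowed_paths]
    return "\n".join(lines) + "\n"


# Assumed directory names; substitute the sensitive areas of your own site.
SENSITIVE_PATHS = ["/members/", "/dev-docs/", "/personnel/"]

robots_txt = build_robots_txt(SENSITIVE_PATHS)

# Parse the generated text to confirm it says what was intended.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

if __name__ == "__main__":
    print(robots_txt)
    for path in ["/members/profile.html", "/personnel/records.html", "/about.html"]:
        verdict = "allowed" if parser.can_fetch("AnyBot", path) else "blocked"
        print(f"{path} -> {verdict}")
```

The generated text would be saved as robots.txt in the root directory of the web server, which is where robots look for it.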
Between Clint’s article and this one, I hope you understand the importance of using a robots.txt file on your web server. Ultimately, it’s up to you to help control how search engine robots behave when they spider your site’s pages. Using robots.txt is easy, and there is no excuse for lax security, wasted spider bandwidth, or pages that fail to get indexed because you skipped this simple step. If you need help generating a robots.txt file, many websites give step-by-step instructions or can even generate the file for you. With this powerful tool at your disposal, you should make use of it. It’s your own fault if you don’t.