Search Engines and Your Right to Privacy - What It Means to Index Everything
(Page 2 of 5)
What we also know from this is that Google indexes indiscriminately. Sure, some sites have robots.txt files designed to turn away every robot that comes their way, but that percentage is small. In a world where internet commerce is largely ad-driven and everyone is trying to make a buck, disallowing Google, MSN Search, or Yahoo isn’t very helpful. Heck, most of us are trying to optimize our sites for these engines to maximize the number of people reading them. In fact, some might even argue that the percentage of the internet that is not indexed is shrinking as the internet grows. Also, a large number of publishers either do not know how to prevent their sites from being crawled by bots or simply do not care. Anything that these bots do see is indexed. It doesn’t matter if the content is relevant, factual, or even correct. If I wrote a page explaining why pi should equal 3, the search engines would pick it up. If enough people linked to it, it might make the first page of Google searches about pi.
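To make the robots.txt point concrete, here is a minimal sketch of how a well-behaved crawler might decide whether it is welcome, using Python's standard urllib.robotparser module. The two robots.txt bodies, the example.com URLs, the /private/ path, and the helper name allowed are assumptions made up for illustration; this says nothing about how any particular search engine implements the check.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that asks every crawler to stay away from the whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# A robots.txt that only fences off one directory (hypothetical path).
BLOCK_PRIVATE = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in memory, no network access
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(allowed(BLOCK_ALL, "Googlebot", "http://example.com/index.html"))
    print(allowed(BLOCK_PRIVATE, "Googlebot", "http://example.com/index.html"))
    print(allowed(BLOCK_PRIVATE, "Googlebot", "http://example.com/private/notes.html"))
```

Note that robots.txt is only a request. It keeps out crawlers that choose to honor it, and a publisher who never writes one is, in effect, inviting every bot in.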
As a result, one’s “online presence” is like a credit rating; it’s impossible to really know what it is until you ask. Unless someone points out that there is a page of slanderous and fictional content on the internet aimed at you, you are unlikely to find it.
Therein lies one of the major problems with the search engines: accuracy. If people are going to research others by putting a name into a search engine, there’s no way for the engine to know whether a given site’s content is accurate. This is not to be confused with the engine’s ability to verify that a site contained some piece of information at a particular time.
For example, the Internet Archive’s Wayback Machine has been used in many court cases in the US, as described by this Wall Street Journal article. In fact, in one case between a Polish TV channel provider and EchoStar, representatives from the Internet Archive signed affidavits stating that the content of the archive is accurate to the best of their knowledge. Playboy also regularly uses the Wayback Machine to check for copyright infringement. However, what has not been publicly tested is how the robots handle HTTP 301 (Moved Permanently), HTTP 302 (Found), and HTTP 307 (Temporary Redirect) responses, or other server-side settings designed to automatically redirect a bot or a browser to a different web page. In theory, someone could put up a site called http://georgebushrules.com and redirect it to pornographic content, and the archive might not know. This is why the affidavits include the clause “to the best of their knowledge”: there is no way for a human to verify all of the data that the Internet Archive’s bot is pulling in.
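To illustrate the redirect problem, here is a rough sketch of a crawler that records every hop in a redirect chain instead of silently attributing the final content to the original address. Everything here is hypothetical: FAKE_WEB is a canned stand-in for real HTTP responses, the example.* URLs are placeholders, and follow is an invented helper. It is not a description of how the Internet Archive's bot or any search engine actually behaves.

```python
from http import HTTPStatus

# The three redirect codes discussed above.
REDIRECTS = {
    HTTPStatus.MOVED_PERMANENTLY,   # 301
    HTTPStatus.FOUND,               # 302
    HTTPStatus.TEMPORARY_REDIRECT,  # 307
}

# Hypothetical canned responses: URL -> (status code, Location header or None).
FAKE_WEB = {
    "http://example.com/": (301, "http://example.org/landing"),
    "http://example.org/landing": (302, "http://example.net/final"),
    "http://example.net/final": (200, None),
}


def follow(url, web, max_hops=5):
    """Return the list of (url, status) pairs visited, following at most max_hops redirects."""
    chain = []
    for _ in range(max_hops + 1):
        status, location = web[url]
        chain.append((url, status))
        if status not in REDIRECTS or location is None:
            return chain  # reached a page that does not redirect
        url = location
    raise RuntimeError("too many redirects")


if __name__ == "__main__":
    for hop_url, hop_status in follow("http://example.com/", FAKE_WEB):
        print(hop_status, hop_url)
```

An archive that stores only the last entry of such a chain under the first URL would have an accurate copy of what was served, but no record that the content actually lived somewhere else.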
More By Quantum Skyline