Search Engines and Your Right to Privacy

Some recent events call into question the use of search engines to acquire information about people.  In the great war of search engines and algorithms, index sizes and ad revenue, the major search engines are after anything and everything they can get their hands on.  This should come as no surprise to anyone.  In order to show users the best and brightest of the internet, the search engines have to have an intimate knowledge of the entire internet.  In the process of collecting all that information, however, the engines come across some pages which probably shouldn’t be indexed.

Note that I said “probably” here.  As with anything that is a judgment call, what one person considers acceptable may be morally reprehensible to another.  It’s the shades of grey that make this harder to analyze.

CNET proved this when it published an article containing a good deal of personal information about Google CEO Eric Schmidt: his net worth, his appearance, where he lives, and his hobbies.  The piece was designed to argue that Google needs to take privacy issues seriously, but it had a chilling effect of its own, since Google refused media contact with CNET for about two months.

The CNET fiasco did not show us anything new.  In fact, it showed two things most of us take for granted.  We all know stories of people Googling themselves and marveling at the results.  Searching my name produces all kinds of interesting results, from doctors to actors to hockey players, none of whom are actually me.  There are also plenty of stories where background checks for employment or tenancy involve that exact same procedure, especially if an interviewer asks for applicants’ screen names or favorite websites.  Most of this information is found on websites with no malicious intent, such as sports magazines reporting on college football or the list of speakers at a conference.  In essence, it shows that the information is out there; it just needs to be sorted.  In a manner of speaking, we’re all private eyes.

What we also know from this is that Google indexes indiscriminately.  Sure, some sites have robots.txt files designed to turn away every robot that comes their way (see the sketch below), but they are a small percentage of the whole.  In a world where internet commerce is largely ad-driven and everyone is trying to make a buck, disallowing Google, MSN Search, or Yahoo isn’t very helpful.  Heck, most of us are trying to optimize our sites for these engines to maximize the number of people reading them.  Some might even argue that the unindexed percentage of the internet is shrinking even as the internet grows.  There are also a large number of publishers who do not know how to prevent their sites from being crawled by bots, or who simply do not care.  Anything these bots see is indexed, whether or not the content is relevant, factual, or even correct.  If I wrote a page explaining why pi should equal 3, the search engines would pick it up.  If enough people linked to it, it might make the first page of Google results for searches about pi.
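For the curious, turning bots away is a one-file affair.  Below is a minimal robots.txt sketch (the file lives at the root of a site, e.g. http://example.com/robots.txt, and the paths here are hypothetical); note that it only asks politely, since nothing in the protocol forces a crawler to obey:

    # Ask every crawler to stay out of the entire site.
    User-agent: *
    Disallow: /

    # Or turn away a single crawler by name and leave the rest alone.
    User-agent: Googlebot
    Disallow: /private/

The major engines’ bots honor these rules, but compliance is entirely voluntary.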

As a result, one’s “online presence” is like a credit rating: it’s impossible to really know what it is until someone checks.  Unless someone points out that there is a page of slanderous, fictional content about you on the internet, you are unlikely to find it yourself.

Therein lies one of the major problems with the search engines: accuracy.  If people are going to research others by putting a name into a search engine, there is no way for the engine to know whether a site’s content is accurate.  This is not to be confused with the engine’s ability to verify that a site contained certain information at a particular time; that, at least, can be done.

For example, the Internet Archive’s Wayback Machine has been used in many court cases in the US, as described by this Wall Street Journal article.  In one case between a Polish TV channel provider and EchoStar, representatives from the Internet Archive signed affidavits stating that the content of the archive is accurate to the best of their knowledge.  Playboy is also a regular user of the Wayback Machine, checking for copyright infringement.  What has not been publicly tested, however, is how the robots handle HTTP 301 (Moved Permanently), HTTP 302 (Found), and HTTP 307 (Temporary Redirect) responses, or other server-side settings designed to automatically redirect a bot or a browser to a different web page.  In theory, someone could put up a site called http://georgebushrules.com, redirect it to pornographic content, and the archive might never know.  This is why the affidavits include the clause “to the best of their knowledge”: there is no way for a human to verify all of the data that the Internet Archive’s bot is pulling in.
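To illustrate, here is a minimal Python sketch of what a crawler actually sees when it requests a redirected page (the domain is the article’s hypothetical one, and the User-Agent string is made up).  The raw response carries only a status code and a Location header; nothing tells the bot whether the destination still matches the content it originally archived:

    import http.client

    # Request the page without following redirects; unlike a browser,
    # http.client never follows them automatically.
    conn = http.client.HTTPConnection("georgebushrules.com")
    conn.request("GET", "/", headers={"User-Agent": "example-archive-bot"})
    resp = conn.getresponse()

    # 301 = Moved Permanently, 302 = Found, 307 = Temporary Redirect
    if resp.status in (301, 302, 307):
        print(resp.status, "redirects to", resp.getheader("Location"))
        # The bot must now decide: archive the redirect itself, follow
        # it and archive the target, or skip the page entirely.
    else:
        print(resp.status, "with", len(resp.read()), "bytes of content")

Each of those choices produces a different archive, and none of them can be verified by humans at the scale the Internet Archive operates at.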

On the other hand, if you changed a site from a Satanic worship forum into a cute kitten photography community, the Wayback Machine would record the change, and a lawyer could hold you responsible for the content of either version of your site.  Suing the Internet Archive for illegally making a copy of your site is hard, too: as a member of the American Library Association, it is exempt from some sections of US copyright law.  That means your Satanic worship forum is going to stay in the archive for a long time.

The accuracy of a site’s content, however, is a much trickier problem.  Suppose you found a blog entry that said very nasty or very private things about you; there may not be much recourse.  You could have a lawyer send a cease and desist letter to the author, hoping it scares them into removing the offending comment.  If that does not work, then depending on the circumstances of what was written, a libel or slander suit could be launched.  This line of action runs out when the author of the page is not within the jurisdiction of the courts where the suit is filed.  Since the internet is a global community, there is a mess of conflicting laws on what is legal in any particular country, and you don’t need a passport to visit a Korean web site.

As a result, there are increasing calls for search engines to regulate what ends up in their indexes and archives.  The reasoning is that since the search engine is where people find private information about others, it is the most logical point at which to attack the problem.  So far, Google has refused such suggestions, and its position is unlikely to change in the near future.

Why?  Let’s go back to the CNET case with Eric Schmidt.

The CNET article linked to articles at Forbes and CNN.  Neither was written with the direct intent of exposing Schmidt’s dirty laundry.  The Forbes article was a profile of “Tech Titans.”  CNN’s article was written during the 2000 Presidential Election race, reporting that Schmidt, then at Novell, held a fundraiser for candidate Al Gore.  The event was a very public affair (Elton John performed there), so it was obvious to anyone who follows American politics and informative to those looking for news.  To filter out that kind of material, Google News would have to completely ignore political news outlets like CNN’s allpolitics.com (now redirecting to http://cnn.com/POLITICS/), for starters.

Forbes is not a political magazine by nature, but its article had something in it that Schmidt clearly didn’t like, given Google’s reaction to CNET.

However, if Google were to start filtering based on the subjects of those two linked articles, political affiliations and income, it would be incumbent on Google to filter out every other page in the index where similar information appears.  That process is simply not humanly possible.

As with the Wayback Machine, there are too many sites and too few human operators to ensure that every page clears a series of restrictions.  And if a human operator made a mistake and let something into the index that should not be there, Google would be opening itself up to litigation in this lawsuit-happy world.  All someone would have to do is prove that Google violated its own privacy policy.  Considering the amount of traffic Google gets, the potential readership of any piece of personal information is high, and a large payout for damages would be likely.

None of this counts doctors, lawyers, and other professionals and businesses whose home pages list addresses or phone numbers in order to drum up business and make them easy to contact.  In some cases it is obvious whether an address was deliberately published, but at other times a human operator would have to make a judgment call.  If the operator made the wrong decision, Google would be looking at another lawsuit.

Or what would a filtering process do with a page that sits behind an incorrectly written robots.txt but says “Come visit us at our office at 123 Drive Way in San Francisco”?
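As a hypothetical example of how easily that happens, every directive below was meant to keep crawlers away from the page, and a strict robots.txt parser ignores all three:

    User agent: *            # missing hyphen, so the record never matches
    Disallow:                # an empty Disallow actually permits everything
    Dissalow: /office.html   # misspelled directive, silently skipped

A human might guess what the publisher intended; an automated filter cannot.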

When it comes to indexing, it really is all or nothing.

CNET’s demonstration that you can “find” all kinds of information about someone is just a side effect of what Google wants to be: the best search engine on the planet.  Of course, we still have to worry about what the pages we find actually say, especially since relevance is not necessarily accuracy.  As with all things, context is important.  It is hard to say whether Google is interested in managing that context, but I’m sure someone at the Google offices is looking at it.

It’s a tragedy that we have to worry about this.  Hopefully, there will be a solution that doesn’t involve trampling on free speech or killing search engines with liability insurance.  Technological solutions to sociological problems rarely come easily.
