As always, let’s start with the facts. In late July, AOL posted data to an AOL research site. This data covered searches conducted in March, April, and May of 2006. It covered 20 million uncensored queries from about 658,000 users, or the equivalent of between one and two percent of the searches conducted through AOL during May. The users were chosen randomly and rendered anonymous by the simple expedient of associating an ID number with the searches instead of a name.
If you think that’s not enough to keep a searcher’s identity secret, you’re right, as you’ll see in a minute. For now, the point I want you to understand is that AOL did this intentionally, apparently to get recognition from the research community by putting up a data set that could be cited regularly in research papers. It was not an accident.
Here my sources stop agreeing with each other. Media Post Publications says that the queries were on a publicly viewable web site for about two weeks before bloggers noticed them during the weekend of August 6-7, which led to their removal. Another source says they were up for only one week. Yet a third source claims the material was available from AOL for only three hours on August 4. Whatever the truth might be, the result was the same: data that’s been released onto the Internet cannot be easily recalled.
By the time AOL yanked the data, the damage had already been done. A number of sites had downloaded the file and put a friendly interface on it to make it easily searchable. It didn’t matter that it was huge: 436 MB, or 2 GB unzipped. It was now out in the wild, causing the inevitable ripples.
It wasn’t just the bloggers who picked up on the release of data and took AOL to task for it (more on that in a moment). By August 7 AOL was in full retreat. It had yanked the data and issued an apology. Here is the bulk of the statement, from AOL spokesman Andrew Weinstein:
“This was a screw up, and we’re angry and upset about it. It was an innocent enough attempt to reach out to the academic community with new research tools, but it was obviously not appropriately vetted, and if it had been, it would have been stopped in an instant.
“Although there was no personally-identifiable data linked to these accounts, we’re absolutely not defending this. It was a mistake, and we apologize. We’ve launched an internal investigation into what happened, and we are taking steps to ensure that this type of thing never happens again.”
Weinstein then went on to explain the nature of what was released, saying that the data included “roughly 1/3 of one percent of the total searches conducted through the AOL network over that period.” That must have been cold comfort if you were among those whose searches were chosen.
Needless to say, this wasn’t the end of it. AOL CTO Maureen Govern chose to resign (possibly to avoid being fired). Two other employees were fired: Dr. Abdur Chowdhury, the researcher who posted the data, and one colleague. Dr. Chowdhury seemed to be almost criminally naïve as to how the data could be used, even after it had been “anonymized.” An article in the New York Times reported that Dr. Chowdhury revealed how horrified he was over the privacy violations while having lunch with two of his colleagues and a University of Washington professor of computer science. “He didn’t anticipate that this kind of data could be used to track down individuals,” explained Professor Oren Etzioni.
In an email to AOL employees, company CEO Jon Miller stated that a task force would be created to develop new best practices concerning privacy. It would also consider how long search and other data should be saved. Furthermore, the company would reexamine its restrictions on access to databases containing sensitive member data. “After the great lengths we’ve taken to build our members’ trust and be an industry leader on privacy, it was disheartening to see so much good work destroyed by a single act,” Miller stated in the email. “This incident took place because some employees did not exercise good judgment or review their proposal with our privacy team. We are taking appropriate action with the employees who were responsible.”
Dr. Chowdhury may not have anticipated that the AOL search data could be used to track down individuals after anonymization, but the media certainly did. Many bloggers pointed out that the searches included phone numbers, addresses, names, and possibly even Social Security numbers. Since many searches included local elements, it would be an easy matter to narrow a searcher down to a geographical location, and even an age and gender.
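To see why swapping a name for an ID number offers so little protection, consider a minimal sketch in Python. The log rows, the ID numbers, and the clue phrases below are all invented for illustration and are not drawn from the AOL file; the only point is that every query filed under one ID can be pooled into a single profile.

```python
from collections import defaultdict

# Hypothetical log rows: (anonymous user ID, query text). Invented data.
LOG = [
    (1001, "plumbers in springfield"),
    (1001, "numb fingers remedy"),
    (1001, "homes sold in maple ridge subdivision springfield"),
    (1001, "jane doe springfield"),
    (2002, "cheap flights"),
    (2002, "pizza recipes"),
]

# Hypothetical clue phrases that hint at a place or a person.
CLUES = {
    "springfield": "location",
    "maple ridge": "neighborhood",
    "jane doe": "name",
}


def build_profiles(log):
    """Pool every query under its ID, then collect the clues each pool leaks."""
    queries_by_id = defaultdict(list)
    for user_id, query in log:
        queries_by_id[user_id].append(query)

    profiles = {}
    for user_id, queries in queries_by_id.items():
        leaked = {}
        for query in queries:
            for phrase, kind in CLUES.items():
                if phrase in query:
                    leaked[kind] = phrase
        profiles[user_id] = leaked
    return profiles


profiles = build_profiles(LOG)
for user_id, leaked in profiles.items():
    print(user_id, leaked)
```

No single query in this made-up log names the searcher, but the pooled profile for ID 1001 combines a town, a neighborhood, and a name, which is the kind of narrowing the bloggers described.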
Two publications went one better: they proved that you could trace a person from their searches. The New York Times traced one set of searches to an actual user, and revealed her identity (with her permission): 62-year-old Thelma Arnold, who lives in Georgia, loves dogs, and has a number of friends with serious physical ailments. Wired News stated that it was able to discover the identity of a 14-year-old through the search records.
If this alone wasn’t enough to scare people, many reporters were more than happy to point out that certain anonymized users were conducting the kinds of searches they wouldn’t want discussed on television. CNET posted a list of searches from 16 different users that went on for more than four pages. The searches revealed a number of disturbing things, like the successful, overweight, apparently conservative searcher with an interest in child porn, not to mention all the people looking to get even with an ex or divorce their spouses – or even kill them, in at least one case. There were also those trying to deal with depression (or not deal with it, judging from the queries about committing suicide); make a new start; avoid dealing with a DUI ticket; avoid paying taxes, legally or otherwise; and so on. As one observer noted, it’s almost as if search engines are the new confessional, taking in all our “sins.”
There can be no doubt that AOL’s release of search data was an unmitigated disaster in certain quarters. But to others, it was like water in the desert – or perhaps a better analogy would be the somewhat guilty pleasure of a mug of hot chocolate in the afternoon to keep you going. Here, of course, I’m talking about marketers and researchers who are all but desperate in their need for data.
Steve Beitzel, an affiliated researcher with the Illinois Institute of Technology’s Information Retrieval Lab, raised an interesting point. “Researchers at universities or small companies don’t have access to this type of data. I think the [AOL] researchers were trying to do a good thing by making this available to the research community.” The road to hell, as they say, is paved with good intentions; on the other hand, Beitzel is correct in fingering this as a problem.
How often do researchers have access to this kind of data? In 2003, hundreds of thousands of internal email messages were opened to the public on the Federal Energy Regulatory Commission’s web site. Web researchers pounced on this hoard, and several research papers focusing on it have emerged since. It is the only large body of actual email in the public domain. As to actual search data, academic researchers have to settle for two sets of data – one from Excite and one from Alta Vista, and both of them nearly a decade old.
Search engines have improved tremendously in that time, and searchers have changed their strategies accordingly. That makes the old data practically useless. While the fresh data is seriously tempting to many academic researchers, some won’t touch it. Jon Kleinberg, a professor of computer science at Cornell, downloaded the data – but has decided against using it. “The number of things it reveals about individual people seems much too much,” Kleinberg explained. “In general, you don’t want to do research on tainted data.”
Marketers have been less reserved about using the data, though so far the discoveries haven’t been as awful as you might think. One observer noted that “No one can spell” and that users are apparently in serious need of three kinds of vertical search engines: one focused on health, one for religious queries, and one – of course – for pornography. Another observer noted from examining the data that “Satisfying a search intention may take weeks – or months,” meaning that advertisers should be patient with their search campaigns and think long term.
It will be some time before we feel the full effects of AOL’s breach-of-privacy debacle. But it has certainly exposed us to the conflicting needs of everyone who uses search. Now that all of this is out in the open, perhaps we’ll see a discourse about how search engines can best serve their users – all of them.