Continuing its never-ending quest to take over the world – er, sorry, organize all of the world’s information so that it is easily accessible and searchable – Google just received a patent this month for “a voice interface for search engines.” Patent number 7,027,987, which you can read in its entirety here, hit the SEO community like a storm. You’d almost think it came out of nowhere, especially with how tight-lipped Google has been about it. But these things always have some background, which I’ll show you in a bit.
First, let’s take a look at the patent. The co-inventors credited include Monika Henzinger, Alexander Franz, Brian Milch, and Google co-founder Sergey Brin. Those names are worth remembering, particularly Franz and Milch.
The patent goes on to describe the interface as “A system [that] provides search results from a voice search query. The system receives a voice search query from a user, derives one or more recognition hypotheses, each being associated with a weight, from the voice search query, and constructs a weighted Boolean query using the recognition hypotheses. The system then provides the results of the search system to a user.” One analyst translated this as “the system listens to your query, does its magic, and returns the results.”
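To make the phrase “weighted Boolean query” a little more concrete, here is a rough Python sketch of the general idea. It is my own illustration, not the method claimed in the patent: the function names, the sample hypotheses, and the weighting rule (a term’s weight is the sum of the weights of the hypotheses containing it) are all assumptions.

```python
from collections import defaultdict


def build_weighted_query(hypotheses):
    """Merge recognition hypotheses into one weighted OR query.

    hypotheses: list of (text, weight) pairs from a speech recognizer.
    Returns (term, weight) pairs sorted by descending weight, then by term.
    Assumption: a term's weight is the sum of the weights of every
    hypothesis that contains it. The patent text does not spell this out.
    """
    term_weights = defaultdict(float)
    for text, weight in hypotheses:
        # Count each term once per hypothesis, even if it repeats.
        for term in set(text.lower().split()):
            term_weights[term] += weight
    return sorted(term_weights.items(), key=lambda kv: (-kv[1], kv[0]))


def score_document(doc_text, weighted_query):
    """Score a document by summing the weights of the query terms it contains."""
    doc_terms = set(doc_text.lower().split())
    return sum(weight for term, weight in weighted_query if term in doc_terms)


# Hypothetical recognizer output for a single spoken query.
hypotheses = [
    ("pizza parlor chicago", 0.5),
    ("pizza parlor chicken", 0.25),
    ("piazza parlor chicago", 0.25),
]
query = build_weighted_query(hypotheses)
print(query)
print(score_document("best pizza in chicago", query))
```

The point of the weights is that the search engine never has to bet everything on one transcription. Words that show up in several hypotheses (“parlor” here) count for more, while a one-off mishearing such as “piazza” still contributes a little.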
It’s a bit more complicated than that, as the paper Franz and Milch wrote back in 2002 makes clear. Titled “Searching the Web by Voice,” the five-page document (available here: http://labs.google.com/papers/webbyvoice.html) explains the hurdles involved in creating an accurate voice search interface. While some of the math was beyond me, one point stood out: after they had optimized it to the best of their abilities, the interface (to quote from the article) “can return the correct transcription of a spoken query among its top 10 hypotheses about 60% of the time.” That’s not bad, but it’s not exactly good either, and it’s certainly not “magic.” Of course, this was also four years ago, and we know how much technology can change in that time.
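For readers wondering what “among its top 10 hypotheses” means as a measurement, here is a small Python sketch of that kind of metric. The function name and the test data are made up for illustration; I am not claiming this is how Franz and Milch ran their evaluation.

```python
def top_n_accuracy(results, n=10):
    """Fraction of queries whose true transcription is in the top n guesses.

    results: list of (reference, ranked_hypotheses) pairs, where
    ranked_hypotheses is the recognizer's best-first list of guesses.
    """
    if not results:
        return 0.0
    hits = sum(1 for reference, hyps in results if reference in hyps[:n])
    return hits / len(results)


# Hypothetical evaluation data: what was said vs. what the recognizer guessed.
results = [
    ("pizza chicago", ["pizza chicago", "piazza chicago"]),
    ("weather boston", ["whether boston", "weather boston"]),
    ("zydeco lessons", ["psycho lessons", "sight echo lessons"]),
    ("cheap flights", ["sheep flights", "cheap flights"]),
]
print(top_n_accuracy(results, n=10))
print(top_n_accuracy(results, n=1))
```

In this toy data the right answer is somewhere in the list for three of the four queries, but it is the recognizer’s first choice for only one of them. That gap is why a figure quoted for the top 10 says little about how often the first guess is right.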
Speaking of time, I’d like to give you a little more of a historical timeline as background about Google and voice search. You’ll see, as I said earlier, that this patent didn’t come out of nowhere.
We can start with speech-to-text tools. Those have been around for decades. Of course they weren’t very good at the beginning, because of accents and subtleties in speech. Even today, using modern speech-to-text software, the program and the speaker end up training each other to some degree.
Moving further forward in time, IBM was talking about voice-in, text-out as early as 1999. Big Blue saw it as a way to get around the problems of decreasing cell phone size. Of course, well before that you had the 411 service evolving into a computer-driven voice-in, voice-out service that used human operators only as a back-up if the machine couldn’t decipher the query.
Google filed for its patent in 2001. At a guess, Franz and Milch were already working on their paper at the time, or at least finishing up the research. Interestingly enough, V-ENABLE, a company not affiliated with Google, was founded in the same year. On its home page the company bills itself as “the leading provider of mobile speech search solutions…” Craig Hagopian, president and COO of the firm, recently noted that Google’s patent appears to be complementary to his company’s technology – and turned coy about whether and what kind of discussions he was having with Google.
But now let’s continue with the timeline. Google regularly puts potential products up on the Internet in the Google Labs section of its website. These not-quite-ready-for-prime-time items sometimes go up with no fanfare at all; they’re just there for users to discover. So it was with Google Voice. As near as I can tell, this service went up sometime in 2003 and was deactivated late that same year; you can still see the demo page here (http://labs1.google.com/gvs.html). It worked by having a user call an automated phone number, state their query, and then click a link on the demo page to see the search results.
How good was it? One person commented to a blog discussing the patent that “I used the service back in fall 2003 I believe. It was pretty amazing – it understood what I said, and the results were instant on my screen. I caught it right at the end of the demo phase, I think. I tried to show my co-workers a couple days later and it was already down, and it’s been down since then…”
Getting a voice-activated search interface to work well enough for people to use is not a piece of cake. First, there’s the fact that most search queries are short: usually five or six words at most, and often more like two or three. This would be fine if you could try to match those words up with a limited “vocabulary” – indeed, that’s why so many phone trees designed to recognize voice input work as well as they do. But a search engine requires recognition of a huge vocabulary; as Franz and Milch note, “Even a vocabulary of 100,000 words covers only about 80% of the query traffic.”
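The vocabulary problem is easy to see with a toy example. The Python sketch below picks the most frequent words from a tiny, invented query log and measures how many queries that vocabulary can handle. I am assuming a query counts as “covered” only when every one of its words is in the vocabulary; the paper may define coverage differently, and the function names and data are mine.

```python
from collections import Counter


def top_k_vocabulary(queries, k):
    """Return the k most frequent words in the log (ties broken alphabetically)."""
    counts = Counter(word for q in queries for word in q.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word for word, _ in ranked[:k]}


def query_coverage(queries, vocabulary):
    """Fraction of queries whose words are all in the vocabulary."""
    if not queries:
        return 0.0
    covered = sum(
        1 for q in queries if all(w in vocabulary for w in q.lower().split())
    )
    return covered / len(queries)


# Hypothetical query log, far smaller than anything a real engine sees.
queries = [
    "pizza chicago",
    "pizza delivery",
    "chicago weather",
    "weather",
    "zydeco lessons",
]
vocab = top_k_vocabulary(queries, k=3)
print(sorted(vocab))
print(query_coverage(queries, vocab))
```

Even here, one rare word is enough to sink a whole query, which is why the long tail of names, places, and oddball terms hurts a voice interface so much more than it hurts a phone tree.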
Even those two problems wouldn’t be so bad, but remember that a voice-activated search interface needs to deliver results in real time. Web surfers are used to seeing results from a search engine within fractions of a second of hitting “Submit”; indeed, Google’s own website always tells you how quickly it conducted your search, like a badge of pride. Anyone using the interface who is told “One moment please” and has to wait more than a second or two is going to feel like they’re being put on hold. I don’t know about you, but I get put on hold too frequently as it is; I don’t want that kind of frustration from my search engine.
Then of course there’s the obvious challenge of interpreting what the speaker said, even if the speaker has an unusual accent, or mumbles, or is in a noisy environment, or has a speech impediment, or…you get the picture. Text is much easier; even with misspellings, there are a limited number of possibilities (and the “Did you mean…?” clickable sentence that sometimes pops up when you search Google often takes care of that). If people from different parts of the same country have trouble understanding each other, what hope does Google have of doing better?
Leaving aside the technical challenges, it’s worth mentioning that Google is not alone in this field. VoiceSignal is a company that converts voice to text. I’ve already mentioned V-ENABLE; Promptu is another company that is working on voice search for mobile applications. AgileTV makes voice recognition software to search television. Microsoft may have been working on something like this as well. Remember the brouhaha over Kai-Fu Lee leaving the software giant for Google? That might have had less to do with opening China and more to do with the fact that his area of expertise is natural language.
The most obvious application for a voice-activated search interface is search on mobile devices such as cell phones. In fact, a Piper Jaffray research report indicated that the market for mobile search would reach $11 billion by 2008. That’s money that Google can’t afford to ignore. Just think how happy advertisers will be when they can reach mobile phones with contextual, search-based advertising!
Of course, this technology would make wireless companies very happy as well. Using some functions on your cell phone costs minutes, and that would certainly include mobile search. Most cell phone plans come with a limited number of “anytime” minutes. Use those up, and they can charge through the nose for those extra minutes – leaving you in shock when you get your bill.
Lest I start sounding too cynical, I’d like to add that I see plenty of other uses for a voice-activated search interface. It could be used to help blind web surfers or others who have problems using a keyboard. It could also be used to launch a phone number search service to compete with the current 411 providers.
If we think about ways that Google might expand the technology, there are some interesting potential applications. What if you could search your phone calls as easily as you can search the Internet? I’m not enough of a packrat to want to keep all of my phone calls, but I could see a service like this settling some arguments, among other things.
Or how about a voice interface for maps and navigation in a car? With global positioning to tell the service exactly where you are, you could ask it where the nearest pizza parlor is, and it would respond with a name, map, and turn-by-turn directions. As long as we’re dreaming here, maybe it could respond with the three nearest places and reviews if you want them!
Of course, that raises a whole new issue that Franz and Milch mention at the end of their paper. While their experimental data consisted of keyword queries, “voice search users might prefer to ask questions or make other types of natural language queries [emphasis in original]…” They go on to say that this would actually be easier to model and recognize. This used to be one of the Holy Grails of search. It will be interesting if we finally see it achieved.