Taking a DeepDyve into the Deep Web

If the latest figures from Internet research companies can be believed, Google serves the needs of most web searchers most of the time. But serious researchers know that the web boasts nooks and crannies that even Google’s spiders can’t reach. DeepDyve thinks it can help solve this problem.

These secret caverns, nicknamed the “deep web,” hide their content behind subscription-based firewalls. Mainly of interest to academics and researchers in specialized fields (such as law or medicine), these databases of information are expensive to build and expensive to maintain. Fortunately, their main clientele includes those with the money to pay for access: lawyers, large libraries, medical schools, major universities and the like.

That’s not quite so helpful for the individual, independent researcher. There are parts of the deep web that don’t require a subscription, but a lone researcher may face other obstacles. Many important scientific papers don’t receive a lot of links, regardless of their scholarly citations. This stymies the search engines’ usual approach to finding documents. So what can you do when you know the information is out there, but you just can’t get to it by the usual means?

This is where DeepDyve comes in. Two bioinformatics scientists who worked on the Human Genome Project founded the company as Infovell in 2005. Their genetics background shows in the algorithm DeepDyve uses. Called “KeyPhrases,” it indexes passages up to 20 keywords in length rather than single keywords. Indeed, rather than focusing on keywords at all, KeyPhrases matches patterns and symbols. DeepDyve CEO William Park told Wired that the algorithm “is really doing pattern matching; it’s not at all language dependent. In fact it’s actually language agnostic.”
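To give a feel for what indexing passages rather than single keywords might look like, here is a minimal Python sketch. It is not DeepDyve’s KeyPhrases algorithm, which the company has not published in this article; the function names, the whitespace tokenizer, and the scoring rule are all my own assumptions for illustration.

```python
from collections import defaultdict


def phrase_windows(text, max_len=20):
    """Yield every run of 1..max_len consecutive words in text."""
    words = text.lower().split()
    for start in range(len(words)):
        for length in range(1, max_len + 1):
            if start + length > len(words):
                break
            yield " ".join(words[start:start + length])


def build_phrase_index(docs, max_len=20):
    """Map each phrase to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for phrase in phrase_windows(text, max_len):
            index[phrase].add(doc_id)
    return index


def search(index, query, max_len=20):
    """Rank documents by shared phrases; longer shared phrases count more.

    The weighting (one point per word in each shared phrase) is a
    hypothetical choice, not something taken from DeepDyve.
    """
    scores = defaultdict(int)
    for phrase in set(phrase_windows(query, max_len)):
        for doc_id in index.get(phrase, ()):
            scores[doc_id] += len(phrase.split())
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

A toy index like this grows very quickly with `max_len`, which hints at why a production system would need the kind of specialized indexing the founders brought over from genomics.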

This reflects a genetics background in at least two ways. First, think of the length of the human genome; it’s huge, and its protein-coding stretches are written in three-letter “words,” the codons that specify the amino acids chained together into the proteins that keep our bodies functioning. DeepDyve’s KeyPhrases algorithm uses indexing techniques from the field of genomics; if they work on the human genome, what can they do for the Internet?
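To make the three-letter “words” idea concrete, here is a small Python sketch that reads a DNA string three letters at a time (biologists call these triplets codons) and records where each one occurs, much as a text index records where each word occurs. The function names and the sample sequence are made up for illustration, and this says nothing about how DeepDyve itself stores its index.

```python
from collections import defaultdict


def codons(sequence):
    """Split a DNA string into consecutive three-letter words.

    Any incomplete one- or two-letter tail is dropped.
    """
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]


def codon_positions(sequence):
    """Map each three-letter word to the list of positions where it appears."""
    index = defaultdict(list)
    for position, codon in enumerate(codons(sequence)):
        index[codon].append(position)
    return dict(index)
```

Calling `codon_positions` on a short made-up string such as `"ATGGCCATGTA"` groups the repeated `ATG` under one key, which is the same lookup structure a search engine builds for words.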

And then there’s the more prosaic aspect. Have you ever tried to hunt down medical research papers on the web? DeepDyve’s web site describes the process as “frustratingly limited and time-consuming.” An ordinary search engine won’t support the kind of complex queries real scientific researchers need to make. Worse, it can’t reach the deep web, for the reasons I mentioned earlier. Is it any wonder that such engines return too few, too many, or irrelevant results for this kind of research?

Sites with specialized research engines present their own problems. Aside from being expensive, they can be difficult to use. Not every researcher knows Boolean query syntax as well as they know their favorite topic. But enough about the problems and how DeepDyve’s algorithm works; what is it like for the end user?

Well, you are required to register with an email address. Signing up is quick and painless (be sure to let the activation email through your spam filter). Once you do, you get to see this pretty typical search screen:

I’d like to draw your attention to something very important that isn’t typical: the search box. That’s not simply a box; it’s practically an entire apartment building when compared to most search engine text boxes. They’re not kidding when they say they’re set up to search long strings!

You can see that DeepDyve offers examples. I know the search engine will be optimized for those examples, but since I honestly didn’t have a better query in mind, I clicked on the one for cancer treatments. It gives a good idea of how DeepDyve works:

At the top you can see this search involved no mere hunt for the words “cancer treatments.” No, DeepDyve started me with “Turning off a protein that helps grow blood vessels that feed tumors actually makes cancers get bigger, not smaller, according to two new studies.” There’s more, but that captured my attention right there (good target marketing, DeepDyve!), since I’d read in New Scientist about a study that hinted at the opposite: starve a tumor of its blood supply, and it shrinks.

That pane on the left is for the subject areas covered by the search. DeepDyve says its index contains 500 million web pages covering a variety of topics; more on that in a bit. As you can see, all of the subject areas are checked, and my search returned a little over 410,000 results. DeepDyve returned the top 250, and by default shows them to me ten at a time (I could change that to 25 or 50 at a time, if I wished). You can minimize the pane if you find it distracting.

The blurbs under each entry typically span an entire paragraph, which is longer than the sentence or two you get from most search engines. It’s not unusual to need to read this much of a research paper before you know whether or not it will be useful to read the rest. DeepDyve makes the source of each item crystal clear, both from the icons on the left side of each entry and from displaying the author’s name, name of publication, and country (whatever information is available, presumably) under the link.

If you’d rather not see something so verbose, you can switch to a summary view. You’ll get a title, a sentence, the date the item was last updated, and the source. Likewise, instead of sorting the list of results by relevance, you can tell DeepDyve to sort by date or source.

Each entry features a “more” link. This doesn’t take you to the article; instead, it performs a little Ajax magic that delivers a box with more information about the article. It includes the title, source, author, journal, publisher, subject, ISSN, and DOI; those last two are identifiers that will help you find the article. From here, you can save the article to a folder that you name; you can also do this directly from the results. Additionally, you can use this box to help you find matches to your query in the text of an article. I didn’t want to show you the entire Ajax box because I would have had to shrink it too much, but here’s a close-up to explain what I mean:


“Save” will save the item to a folder you choose (this pertains to your account with the search engine, not your PC). “Search” lets you select text for which to search within the item in several different ways. “Original text” takes you directly to the article, and “prev” and “next” let you preview the previous and next entries that turned up in your search results.

DeepDyve does a nice trick that I’m pretty sure we won’t be seeing Google match any time soon. You may have noticed that entries feature an arrow that says “More like this” (yes, I know, with the image shrunk it’s hard to tell; bear with me). Click that arrow and DeepDyve takes the entire article and runs it as your query. Not just a few keywords, but the entire contents of the article!
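For the curious, here is one simple way a “More like this” feature could work when the whole article is the query. This is a hedged Python sketch using plain word counts and cosine similarity, a standard textbook technique I’ve swapped in for illustration; DeepDyve has not said it works this way, and the function names and the tiny corpus format are my own assumptions.

```python
import math
from collections import Counter


def term_counts(text):
    """Count how often each lowercase word appears in text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def more_like_this(doc_id, docs, top_n=3):
    """Use the full text of docs[doc_id] as the query against every other document."""
    query = term_counts(docs[doc_id])
    scored = [
        (other_id, cosine(query, term_counts(text)))
        for other_id, text in docs.items()
        if other_id != doc_id
    ]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]
```

The point of the sketch is that nothing is thrown away: every word in the source article contributes to the score, rather than a handful of keywords picked out by the user.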

It’s hard to pin down my favorite DeepDyve feature; I like the way that, in general, it’s put together to help you do continuing research. The fact that you can save and repeat searches, save links to particular documents in folders, and so forth, really brings this out.

Before I wrap up this section, I want to note that I went beyond cancer, and beyond the search engine’s examples. I did a search for “Planets which are in the habitable zone, a zone in which water can exist in a stable state as a liquid.” I turned up several planets, plus some interesting articles (including one that talks about why extremophiles may be irrelevant to the origin of life). A slight rephrasing of the query turned up an article about habitable moons around giant planets – just the thing for a budding science fiction writer who wants to get the science right.

So DeepDyve returned a number of interesting items, but we all know that a search engine isn’t any better than the web sites it indexes. What does DeepDyve index? A lot, as it turns out. Here’s the list I received along with my activation email:

  • Life Science and Medical: Includes over 600 full text journals, including Annual Reviews, BioOne, Sage, and Mary Ann Liebert; databases including Medline (containing abstracts from over 15,000 journals), CRISP database of federally funded biomed research projects, the World Health Organization Model List of Essential Medicines, and news from industry websites like Nature, WebMD, Pharmaceutical Online, and Genetic Engineering News.
  • Physical Sciences: Current news and information from the web in the physical sciences, information technology, clean technology and energy from major online news bureaus such as The New York Times, Forbes.com, CNN, Financial Times, and Reuters, as well as open-access industry websites.
  • Patents: Nearly 12 million documents from the US and European Patent offices.
  • Wikipedia

I know, Wikipedia looked a little odd to me too, but there it is. DeepDyve isn’t finished adding to its index by a long shot. It plans to tackle topics such as information technology, clean technology, and energy. Oh, and if you’re still not entirely happy with what you can do with it, for $45 per month you get some extra ways to manipulate your documents, explained in the search engine’s tour.

Not everyone will want or need to use DeepDyve. And of those who could benefit from what the search engine can provide, many have access, one way or another, to the parts of the deep web that sit behind various kinds of virtual walls. But for those of us whose hunger for information never seems to be satisfied, it’s one more tool with which we can work.
