Search Engines For the Invisible Web

There are many websites and resources on the Internet that cannot be reached by querying the major search engines. Fortunately, there are other ways to reach this Invisible Web. This article discusses the various kinds of search engines and databases that can be used for exploring this hidden gold mine of information.

In a recent article I discussed the existence of a vast number of documents that, for a variety of reasons, cannot be retrieved using the major search engines. The so-called Invisible Web does exist, and many treasures are hidden from easy access in various ways. Some sit behind the thick “firewalls” of sites that require registration before they can be searched. Others have complex database structures that search engine crawlers cannot technically deal with. Still others are hosted on a blacklisted domain, or are simply skipped because search engines cannot index every single page on the Web the moment it appears.
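To make the first two cases concrete, here is a minimal sketch of a link-following crawler run against a made-up, in-memory site. Everything in it (the SITE dictionary, the page paths, the crawl function) is hypothetical and invented for illustration; real search engine spiders are far more elaborate. It only shows why a page behind registration, or a page that exists only as the answer to a search form, never reaches the index.

```python
from collections import deque

# Hypothetical in-memory site: each page lists its outgoing links and
# whether it sits behind a registration wall. "/search?q=recalls" is a
# result page produced by a search form, so no other page links to it.
SITE = {
    "/": {"links": ["/about", "/members", "/search"], "login": False},
    "/about": {"links": [], "login": False},
    "/members": {"links": ["/members/report"], "login": True},
    "/members/report": {"links": [], "login": True},
    "/search": {"links": [], "login": False},
    "/search?q=recalls": {"links": [], "login": False},
}

def crawl(site, start="/"):
    """Follow links breadth-first, skipping pages that need a login."""
    seen = set()
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if url in seen or url not in site:
            continue
        page = site[url]
        if page["login"]:
            continue  # an anonymous crawler cannot get past registration
        seen.add(url)
        queue.extend(page["links"])
    return seen

visible = crawl(SITE)
invisible = set(SITE) - visible
print("Indexed:", sorted(visible))
print("Invisible:", sorted(invisible))
```

In this toy site the members area and the form-generated result page end up in the invisible set, even though nothing about their content is secret.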

In another article I described search directories as one of the tools for retrieving Invisible Web information. Now I am going to give more ideas about which search engines, in addition to the major ones like Google, Yahoo, MSN, and so on, can be used to find resources that are hidden among the vast amount of noise on the Internet.

It is worth clarifying that the Invisible Web is a moving target. For example, a site that is not accessible via Google today may be included tomorrow, once Google’s spider visits the site. But it is also possible that, if the whole site or certain pages are accessible only after registration (even free registration), the site will never appear on Google.

No company works to find only the pages that are not available through searches performed on Google, Yahoo, and MSN, so the search engines for the Invisible Web do not provide only the results that are missing in the major search engines. On the contrary, they provide mixed results that reflect what their indexing algorithms have found. These results include both pages that can be found via Google and pages that cannot. It is the second group that is more interesting.

Drawing the dividing line between searchable sites, search engines (topical, general, or metasearch), and (topical) databases is a little tricky. Many sites fall into more than one category. For example, a website might not only link to information on a particular topic but also include articles covering the topic, which would make the site both a search engine and a searchable database. In that sense the divisions are sometimes artificial, and the categorization here is done just to separate the resources from one another. Because of this, some of the search engines, such as ProFusion, are mentioned under each of the categories in which they fit. Another tool for navigating the Invisible Web, search directories, has already been discussed in a previous article.

Some of the most popular searchable sites and search engines for the Invisible Web are Direct Search (http://www.freepint.com/gary/direct.htm), IncyWincy (http://www.incywincy.com/), The Invisible Web (http://www.invisibleweb.com), and CompletePlanet (http://www.completeplanet.com). Although some of them provide additional metasearch options for searching with five, 10, or even more search engines, each of these has its own indexing capabilities for Invisible Web pages.

  • Direct Search (http://www.freepint.com/gary/direct.htm) is considered one of the biggest and most intensively maintained resources for the Invisible Web. It is not only a search engine, but also a search directory with links in many categories. One of its advantages is the topical compilations, which are gatherings of links connected to a specific topic – for instance, Almanacs/Factbooks/Statistical Reports & Related Reference Tools.

  • IncyWincy (http://www.incywincy.com/) is another search engine that explicitly states its purpose to be searching the Invisible Web. It claims to index 50 million pages and hundreds of thousands of search engines. In addition to ordinary results, it returns search forms, which bring in results from searchable sites. A search box is included under each such result, and you can use it to search as if you had gone to the site itself. This is useful because you can continue your search immediately. A block on the right shows how many of the total hits for a query fall into a given category, such as Computers, Kids, or Society, which makes it easier to refine the search. Additionally, IncyWincy supports metasearches and searching for news.

  • The Invisible Web (http://www.invisibleweb.com) redirects you to http://www.profusion.com. This site started out as the first Invisible Web directory, but it has since grown into a metasearch engine that allows users to perform vertical searches on many topics. Some of the results in those searches could be found using the major search engines as well, but it is much more convenient when the results are grouped together. For instance, searching the vertical search group Company Tech Support returns results from the Knowledge Bases of Apple, IBM, Novell, and Gateway.

  • Lycos Invisible Web Catalog (http://dir.lycos.com/Reference/Searchable_Databases/) may include the word “catalog” in its name, but this resource is actually a search engine that gives you access to all sorts of online databases related to the specified term.

  • CompletePlanet (http://www.completeplanet.com) is a valuable resource for the Invisible Web because it provides links to over 70,000 searchable databases and specialty search engines. The trick with dynamic searchable databases is that they are more difficult to crawl. Because of this, they are rarely indexed by major search engines. When a search for a term is performed, it is submitted to multiple databases simultaneously, so in a sense this is also a metasearch engine.

  • Geniusfind (http://www.geniusfind.com/) is similar to CompletePlanet in that it is a database and search engine finder. Although it does not have the rich resources of the other search engines, Geniusfind offers topical search engines and databases. It can be used to locate an engine or a database for a topic of interest, where the searcher can then perform a more specific search.
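The category block described for IncyWincy above boils down to counting hits per category and then filtering by the one you pick. The following is a rough sketch of that idea only; the hits, titles, and function names are made up for illustration and say nothing about how IncyWincy actually implements it.

```python
from collections import Counter

# Hypothetical search hits as (title, category) pairs. The category names
# echo the ones mentioned above; the data itself is invented.
HITS = [
    ("Intro to podcasts", "Computers"),
    ("Science fair ideas", "Kids"),
    ("RSS feed readers", "Computers"),
    ("Community radio", "Society"),
    ("Coding for children", "Kids"),
    ("Open source audio tools", "Computers"),
]

def category_counts(hits):
    """Count how many hits fall into each category."""
    return Counter(category for _, category in hits)

def refine(hits, category):
    """Keep only the titles of hits in the chosen category."""
    return [title for title, cat in hits if cat == category]

counts = category_counts(HITS)
for category, n in counts.most_common():
    print(f"{category}: {n}")
print(refine(HITS, "Kids"))
```

Seeing the counts first tells you which category is worth drilling into before you read a single result.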

The all-purpose search engines for the Invisible Web can be used to directly find the information in which you are interested. But often it is faster and easier to use an all-purpose search engine to locate a topical search engine or a database. After you find search engines or databases on topics of interest, you can go to the site to see whether they provide the information you need. There are search engines and searchable databases for almost any topic imaginable.

Metasearch engines are becoming popular again because they deliver information from several search engines, thus making it less likely that a valuable result will be omitted. There are many more search engines with metasearch capabilities in addition to the ones already mentioned. Of these, ProFusion (http://www.invisibleweb.com or http://www.profusion.com), as already discussed, is specifically aimed at searching the Invisible Web. Other popular metasearch engines are old acquaintances such as Metacrawler (http://www.metacrawler.com), Dogpile (http://www.dogpile.com), Copernic (http://www.copernic.com), and SurfWax (http://www.surfwax.com). These can be very helpful, though they do not specifically target the Invisible Web.
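For readers who want to see the metasearch idea in miniature, here is a small sketch: one query is sent to several engines at the same time, and the answers are merged so that each result appears once, along with the engines that returned it. The two engines are stand-ins built from tiny in-memory lists, and all names and example.com URLs are assumptions made for this illustration; none of the engines named above is being modeled.

```python
from concurrent.futures import ThreadPoolExecutor

def make_engine(index):
    """Build a stand-in search engine over a list of (url, text) pairs."""
    def search(query):
        q = query.lower()
        return [url for url, text in index if q in text.lower()]
    return search

# Hypothetical engines with partly overlapping coverage.
ENGINES = {
    "engine_a": make_engine([
        ("http://example.com/recalls", "Product recall database"),
        ("http://example.com/news", "Daily news"),
    ]),
    "engine_b": make_engine([
        ("http://example.com/recalls", "Recall notices"),
        ("http://example.org/toys", "Toy recall list"),
    ]),
}

def metasearch(query, engines):
    """Query every engine concurrently and merge results by URL."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in engines.items()}
        results = {name: future.result() for name, future in futures.items()}
    merged = {}
    for name, urls in results.items():
        for url in urls:
            # Record which engines found each URL; duplicates collapse.
            merged.setdefault(url, []).append(name)
    return merged

merged = metasearch("recall", ENGINES)
for url, sources in merged.items():
    print(url, "found by", ", ".join(sources))
```

The page that only engine_b knows about still shows up in the merged list, which is exactly why a metasearch is less likely to miss a valuable result.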

Topical Search Engines

There are so many topical engines that you will almost certainly be able to find one or more for the topics of interest to you. However, not all topical search engines are equal. Some are better than others because they provide more links and their information is fresher. So, if you encounter a topical search engine that lists sites last updated in 2000 or earlier, do not give up; not all topical search engines for the Invisible Web are that bad!

I am not going to make a long list of topical search engines; it is impossible, and such lists change very frequently. Besides, there are sites and search engines that offer lists of topical search engines, which you can see for yourself. Instead, I will suggest to you a couple of resources that might not be included in such a list, but in my humble opinion are worth seeing.

One of the resources I think is worth seeing is http://news.google.com/. It is valuable because, generally, news stories are not indexed by major search engines. Since news gets old quickly (and would already be history by the time a search engine spider indexes it, possibly several months later), it is unlikely that a searcher using the major search engines will be able to find up-to-date news. Also, news is unlikely to be found in search directories. It is true that news is most interesting while it is happening, but old news can have a second life as a reference source.

Another useful resource is http://www.podscope.com/. This is a search engine for podcasts. Although it is still not very popular, it contains a lot of links. I expect that, with the growing popularity of podcasting, this search engine will be a hot topic soon. One of its extras is that each retrieved item comes with a Play button (along with a link to the site, an XML feed option, and so on) that allows users to hear the podcast without leaving the site.

Topical Databases

Topical databases are also useful when navigating the Invisible Web. What makes databases different from search engines is that databases actually host the content they link to. This does not mean that they do not have external links to content they do not host, but their own content is abundant. As with topical search engines, there are topical databases for almost every conceivable topic, so if you search, you will find.

Among the visible but still largely unknown resources is http://catalog.google.com. For many people Google is just the number one search engine and nothing more than that. But Google delivers much more. The example with the news search in the previous section and the link to the catalog in this one show that Google does more than simply index the Web. I single out this URL because product catalogs are typical inhabitants of the Invisible Web.

Another valuable resource for products, although not exactly the catalog a manufacturer dreams of, can be found at http://www.cpsc.gov/cgi-bin/recalldb/prod.asp. This database allows users to find recalled products by product type, and contains a lot of useful information. It can be very helpful to those who want to learn which goods to avoid because of defects.

I could list at least a hundred more topical databases, but I do not think that after providing you with ideas about how to find them, you actually need such a list. Go forth and search!
