Navigating the Invisible Web

What do most people do when searching for information on the Web? They fire up Google (or occasionally another search engine) and rely on the returned results to find what they are looking for. In many cases you are lucky enough to find what you are after among the first 10, 20, or 30 results, but sometimes even browsing hundreds of further results does not bring you an inch closer to what you want. It looks as if the thing you are looking for is not out there on the Web, although your inner voice tells you that this is impossible; it must be there!

Although it is true that search engines have revolutionized the way we use the Web and have made vast amounts of content reachable by everybody, it is also true that search engines (even the most powerful ones) are not almighty. They cannot index every single page on the Web and include it in their databases. And what is not in a search engine’s database simply cannot be retrieved as a search result.

People learn quickly that even if search engines are the easiest way to search the Web, they are by no means the only one. What is more, in particular cases (for very specific searches) the major search engines are no good. They waste a lot of time and drown users in so much irrelevant information that it is no wonder people get furious at them and start looking for alternatives. But don’t get angry; search engines really do a very nice job on the Web, at least on the visible part of it. If what you are looking for happens to belong to the other part, the Invisible Web, just leave Google for a while and follow the other paths to navigating the Invisible Web.

What is the Invisible Web and Why Does It Exist?

First, let’s clarify some terms which are interconnected but identify different things. The portion of the Web that is indexed by search engines like Google, Yahoo, MSN, etc. is often called the Surface Web because it is the topmost, visible layer of all documents on the Web. Estimates are that the Surface Web currently consists of over 20 million Web servers (not to mention the number of documents on each of them) and that the portion not covered by general search engines is up to 500 times greater! Most of these uncovered resources belong to the Invisible Web, the portion of the Web that for a variety of reasons does not appear in the search results of the major search engines but can be accessed by special search tools. There are also other portions of the Web, referred to as the Opaque Web and the Dark Web. The Opaque Web is content that is not linked and cannot be accessed even by special search tools; the Dark Web is content that is not for general use (e.g. corporate intranets). Bearing in mind the speed at which the Web grows, one can expect that the number of non-visible pages will increase, both in absolute numbers and as a percentage.

The above clarification of the parts of the Web was necessary because there are different ways to navigate the different portions of the non-visible Web. For resources that belong to the Opaque and Dark Web, the only way to find them is for someone to tell you their URL (and to provide you with the login credentials, if needed). For Dark Web sites, visitors from the general public are not always welcome, and this explains why site owners might explicitly ban search engines from indexing their sites.
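One widely used way for a site owner to ask crawlers to stay away is a robots.txt file. The sketch below is only an illustration: the rules, the crawler name "ExampleBot" and the example.com URLs are made up, and it uses Python's standard urllib.robotparser module to check whether a well-behaved crawler would be allowed to fetch a page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every crawler is asked to skip /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawlable(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for page in ("http://www.example.com/public/index.html",
                 "http://www.example.com/private/report.html"):
        print(page, is_crawlable(ROBOTS_TXT, "ExampleBot", page))
```

Pages that such rules exclude simply never reach the search engine's database, which is one way a perfectly reachable page ends up outside the Surface Web.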

So what are the reasons pages or sites go Invisible? There are a variety of reasons, but the most common ones are:

  1. No matter how refined and powerful search engines and their indexing algorithms are, they are still technically unable to index every single page the moment it appears on the Web.

  2. The page has appeared after the latest crawl of the spider and it will not be indexed until the next time the spider returns to the site. Chances are that the next crawl of the spider will include many results currently excluded from the search engine’s database.

  3. The site requires registration before use or is fee-based, and this prevents the search engine from indexing it. As long as these conditions hold, the site cannot be indexed.

  4. The search engine has excluded the site from its database. Sometimes Google (and other search engines) exclude from their listings sites that violate the rules and misbehave by trying to fool the search engine in order to get a top listing. Do not take it for granted, but chances are that (hopefully) one day the site will return to the search results. Until that day comes, welcome to the (Invisible Web) club!

  5. The content of a particular site is locked inside a database and cannot be indexed by Web crawlers. Very often the site’s database is completely searchable, but only after you personally go there and perform the search.

  6. Although generally dynamic pages are indexed by search engines, it is not realistic to expect that the last-minute news will top Google. Instead, if you are looking for news, stock quotes or similar items that change often and quickly become old, go directly to a known source: a news portal, the site of a news media company, or a financial institution that provides such information online.

  7. Due to the restrictions search engines have regarding the depth of crawl and/or page size, separate pages or whole sites could be skipped during indexing. For instance, if the URL of your site is http://www.domain.com/dir1/dir/11/dir111/…/mysite/index.html it might or might not get indexed, but why use such a URL when you can use http://www.mysite.com/index.html instead?

  8. The domain has been classified as a source of spam and because of this all pages that are hosted there are excluded from search engines.

  9. If you are looking for a page/site in a language that has special characters (accented characters, umlauts, etc.), which are sometimes substituted by ASCII symbols, try searching using all possible transcriptions of the keyword. Or go to a national search engine that does not have such problems.

  10. Pages in an unsupported format or pages that use question marks in links are generally not loved by search engines, and even if the search engine does not have a policy of explicitly excluding such pages, it is less likely that they will be indexed.
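Two of the reasons above, deep directory nesting and question marks in links, can be checked mechanically. The following is a rough sketch, not a description of how any search engine actually decides: the threshold of 3 directory levels and the example.com URLs are assumptions chosen purely for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical threshold; real crawlers do not publish a fixed limit.
MAX_DEPTH = 3


def indexing_hints(url, max_depth=MAX_DEPTH):
    """Report how deeply a URL is nested and whether it carries a query string."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # Treat a trailing segment with a dot (index.html, item.php) as the
    # document itself rather than as another directory level.
    if segments and "." in segments[-1]:
        segments = segments[:-1]
    depth = len(segments)
    return {
        "depth": depth,
        "too_deep": depth > max_depth,
        "has_query": bool(parts.query),
    }


if __name__ == "__main__":
    for url in ("http://www.example.com/index.html",
                "http://www.example.com/a/b/c/d/index.html",
                "http://www.example.com/catalog/item.php?id=42"):
        print(url, indexing_hints(url))
```

A flagged URL is not necessarily invisible; the check only points at pages that are worth a closer look.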

The list of reasons is hardly exhaustive, but I hope that it will be useful both for users desperately hunting for information and for SEO experts who, although most likely aware of all those reasons, are wondering why some or all of their pages do not appear in the search results.
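For the special-characters case (reason 9), the alternative spellings of a keyword can be generated rather than guessed. This is a minimal sketch under stated assumptions: the digraph table covers only the German conventions (ü becomes ue, and so on), other languages transcribe differently, and the function names are invented for this example.

```python
import unicodedata

# Assumed transcription table (German conventions only).
DIGRAPHS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
            "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}


def strip_accents(text):
    """Drop combining marks, so that 'é' becomes 'e' and 'ü' becomes 'u'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def keyword_variants(keyword):
    """Return the original keyword plus its plain-ASCII style transcriptions."""
    keyword = unicodedata.normalize("NFC", keyword)
    digraph_form = "".join(DIGRAPHS.get(ch, ch) for ch in keyword)
    variants = {keyword, strip_accents(keyword), strip_accents(digraph_form)}
    return sorted(variants)


if __name__ == "__main__":
    print(keyword_variants("Müller"))
    print(keyword_variants("café"))
```

Each variant can then be tried as a separate query in the search engine or directory of your choice.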

What Kind of Information is Commonly Invisible?

It is not surprising that quality information is hidden in the invisible space. The contents of the hidden databases that power the Invisible Web vary, and they are not always of general interest. But to those who are interested in a given topic, they can be a gold mine. Some of the items that often cannot be found by major search engines are:

  • Dynamic sites, for instance knowledge bases, that are not forbidden for the general public but due to the way their content is generated are often skipped by search engines.

  • Specialized databases: medical, scientific, etc.

  • Court records that are available on request

  • Patent and trademark information

  • Archived publications in journals and magazines

  • Library catalogs

  • Product catalogs

  • Classifieds and advertisements

  • Multimedia content and files with special extensions that search engines exclude deliberately

  • News, mailing lists, postings in discussion groups (unless you perform a special newsgroups search)

  • Yellow & White Pages listings

The good news is that almost all that content can be found by the search tools for the Invisible Web.

How to Find Information On the Invisible Web

If your favorite restaurant still does not have a site, no search engine on Earth will help you find it. But if you are looking for information that’s available and isn’t deliberately hidden behind thick walls (e.g. paid sites or corporate intranets), you will find what you are looking for. You just need some tenacity.

Briefly, what you can do is follow the search techniques (both online and offline) from the time before the advent of the major search engines. This means that hunting for resources on the Invisible Web can be done by going to specialized directories, searchable sites and Invisible Web search engines, Invisible Web databases, metasearch engines, virtual reference libraries, specialized portals, national search engines, non-English sites, etc. The search resources described next are by no means a comprehensive list; rather, they are listed mainly to give you an idea of where to search for Invisible Web pages.

Specialized Search Directories

Search directories are special collections of links that are organized hierarchically by topic. For instance, the top-level topics might be business, technology, entertainment, education, etc. Each of these topics has subtopics, which in turn have subtopics of their own, and so on. In the above example, subtopics of business could be finance, management, services, etc. If the hierarchical structure is clear, it does not take much time to check whether what you are looking for is in the directory or to find a directory with the kind of material you want. Probably one of the most valuable things about search directories is that their content is reviewed by humans and irrelevant material is sorted out.

Very often search directories have evolved from mere listings of tens of thousands of sites into more organized collections of links with a built-in search tool, which helps you retrieve the desired results faster.

Examples of popular search directories are Librarians’ Index to the Internet (http://lii.org/), Infomine (infomine.ucr.edu), Yahoo! (http://www.yahoo.com/ – the Directory Service, not the search engine), About.com (http://www.about.com/), the Open Directory Project (DMOZ – http://www.dmoz.org/), etc.

There are hundreds of search engines that target the Invisible Web. Some of them have a directory service as well, which allows you to browse by topic; sometimes the distinction between a search engine and a search directory for the Invisible Web is not so clear. Although it is true that searching the Invisible Web is not as convenient as firing up Google and getting search results in a fraction of a second, often the results are worth the effort.

Among the most popular search engines for the Invisible Web are Direct Search (http://www.freepint.com/gary/direct.htm), The Invisible Web Directory (http://www.invisible-web.net/), The Invisible Web (http://www.invisibleweb.com/), CompletePlanet (http://www.completeplanet.com/), etc. The list is too long to be included entirely here.

For the topics of interest to you, consider also topical search engines, which provide selected links, thus saving you time. Topical search engines cover subjects ranging from cooking to spaceships.

Metasearch engines are another important tool to consider when navigating the Invisible Web. Before Google they were a very popular way to find information because they aggregated the results from a predefined number of other search engines, thus delivering Web-wide results. Now metasearch engines like Metacrawler (http://www.metacrawler.com), Dogpile (http://www.dogpile.com/), Copernic (http://www.copernic.com/) and SurfWax (http://www.surfwax.com/) are regaining popularity.
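To give a feel for what aggregating results involves, here is a small sketch of merging ranked result lists from several engines. It is an illustration only: the scoring rule (summing 1/rank across engines) is one simple choice and not the method of any metasearch engine named above, and the engines and URLs are invented.

```python
from collections import defaultdict


def merge_results(result_lists):
    """Merge several ranked URL lists into one, removing duplicates.

    Each URL earns 1/rank from every engine that returned it, so pages
    ranked highly by several engines rise to the top of the merged list.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / rank
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda url: (-scores[url], url))


if __name__ == "__main__":
    # Hypothetical results from two imaginary engines.
    engine_a = ["http://b.example/", "http://a.example/", "http://c.example/"]
    engine_b = ["http://a.example/", "http://d.example/"]
    for url in merge_results([engine_a, engine_b]):
        print(url)
```

Because a page only needs to be known to one of the underlying engines to appear in the merged list, a metasearch can surface pages that any single engine misses.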

Invisible Web Databases and Virtual Reference Libraries

Perhaps you have already searched the directories and the Invisible Web search engines, and you still have not found what you are looking for. Or perhaps you are looking for very specialized material of the kind that is organized in a searchable portal (like the topical medical databases on PubMed, http://www.ncbi.nlm.nih.gov/entrez/query.fcgi, or FindArticles, http://www.findarticles.com/, where there are over 5 million articles which cannot be found on the search engines). In either case, you can also check some of the specialized databases or virtual reference libraries. Unlike directories and search engines, specialized databases generally contain the material itself, not only links to it. A good source for virtual reference is The Internet Public Library (http://www.ipl.org/).

I would like to repeat once again that the links and sites quoted here are just a small portion of the resources for navigating the Invisible Web. Once you find your tools, discovering Invisible content can be so much fun! For Web marketers, these resources are also valuable because they suggest additional places to submit a site in order to make it reachable by more people.
