Learning to Crawl: an Investigation of the Personal Web Crawler

After all this time, shouldn’t searching for something on the Internet be an easier process? Sure, it has vastly improved over the years, but there is still a significant element of hit-and-miss involved. Read on to learn about a different approach to finding what you’re looking for online.

Despite the Web having been part of our lives for well over a decade now, the fundamental task of searching it for information remains something of a lottery. Locating general information is easy, but finding something specific can present a significant challenge, even to the most experienced Internet researcher. Most searches still involve typing arcane expressions into a search box, applying quotes to limit the terms, and gradually refining the keywords in an effort to narrow the results down to those most relevant to the required material. Sometimes this approach is effective, but at other times it can seem like searching for a needle in a planet-sized haystack.

Over the years a large number of search engines have attempted to improve the accuracy of the search experience. From Google, which revolutionized the process back in the early days, right up to the newest contenders such as Hakia, with its Query Reprocessing technology, and Searchmash, which provides segregated multimedia results, companies have tried to find new ways of more accurately locating and delivering the information their customers need. Many of them have helped make our lives easier.

But despite all that, every one of these search engines and mechanisms is based on the same basic technology: web spiders and crawlers of one kind or another that travel the Web, locating and indexing information that is then returned in response to the search queries of end users. The task faced by these crawlers is monumental. According to the latest Netcraft survey, the Web as of mid-2008 consists of over 175 million sites containing billions of pages.

Between them, these pages cover a range of information too vast for a single individual to conceive of. So just how well can a multi-purpose search engine with its widely targeted crawlers be expected to meet the specialized requirements of its individual users? The answer, sadly, is probably not very well at all, as is suggested by the ongoing frustration expressed by typical Web users at the difficulty of finding exactly what they’re looking for.

There is, of course, another way to approach all this. Why not cut out the middleman? Rather than expecting Google or whoever to magically understand their precise requirements, users can run their own personal web crawler to seek out the information they need. This approach has a number of significant advantages.

  • Targeted information
    The main advantage is that, unlike those used by the search engines, your own web crawler can be configured to target precisely what you need. As an example, it’s useful to think about media research organizations whose business it is to seek out media reports about their clients. These clients are typically famous individuals or well-known corporations who need to locate information about themselves, usually as the basis for commercial decision making. The Web is obviously a major repository for such information, but a vast amount of pointless information is sure to be found alongside the useful data: blog references, mentions in passing in articles dedicated to other subjects, and references to namesakes are just some of the kinds of information that would typically need to be filtered out to make such research worthwhile.

    A personal web crawler provides one answer to this situation, since it can be configured to ignore certain types of references, or to search only certain sites or types of site. This offers a high degree of control over the information returned for a particular search, vastly increasing the likelihood that it will be relevant. (A minimal sketch of this kind of targeted crawl appears after this list.)

  • Background operation
    Another key advantage of the personal web crawler is that it can work in the background. Unlike typical searches, which must be carried out actively by typing search terms into the search engine’s interface, a web crawler will continuously monitor the Web – or at least the areas of it you specify – and return results as they are uncovered. This is a far more efficient approach for those who seek similar types of information on a regular or ongoing basis.

  • Privacy
    The privacy benefits of personal crawlers shouldn’t be underestimated, especially in these times of increasing concern over the amount of personal data gathered and retained by Google and other major search engines. This data has commercial value, allowing the delivery of precisely targeted advertising, but many individuals quite justifiably believe that what they search for on the web is nobody’s business but their own.

    With your own crawler, privacy ceases to be a concern, since search records are not visible beyond the local network. On a similar theme, and unlike public search engines, personal crawlers cannot be censored, making them highly useful in localities where web access is restricted for political or social reasons.
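
To make the first two advantages more concrete, the short Python sketch below shows one way a targeted crawl might be configured. The seed URLs, allowed domains, keyword filters and page budget are all hypothetical, and a real crawler would also need politeness measures such as robots.txt handling and rate limiting; running a script like this on a schedule (from cron, for example) gives a rough approximation of the background operation described above.

"""Minimal targeted-crawl sketch (illustrative only).

The seed URLs, allowed domains and keyword filters below are hypothetical
examples; a real deployment would also respect robots.txt and rate-limit
its requests.
"""
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

SEEDS = ["https://www.example-news-site.com/"]      # hypothetical media site
ALLOWED_DOMAINS = {"www.example-news-site.com"}     # crawl nothing outside these
MUST_CONTAIN = ["acme corporation"]                 # client being researched (hypothetical)
MUST_NOT_CONTAIN = ["acme anvils forum"]            # known irrelevant namesake (hypothetical)
MAX_PAGES = 200                                     # keep the crawl bounded


class LinkParser(HTMLParser):
    """Collect href values from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def relevant(text):
    """Apply the include/exclude keyword filters to a page's raw HTML."""
    lowered = text.lower()
    return (all(term in lowered for term in MUST_CONTAIN)
            and not any(term in lowered for term in MUST_NOT_CONTAIN))


def crawl():
    """Breadth-first crawl restricted to the allowed domains."""
    queue, seen, hits = deque(SEEDS), set(SEEDS), []
    fetched = 0
    while queue and fetched < MAX_PAGES:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                                # skip unreachable pages
        fetched += 1
        if relevant(html):
            hits.append(url)                        # record a page worth reviewing
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc in ALLOWED_DOMAINS and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return hits


if __name__ == "__main__":
    for url in crawl():
        print(url)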

Of course, there are limitations to the uses of personal web crawlers. A major constraint is imposed by the sheer scale of the Web: crawling the entire Internet with any degree of efficiency would require a server farm approaching the size of Google’s, which is obviously an impossible aspiration for an individual or small organization. For this reason, crawlers are most suitable when searches can be usefully restricted to very specific areas of the Web – newspaper and media sites, for instance, as in the media research example above.

There are also limits on the sheer volume of information it is reasonable to expect a personal crawler to gather and index. Again, attempting to index the entire Web would be as meaningless as it is impractical; large-scale search engines can do that kind of thing far more efficiently than any individual. The core strength of the personal crawler lies in accurately indexing large amounts of very specific information. Used appropriately, it removes much of the drudgery of this kind of task, leaving the user free to make use of the information it gathers rather than wasting time looking for it.

YaCy

YaCy is probably the best-known personal web crawler, and it offers powerful capabilities. A single YaCy installation can index and store over 10 million documents, and in a multiple-peer configuration there is no upper limit to its capacity; a brief sketch of querying a local YaCy peer appears after the list below. Among its claimed advantages are:

  • The ability to locate information that other search portals hide.

  • The ability to share indexes and create distributed search networks in a community of independent YaCy users.

  • The ability to search within different file formats and different types of media, including common audio and video file types.

  • A peer-to-peer web index exchange architecture with no central servers, which means that searches are anonymous and no central logs of search data are kept.
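
As a rough illustration of what working with your own index looks like in practice, the short Python sketch below queries a YaCy peer over HTTP. The port and the yacysearch.json endpoint shown are YaCy's usual defaults, but they and the shape of the JSON response are assumptions here and should be checked against your own installation's documentation.

"""Query a local YaCy peer and print result titles and links (sketch only).

Assumes YaCy is running on its default port (8090) and exposes the
yacysearch.json endpoint; the JSON layout navigated below is likewise
an assumption based on YaCy's OpenSearch-style output.
"""
import json
from urllib.parse import urlencode
from urllib.request import urlopen

YACY_URL = "http://localhost:8090/yacysearch.json"   # assumed default endpoint


def search(term):
    query = urlencode({"query": term})
    with urlopen(f"{YACY_URL}?{query}", timeout=10) as response:
        data = json.load(response)
    # Defensive navigation: the expected structure is channels -> items.
    for channel in data.get("channels", []):
        for item in channel.get("items", []):
            print(item.get("title"), "-", item.get("link"))


if __name__ == "__main__":
    search("personal web crawler")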

For more information on YaCy see:

http://webscripts.softpedia.com/script/Search-Engines/YaCy-45386.html


Subject Search Spider

Subject Search Spider (SSS) from Kryloff Technologies is a commercial personal web crawler designed to save time and increase productivity by automating much of the search process. Among other things, it claims the ability to:

  • Communicate with an almost unlimited number of search portals.

  • Visit, scan and quote web pages, storing the content in libraries for later use.

  • Identify material in which the search terms have been altered or misspelled.

  • Create browser-viewable reports of its findings.

  • Support queries in multiple languages.

One of the great benefits of SSS is its customizability. The ability to create precisely targeted searches makes it, according to its developers, “the only true personal meta-search engine that is fully configurable by the final-end user.” It is also designed to use disk space efficiently: rather than storing entire documents locally, SSS retains only enough information to determine each document’s relevance, and provides a link to the full document in its original location if more extensive viewing is required.

For more information on Subject Search Spider see:

http://www.kryltech.com/spider.htm

Copernic Agent

The Copernic search agent, although not exactly a web crawler in the traditional sense of the term, has enough similarities to merit inclusion. Rather than indexing the Web itself, it is an “intelligent” search tool that queries multiple search engines and processes what it receives, presenting filtered, usable results from which much of the useless material has been removed. It more closely resembles a meta-search engine than a crawler; a rough sketch of the merge-and-de-duplication idea behind this approach appears after the list below. A number of additional functions further improve the usability of the results:

  • Results are returned from over 90 categorized search engines.

  • Results are improved by removing duplicate entries and retaining the most relevant.

  • Built-in tools allow the removal of broken links, searching within the results, local saving, result sorting and extensive reporting.

  • Searches are saved, making them instantly available for reuse, modification and updating.
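
The de-duplication step mentioned in the list above is conceptually simple. The sketch below is not Copernic's actual algorithm, merely a hypothetical illustration of how results gathered from several engines might be merged: URLs are reduced to a comparable key, duplicates are collapsed, and the copy with the best relevance score is kept.

"""Hypothetical merge-and-de-duplicate step for a meta-search tool.

This is not Copernic's implementation; it only illustrates the general
idea of collapsing duplicate results returned by several search engines.
"""
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to a comparable key: host (minus "www.") plus path."""
    parts = urlsplit(url.lower())
    host = parts.netloc[4:] if parts.netloc.startswith("www.") else parts.netloc
    return host + parts.path.rstrip("/")


def merge(result_lists):
    """Merge results from several engines, keeping the highest-scored copy.

    Each result is a dict with "url", "title" and a numeric "score"; the
    score stands in for whatever relevance measure each engine reports.
    """
    best = {}
    for results in result_lists:
        for result in results:
            key = normalize(result["url"])
            if key not in best or result["score"] > best[key]["score"]:
                best[key] = result
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)


if __name__ == "__main__":
    engine_a = [{"url": "http://www.example.com/story", "title": "Story", "score": 0.9}]
    engine_b = [{"url": "http://example.com/story/", "title": "Story", "score": 0.7},
                {"url": "http://example.org/other", "title": "Other", "score": 0.5}]
    for result in merge([engine_a, engine_b]):
        print(result["score"], result["url"])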

The Copernic Agent is available in three versions. The "basic" version is a free download and offers standard searching and reporting capabilities. The "personal" version provides greater facilities for customization and personalization, along with advanced result-management capabilities. The "professional" version comes closest to the functionality of a web crawler: in addition to the basic and personal functions, it will track new search results in the background as they become available, track changes in web page content, and provide powerful tools for the analysis of search results.

For more information on the Copernic Agent, see:

http://www.copernic.com/en/products/agent/index.html
