Learning to Crawl: an Investigation of the Personal Web Crawler - A better way? (Page 2 of 4)

There is, of course, another way to approach all this. Why not cut out the middleman? Rather than expecting Google or whoever to magically understand their precise requirements, users can use their own personal web crawler to seek out the information they need. This has a number of significant advantages.
Targeted information

The main advantage is that, unlike those used by the search engines, your own web crawler can be configured to target precisely what you need. As an example, it’s useful to think about media research organizations whose business it is to seek out media reports about their clients. These clients are typically famous individuals or well-known corporations who need to locate information about themselves, usually as the basis for commercial decision making. The Web is obviously a major repository for such information, but a vast amount of pointless information is sure to be found alongside the useful data: blog references, mentions in passing in articles dedicated to other subjects, and references to namesakes are just some of the kinds of information that would typically need to be filtered out to make such research worthwhile.
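As a rough illustration of the kind of filtering this calls for, here is a minimal sketch in Python. Everything in it is an assumption made for the example rather than part of any particular crawler: the `Page` type, the client name, the host and namesake markers, and the mention threshold are all hypothetical, and a real filter would need far more careful rules.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Page:
    """A fetched page: its URL and its plain text (hypothetical type)."""
    url: str
    text: str


# Hypothetical settings for one client; every value here is illustrative.
CLIENT_NAME = "Acme Widgets"
EXCLUDED_HOST_PARTS = ("blog",)          # crude way to skip blog hosts
NAMESAKE_MARKERS = ("acme widgets fc",)  # phrases that signal a namesake
MIN_MENTIONS = 2                         # fewer counts as a mention in passing


def is_relevant(page: Page) -> bool:
    """Return True if the page looks like a substantive report on the client."""
    host = urlparse(page.url).hostname or ""
    if any(part in host for part in EXCLUDED_HOST_PARTS):
        return False  # blog reference
    text = page.text.lower()
    if any(marker in text for marker in NAMESAKE_MARKERS):
        return False  # namesake, not the client
    mentions = len(re.findall(re.escape(CLIENT_NAME.lower()), text))
    return mentions >= MIN_MENTIONS  # drop mentions in passing
```

Each rule maps to one of the noise types named above: blog hosts, namesakes, and passing mentions. Substring checks like these are deliberately crude, and they would misfire on real data without tuning.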
A personal web crawler provides one answer to this situation, since it can be configured to ignore certain types of references, or simply to search only certain sites or types of sites. This offers a high degree of control over the information that is returned for a particular search, vastly increasing the likelihood that it will be relevant.

Background operation

Another key advantage of the personal web crawler is that it can work in the background. Unlike typical searches, which must be carried out actively by typing search terms into the search engine’s interface, a web crawler will continuously monitor the Web (or at least the areas of it you specify) and return results as they are uncovered. This is a far more efficient approach for those who seek similar types of information on a regular or ongoing basis.

Privacy
The privacy benefits of personal crawlers shouldn’t be underestimated, especially in these times of increasing concern over the amount of personal data gathered and retained by Google and other major search engines. This data has commercial value, allowing the delivery of precisely targeted advertising, but many individuals quite justifiably believe that what they search for on the web is nobody’s business but their own.
With your own crawler, privacy largely ceases to be a concern, since search records are not visible beyond the local network. On a similar theme, and unlike public search engines, personal crawlers cannot be censored at the search-engine level, making them highly useful in localities where search results are filtered for political or social reasons.
Of course, there are limitations to the uses of personal web crawlers. A major constraint is imposed by the sheer scale of the Web: to crawl all of it with any degree of efficiency would require a server farm approaching the size of Google’s, which is obviously an impossible aspiration for an individual or small organization. For this reason, crawlers are most suitable when searches can be usefully restricted to very specific areas of the Web, such as newspaper and media sites, as in the above example.

There are also limits on the sheer volume of information it is reasonable to expect a personal crawler to gather and index. Again, to attempt to index the entire Web would be ridiculous as well as pointless; large-scale search engines can do that kind of thing much more efficiently than any individual. The core strength of the personal crawler lies in accurately indexing large amounts of very specific information. Used appropriately, it can remove much of the drudgery of this kind of task, leaving the user free to use the information it gathers rather than waste time looking for it.
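To make the idea of a deliberately restricted crawl concrete, the following Python sketch keeps a crawl inside a fixed set of hosts and stops after a page budget. The host names, the `MAX_PAGES` limit, and the `fetch_links` callback are assumptions made for the example; the callback stands in for whatever code downloads a page and extracts its links.

```python
from collections import deque
from typing import Callable, Iterable, List
from urllib.parse import urlparse

# Hypothetical scope settings; the hosts and the limit are placeholders.
ALLOWED_HOSTS = {"news.example.com", "media.example.org"}
MAX_PAGES = 50


def in_scope(url: str) -> bool:
    """True only for http(s) URLs whose host is on the allowlist."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and parts.hostname in ALLOWED_HOSTS


def crawl(seeds: Iterable[str],
          fetch_links: Callable[[str], Iterable[str]],
          max_pages: int = MAX_PAGES) -> List[str]:
    """Breadth-first crawl that never leaves the allowlist or exceeds max_pages.

    fetch_links is supplied by the caller and returns the URLs linked
    from the given page.
    """
    queue = deque()
    seen = set()
    for url in seeds:
        if in_scope(url) and url not in seen:
            seen.add(url)
            queue.append(url)

    visited: List[str] = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen and in_scope(link):
                seen.add(link)
                queue.append(link)
    return visited
```

Passing the link fetcher in as a parameter keeps the scope logic separate from networking, so it can be exercised with an in-memory link graph. A crawler run against real sites would also need to respect robots.txt and limit its request rate, which this sketch leaves out.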
By Bruce Coker