Filtering of Forbidden Words and Search Results - Accidental Filtering, and Filtering by Content Analysis
Search results can also go missing for reasons that have little to do with filtering, and two examples illustrate this. The first is a change of the server on which a site is hosted: until the search bot indexes the site again at its new location, it might be excluded from search results. In the past, there were cases of results going missing after a DNS change. The second example, in which the same query returns different results at the same time in different places, is Google France and Google Germany. There are strong reasons to believe that, unlike the censorship cases in China, the difference in search results is due to sites “lost in translation” rather than straightforward, deliberate censorship.
The most sophisticated and powerful form of filtering for forbidden keywords is content analysis. It is also the most manipulative one, because it allows the set of URLs retrieved from the search engine to be substituted with a different set. The case of China revealed some unpleasant facts about potential secondary uses of technology. For instance, routers can recognize individual portions of the data that passes through them. This ability is used mainly to discover and stop worms and viruses, but the same principle can be applied to discovering and stopping certain content.
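To make the "same principle" concrete, here is a minimal sketch in Python of signature matching on a chunk of data. It is only an illustration under stated assumptions: the function name `match_signatures`, the `SIGNATURES` table, and the byte patterns in it are all made up for this example, and real network equipment uses far more elaborate techniques than a plain substring check.

```python
from typing import Dict, List


def match_signatures(payload: bytes, signatures: Dict[str, bytes]) -> List[str]:
    """Return the names of all signatures whose byte pattern occurs in payload."""
    return [name for name, pattern in signatures.items() if pattern in payload]


# Hypothetical signature table. Both entries are invented for illustration:
# one stands in for a worm signature, the other for a forbidden phrase.
SIGNATURES = {
    "example-worm": b"\x90\x90\xeb\x1f",
    "example-keyword": b"forbidden phrase",
}
```

The point of the sketch is that the matching code does not care what a pattern represents. Whether an entry in the table describes malicious code or an unwanted phrase, the lookup is the same.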
The precise mechanisms for content analysis are not known in detail, but generally it is done based on a profile, which describes what content can be included in search results and what must be skipped. The profile is treated as a query of its own, and only the results that match it are kept. Content filtering in this case is done dynamically, based on the user's current search query. When search results are retrieved from the search engine, they are matched against the profile, and everything that is not acceptable for this profile is deleted from the search results. The remaining set of URLs is then returned to the user, who thinks that these are all the search results for his or her query.
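Since the real mechanisms are not publicly known, the following Python sketch only illustrates the general flow described above: results come back from the search engine, each one is matched against a profile, and only the acceptable URLs are passed on. The names `SearchResult`, `Profile`, `accepts`, and `filter_results` are assumptions made for this example, and the profile here is reduced to a simple set of forbidden terms.

```python
import re
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class SearchResult:
    url: str
    text: str  # page text that the filter analyzes


@dataclass
class Profile:
    # Hypothetical profile: lowercase terms that make a page unacceptable.
    forbidden_terms: Set[str] = field(default_factory=set)

    def accepts(self, result: SearchResult) -> bool:
        """A result is acceptable if its text contains no forbidden term."""
        words = set(re.findall(r"\w+", result.text.lower()))
        return words.isdisjoint(self.forbidden_terms)


def filter_results(results: List[SearchResult], profile: Profile) -> List[str]:
    """Return only the URLs of results that the profile accepts."""
    return [r.url for r in results if profile.accepts(r)]
```

A short usage example, with invented URLs and page text:

```python
results = [
    SearchResult("http://a.example/", "Weather report for today"),
    SearchResult("http://b.example/", "An article about bannedtopic history."),
]
profile = Profile(forbidden_terms={"bannedtopic"})
visible = filter_results(results, profile)  # only the first URL remains
```

Note that the user receives `visible` with no indication that anything was removed, which is what makes this form of filtering hard to detect from the outside.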
Content analysis and filtering can be done either by the search engine or by the ISP (when acting as a proxy). Analyzing each page in a set of hundreds of search results takes significant computing power and slows down the delivery of results. For complex queries with many keywords, performing a search might become a very time-consuming operation. More generally, automated text analysis and filtering can be executed at the ISP proxy server, on the client's machine, on the Web host, or by Web crawlers. Some nannyware products implement it as part of their filtering abilities.