Filtering of Forbidden Words and Search Results - How are Forbidden Keywords Filtered?
There are several ways to filter, depending on where the filtering is done. If it is done by an ISP or at the company gateway, the search results retrieved from the search engine are usually left unchanged. Instead, when users try to visit a blocked site, they see a network error, a connection timeout, or a plain-text message saying that the site is banned. If the search string itself contains forbidden words, the firewall may also refuse to let the search be performed at all, and again the user is shown a message explaining the reason.
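The gateway behavior described above, refusing a search whose query string contains forbidden words, can be sketched roughly as follows. This is only an illustration: the word list, the query parameter names (`q`, `query`), and the wording of the refusal message are assumptions, not the behavior of any particular firewall product.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical word list and parameter names, chosen for illustration only.
FORBIDDEN = {"gambling"}
QUERY_PARAMS = ("q", "query")


def search_terms(url: str) -> list[str]:
    """Extract the lowercased search terms from a search URL."""
    params = parse_qs(urlsplit(url).query)
    terms = []
    for name in QUERY_PARAMS:
        for value in params.get(name, []):
            terms.extend(value.lower().split())
    return terms


def gateway_decision(url: str) -> str:
    """Return 'pass', or a refusal message that names the blocked terms."""
    blocked = [term for term in search_terms(url) if term in FORBIDDEN]
    if blocked:
        return "403 Forbidden: search contains blocked term(s): " + ", ".join(blocked)
    return "pass"
```

A request such as `http://search.example.com/?q=online+gambling+tips` would be refused with an explanatory message, while a search for `chess openings` would pass through untouched.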
ISPs and company gateways usually filter by port and IP address, but more advanced technology allows subtler, content-based filtering. Simple keyword filtering is most often done with either whitelists or blacklists. A whitelist is a list of sites matching particular keywords that users are explicitly allowed to visit; all other traffic is blocked. A blacklist is just the opposite: a list of sites that match the criteria for forbidden keywords and that users are therefore not allowed to visit. Most “nannyware” (software for controlling children's access to the Internet) works on the principle of whitelists and blacklists.
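The difference between the two list types comes down to what happens to a site that is on neither list. A minimal sketch, assuming the lists are simple sets of host names (the hosts below are made up; a real product would load them from a maintained database):

```python
from urllib.parse import urlsplit

# Hypothetical lists, for illustration only.
WHITELIST = {"kids.example.org", "encyclopedia.example.com"}
BLACKLIST = {"casino.example.net"}


def host_of(url: str) -> str:
    """Return the lowercased host name of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def allowed_by_whitelist(url: str) -> bool:
    # Whitelist mode: only listed hosts pass; everything else is blocked.
    return host_of(url) in WHITELIST


def allowed_by_blacklist(url: str) -> bool:
    # Blacklist mode: listed hosts are blocked; everything else passes.
    return host_of(url) not in BLACKLIST
```

An unknown site such as `http://news.example.com/` is blocked in whitelist mode but allowed in blacklist mode, which is why whitelists tend to over-block and blacklists tend to under-block.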
Lists of allowed or forbidden keywords are a simple tactic, and one that is prone to a high error rate. Even the most thoughtful planning cannot include every allowed keyword or exclude every forbidden one. Things get more complicated when one remembers that a word often has several context-dependent meanings, not all of them necessarily bad or good. There are also compound words and phrases that contain a “forbidden” word even though the meaning of the whole has nothing to do with the meanings of its parts. In all of these cases, there is a risk of blocking good sites together with bad ones, or of allowing bad sites together with good ones.
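Both kinds of error are easy to reproduce. The sketch below compares naive substring matching with whole-word matching against a hypothetical one-word blacklist; neither function is taken from a real filtering product.

```python
import re

# Hypothetical one-word blacklist, for illustration only.
FORBIDDEN = {"sex"}


def naive_match(text: str) -> bool:
    """Flag text if a forbidden word occurs anywhere, even inside another word."""
    lowered = text.lower()
    return any(word in lowered for word in FORBIDDEN)


def whole_word_match(text: str) -> bool:
    """Flag text only if a forbidden word appears as a separate word."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return not words.isdisjoint(FORBIDDEN)
```

Naive matching flags "Essex County Council" because the forbidden string is embedded in a place name. Whole-word matching avoids that mistake, but it still flags "Sex education for teachers", because a keyword list has no notion of context.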
Although filtering software keeps getting more sophisticated and lets the user customize it, it still makes many mistakes when trying to tell sites apart. Where images are concerned, the results of filtering are often not the desired ones. Filtering is based either on the description in the alt attribute of the image tag (where one can write anything) or on the abundant presence of flesh tones (this is how pornography is separated from other pictures). It is therefore not unusual to see such software ban a Renaissance painting because of the nudity in the picture, yet allow an objectionable image in which furniture, for example, appears alongside the flesh and helps to fool the censorware.
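To see why a flesh-tone test is so easy to fool, consider a rough sketch that counts the share of skin-colored pixels in an image. The RGB thresholds and the 50% cutoff below are illustrative assumptions, not the rule used by any particular product, and the image is represented simply as a list of (R, G, B) tuples.

```python
def looks_like_skin(pixel: tuple[int, int, int]) -> bool:
    """Crude RGB skin-tone test; the thresholds are assumed for illustration."""
    r, g, b = pixel
    return (
        r > 95 and g > 40 and b > 20
        and max(pixel) - min(pixel) > 15
        and abs(r - g) > 15
        and r > g and r > b
    )


def skin_ratio(pixels) -> float:
    """Fraction of pixels classified as skin-colored (0.0 for an empty image)."""
    pixels = list(pixels)
    if not pixels:
        return 0.0
    return sum(looks_like_skin(p) for p in pixels) / len(pixels)


def is_flagged(pixels, threshold: float = 0.5) -> bool:
    # Hypothetical cutoff: flag the image when at least half of it is "skin".
    return skin_ratio(pixels) >= threshold
```

With this kind of rule, a painting whose canvas is mostly skin tones is flagged regardless of its artistic value, while an image in which dark furniture fills more than half of the frame slips under the cutoff.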
Another approach to filtering forbidden keywords relies on meta tags. Meta tags are not used by Google (or at least not extensively), but they are still used by many other search engines. With this approach, if a forbidden word appears in a page's meta tags, the page will most likely not be displayed in the search results at all, or will be pushed down to somewhere in the second hundred or thousand results.
Meta tags are a double-edged sword. If they contain a forbidden word, the page is banned. On the other hand, if an allowed word describing the page is missing, parental control software might classify the site as not rated and skip it. The second case arises because some nannyware programs rely on meta tags to rate a site, and if the meta tags do not say that the site is rated, it might automatically be added to the list of banned sites.
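Both failure modes can be illustrated with a small classifier built on Python's standard `html.parser` module. The meta tag names (`keywords`, `rating`), the forbidden word, and the accepted rating values are assumptions made for this sketch; actual nannyware may rely on entirely different labels.

```python
from html.parser import HTMLParser

# Hypothetical policy, for illustration only.
FORBIDDEN = {"casino"}
ALLOWED_RATINGS = {"general", "safe for kids"}


class MetaCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        content = (attrs.get("content") or "").lower()
        if name:
            self.meta[name] = content


def classify(html: str) -> str:
    """Return 'banned', 'unrated' or 'allowed' based on meta tags alone."""
    parser = MetaCollector()
    parser.feed(html)
    keywords = {k.strip() for k in parser.meta.get("keywords", "").split(",")}
    if keywords & FORBIDDEN:
        return "banned"    # a forbidden word appears in the keywords tag
    if parser.meta.get("rating") not in ALLOWED_RATINGS:
        return "unrated"   # no acceptable rating: may be blocked by default
    return "allowed"
```

A page with harmless keywords but no rating tag comes back as `unrated`, which is exactly the case where a perfectly good site can end up on the banned list.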
By Tsvetanka Stoyanova