The algorithms (and especially their intricacies) that search engines use to compile the results for a particular search query have been discussed constantly on seochat.com. The lack of reliable information from the horse's mouth is probably fertile ground for so much speculation and guesswork about how search engines work, and about what makes a page appear at the top of a search one day, drop into the second ten the next day, and sometimes not show up in the search results at all.
There would not be so much discussion about all of these details if search engines were not such powerful instruments, both for driving traffic to one’s site and for delivering the information users need and want. Although search engines are not the main “censor” in today’s society, one cannot just wave a hand and say “leave it” when such “minor” details as filtering and censorship of any kind do influence the retrieved search results.
What is Filtering?
Filtering, as the name implies, is the process of selecting which of the pages that meet the search criteria will be displayed in the list of search results. When filtering is done to ensure the relevancy of the returned results, there is nothing wrong with it. Basically, searching is all about relevancy: getting only what is needed from the countless pages on the Web. But when filtering is driven by considerations other than relevancy, one starts to wonder whether filtering is not just another word for censorship.
There are many approaches to filtering sites that are deemed undesirable to visit. Although most filtering of content labeled as inappropriate is performed by ISPs, governments, employers, and parents, search engines filter sites as well. Let's make one point clear: filtering by search engines can hardly be regarded as intentional censorship, unless certain interests (as in the case of Google and Yahoo in China) try to manipulate the results.
Filtering by search engines is more subtle: there is no outright ban on visiting the site, and the page is not made inaccessible. Although there are cases when certain sites are excluded from search results entirely, most often sites with "inappropriate" content are simply buried in the second hundred of the search results, which is more or less a guarantee that they will not be noticed or visited.
How does this happen? I bet nobody knows the precise algorithms that affect the ranking of a page that contains forbidden keywords. And to make things more complicated, I seriously doubt that anybody has an idea of which words are labeled as forbidden by search engines. And when there are no clear rules for what is acceptable and what is not, it is logical to ask whether filtering is not just a refined form of censorship.
This is a very nice question, but its answer seems to be unclear. There is no such thing as a list of forbidden keywords that search engines match against while performing a search. If there were, then it would be simple; you would know which words are unacceptable and simply not use them in your pages or in your meta tags. But unfortunately this is not the case.
Even if you try to perform a search (not a safe search, just a search) using as keywords some vulgar expressions taken from any “Forbidden English (French, Spanish, etc.) Dictionary” you will retrieve an amazing number of search results. Of course, if you use any (parental control) filtering software that prevents you from getting “vulgar” results, you will not have the chance to visit these sites, but this is not the point.
The point is that, since there are no officially announced mechanisms that search engines use for filtering inappropriate keywords, there is plenty of room for censorship and manipulation of any kind. With vulgar words it is easier: just use the safe search option and the most shocking results will be omitted. Rumor has it that search engines keep a separate database of pages intended for safe searching.
But vulgar words are only one side of the filtering coin, and it seems that this coin has many sides. Politically incorrect keywords, for example, are another frequently discriminated-against category of search keywords. The censorship and filtering issues in China, Saudi Arabia, and North Korea are well known. The so-called "Great Firewall of China" has encompassed thousands of keywords that, when found on a page, prevent it from opening.
Freedom of speech and thought seem to be pretty vague concepts when commercial and political interests are at stake. Filtering (or pressure for filtering) comes from many directions. One of the best-known cases is Google versus the Chinese government. Using the same search query from different locations does produce a slightly or visibly different set of results, and this can be defended as sorting by (geographic) relevance; in China, however, the case was different. An entire set of search results used to be substituted with another set of links, or users who tried to search Google were redirected to other, "convenient" search engines. The second technique was easy to notice, while the first was a more subtle manipulation. Similar censorship and manipulation of search results have been reported in Saudi Arabia.
But sometimes the most worrying forms of filtering go unreported. For instance, it is common practice for employers to monitor the Internet activity of their employees. If there is a mechanism for substituting search results at the company gateway, and it goes unnoticed, this practice is likely more widespread than people realize. There are probably no agreements between companies and search engines to filter search results for employees, simply because search engines would never risk their reputation by indulging in such an activity. Still, one can bet that there has always been, and will continue to be, pressure from companies to make search engines exclude from the visible part of the Web information that companies feel is "inappropriate."
Similar cases of pressure on Google to exclude information involved the Church of Scientology (which attempted to make the search engine stop showing results it deemed inappropriate) and the French courts (over showing results for Nazi-related sites). Since these cases were loudly discussed in the media and attracted a lot of attention, they became known to the general public. How many similar cases are never discovered or discussed, and go unnoticed for years?
Besides governments, companies, and organizations, pressure for filtering comes from parents as well. Keeping in mind all the violence and pornography on the Net, it is not surprising that parents take measures to prevent their children from seeing material that may be harmful to them. Parents generally exercise control over their children's browsing by directly banning particular sites, search strings, or access to search engines altogether (often using software or with the assistance of their ISP), rather than by substituting search results.
Depending on where filtering is done, there are several ways to implement it. If filtering is done by an ISP or at the company gateway, there is usually no substitution of the search results originally retrieved from the search engine. Instead, when users try to visit a site, they get a message about a network problem, a connection timeout, or just a plain-text message that the site is banned. It is also possible, if the search string itself contains forbidden words, that the firewall will not allow the search to be performed at all, and again a message explaining the reason will be shown to the user.
Usually ISPs or company gateways filter based on port and IP address, but more advanced technology allows for subtler content-based filtering. Simple keyword filtering is most often accomplished by using either whitelists or blacklists. Whitelists are lists of sites, matched by particular keywords, that users are explicitly allowed to visit; all other traffic is blocked. Blacklists are just the opposite: lists of sites that match the criteria for forbidden keywords and that users are therefore not allowed to visit. Most "nannyware" (software for controlling children's access to the Internet) functions on the principle of whitelists and blacklists.
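The two list-based tactics can be sketched in a few lines. This is a minimal illustration, not any vendor's actual implementation; the keyword lists and the page texts are hypothetical examples.

```python
# Minimal sketch of whitelist/blacklist keyword filtering, as a gateway or
# nannyware product might apply it. Word lists here are made-up examples.
BLACKLIST = {"gambling", "violence"}    # keywords whose presence blocks a page
WHITELIST = {"encyclopedia", "museum"}  # keywords that explicitly allow a page

def allowed(page_text, mode="blacklist"):
    words = set(page_text.lower().split())
    if mode == "whitelist":
        # Whitelist mode: block everything unless an allowed keyword matches.
        return bool(words & WHITELIST)
    # Blacklist mode: allow everything unless a forbidden keyword matches.
    return not (words & BLACKLIST)

print(allowed("online gambling site"))              # False (blacklisted keyword)
print(allowed("virtual museum tour", "whitelist"))  # True (whitelisted keyword)
```

Note the asymmetry: whitelist mode fails closed (unknown sites are blocked), while blacklist mode fails open (unknown sites are allowed), which is why the two are often combined.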
It is obvious that lists of allowed/forbidden keywords are a simple tactic, and one prone to a high percentage of error. Even the most thoughtful planning cannot include every allowed keyword or exclude every forbidden one. Things get more complicated when one remembers that it is not unusual for a word to have several context-dependent meanings, not all of them necessarily bad or good. There are also compound words and phrases that contain a "forbidden" word but whose overall meaning has nothing to do with the meanings of the parts taken separately. In all of these cases, there is a risk of filtering good sites together with bad ones, or of allowing bad sites through with the good ones.
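The compound-word problem is easy to demonstrate. The snippet below, with a hypothetical forbidden list, shows how naive substring matching blocks an innocent place name that merely contains a "forbidden" string, while whole-word matching avoids that particular false positive at the cost of missing obfuscated spellings.

```python
# Demonstration of the over-blocking problem with keyword lists.
# The forbidden list is a hypothetical example.
FORBIDDEN = ["sex"]

def naive_block(text):
    # Substring matching: flags any text containing a forbidden string.
    return any(bad in text.lower() for bad in FORBIDDEN)

def word_block(text):
    # Whole-word matching avoids this class of false positive, but then
    # misses deliberate obfuscations such as "s-e-x".
    words = text.lower().split()
    return any(bad in words for bad in FORBIDDEN)

print(naive_block("Essex county council"))  # True  -- a false positive
print(word_block("Essex county council"))   # False -- whole-word match passes it
```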
Although filtering software keeps getting more sophisticated and allows users to customize it, it still makes many mistakes in distinguishing sites correctly. And where images are concerned, the results of filtering are often not exactly the desired ones. Since image filtering is based either on the description in the alt attribute (where one can write anything) or on the abundant presence of flesh tones (this is how pornography is separated from other pictures), it is not unusual to see such software ban a Renaissance painting (because of the nudity in the picture) while allowing an objectionable one (where, besides flesh, there is furniture, for example, which helps to fool the censorware).
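A toy version of the flesh-tone heuristic makes the weakness obvious. The RGB ranges and the threshold below are made-up illustrations of the general idea, not any real censorware's rules; real products use far more elaborate classifiers, yet suffer from the same kind of errors described above.

```python
# Toy flesh-tone filter: count pixels whose RGB values fall in a rough
# "skin" range and flag the image if their share exceeds a threshold.
# Ranges and threshold are illustrative assumptions, not a real product's.
def is_skin(r, g, b):
    return r > 95 and g > 40 and b > 20 and r > g and r > b and (r - min(g, b)) > 15

def flag_image(pixels, threshold=0.5):
    skin = sum(1 for p in pixels if is_skin(*p))
    return skin / len(pixels) > threshold

# 8 of 10 pixels are skin-toned: a nude painting would trip this exactly
# as easily as an objectionable photograph would.
mostly_skin = [(200, 140, 120)] * 8 + [(30, 30, 30)] * 2
print(flag_image(mostly_skin))  # True
```

Padding the frame with non-skin pixels (the "furniture" trick) drives the ratio under the threshold, which is precisely how such heuristics get fooled.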
Another approach to filtering forbidden keywords relies on meta tags. Meta tags are not used by Google (or at least not extensively), but they are still used by many other search engines. With this approach it is clear that, if a forbidden word appears in a meta tag, the page will most likely not be displayed in the search results at all, or will be shown somewhere in the second hundred or thousand of the results.
Meta tags are a double-edged sword. If a forbidden word appears in them, the page is banned. On the other hand, if an allowed word describing the page is missing, parental control software might classify the site as not rated and skip it. The reason for the second case is that some nannyware programs rely on meta tags to rate a site, and if the meta tags do not say that the site is rated, it might automatically be added to the list of banned sites.
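A meta-tag check of this kind can be sketched with Python's standard HTML parser. The page markup and the forbidden list below are hypothetical; the point is only to show how a filter could read the "keywords" meta tag and ban the page on a match.

```python
from html.parser import HTMLParser

# Sketch of meta-tag-based filtering: extract the "keywords" meta tag and
# compare it against a forbidden list. Page and word list are made up.
FORBIDDEN = {"violence"}

class MetaKeywords(HTMLParser):
    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "keywords":
            self.keywords = [k.strip().lower() for k in a.get("content", "").split(",")]

page = '<html><head><meta name="keywords" content="news, violence"></head></html>'
parser = MetaKeywords()
parser.feed(page)
print(bool(set(parser.keywords) & FORBIDDEN))  # True -> page would be filtered out
```

A page with no keywords meta tag at all would yield an empty list here, which is exactly the "not rated" case that some nannyware treats as grounds for banning.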
There are also cases when search results are missing for reasons that have little to do with filtering. One example is when the server hosting a site changes: until the searchbot indexes the site again at its new location, it might be excluded from the search results. In the past, there were such cases of missing results after a DNS change. Another example of different search results at the same time but in different places is Google France versus Google Germany. There are strong reasons to believe that, unlike the censorship cases in China, the difference in results here is due to sites "lost in translation" rather than to straightforward, deliberate censorship.
The most sophisticated and powerful form of filtering for forbidden keywords is content analysis. It is also the most open to manipulation, because it allows substitution of the set of URLs retrieved from the search engine. The case of China revealed some unpleasant facts about potential secondary uses of technologies. For instance, routers can recognize separate portions of data; this ability is used mainly to discover and stop worms and viruses, but the same principle can be applied to discovering and stopping certain content.
The precise mechanisms of content analysis are not known in detail, but generally it is done based on a profile, which describes what content may be included in search results and what must be skipped. The profile is treated as a query, and the results that match this query are retrieved from the search engine. Content filtering in this case is done dynamically, based on the user's current search query. When the search results are retrieved from the search engine, they are matched against the profile, and everything that is not acceptable for that profile is deleted. The remaining set of URLs is then returned to the user, who assumes these are all the results for his or her query.
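The profile-based post-filtering described above can be sketched as a simple pipeline stage. All names and data here are invented for illustration; the real profiles and matching rules are, as noted, not publicly known.

```python
# Sketch of profile-based post-filtering: results are retrieved first, then
# every hit is matched against a profile of forbidden terms and silently
# dropped. Profile and result data are hypothetical.
PROFILE = {"forbidden": {"banned-topic"}}

def filter_results(results):
    kept = []
    for hit in results:
        text = (hit["title"] + " " + hit["snippet"]).lower()
        if not any(term in text for term in PROFILE["forbidden"]):
            kept.append(hit)
    return kept  # the user only ever sees this reduced set

raw = [
    {"title": "Harmless page", "snippet": "nothing to see"},
    {"title": "News on banned-topic", "snippet": "details inside"},
]
print(filter_results(raw))  # only the harmless page survives
```

Because the deletion happens after retrieval and before display, the user has no signal that anything was removed, which is what makes this form of filtering the hardest to detect.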
Content analysis and filtering can be done either by the search engine or by the ISP (when acting as a proxy). Obviously, analyzing each page in a set of hundreds of search results takes significant computing power and slows down the delivery of results. For complex queries with many keywords, performing a search might become a very time-consuming operation. Automated text analysis and filtering can be executed at the ISP's proxy server, on the client's machine, on the Web host, or by Web crawlers. Some nannyware products implement it as part of their filtering abilities.