Multilingual Sites and Search Engines: Part 1 - Character Set Issues
Although character set issues relate to alphabets rather than to languages, they are worth mentioning in this article. It is not enough to specify only the language of a page; its encoding must be specified as well. As a general rule, one encoding can serve more than one language (e.g. Windows-1251 is a Cyrillic encoding, so it can be used for Russian, Bulgarian, and other pages). The opposite is also true: a single language can have more than one encoding (an ISO one, a Windows one, a Mac one, and so forth). Of course, there is Unicode, but it often causes more problems (in the proper display of pages) than it solves. Because of this, Web developers are reluctant to use it as a universal approach.
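To make the relationship between encodings and languages concrete, here is a minimal Python sketch. The sample phrases and the list of encodings are illustrative choices, not taken from any particular site; the codec names are the ones shipped with Python's standard library.

```python
# One encoding, several languages: Windows-1251 (cp1251) covers both
# Russian and Bulgarian text.
russian = "Привет, мир"
bulgarian = "Здравей, свят"

for text in (russian, bulgarian):
    raw = text.encode("cp1251")
    assert raw.decode("cp1251") == text  # round trip is lossless

# One language, several encodings: the same Russian phrase becomes a
# different byte sequence in each Cyrillic-capable encoding.
encodings = ["cp1251", "koi8_r", "iso8859_5", "mac_cyrillic", "utf-8"]
encoded = {name: russian.encode(name) for name in encodings}

for name, raw in encoded.items():
    print(f"{name:13} {len(raw):2d} bytes  {raw.hex()}")

# Reading bytes with the wrong encoding does not fail, it silently
# produces different (garbled) text.
garbled = encoded["cp1251"].decode("koi8_r")
print(garbled == russian)
```

Because the wrong decoder produces garbage rather than an error, a page has to declare its encoding explicitly; a browser or crawler cannot reliably recover it from the bytes alone.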
Since encoding is more about display than search, is there a relationship between encoding and search results? Yes, there is. First, it affects indexing. Although most major search engines index pages in any encoding, some search engines (national ones in particular) still index only a limited number of charsets. So if your site is excluded from the results of a particular search engine, the reason could be that its pages are in a charset that engine does not support.
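The indexing check described above might look roughly like the following sketch. It is an assumption about how an indexer could behave, not a description of any real search engine: the `SUPPORTED` set and the function names `declared_charset` and `is_indexable` are hypothetical, and the charset is read from a `Content-Type` header value using Python's standard library.

```python
import codecs
from email.message import Message

# Hypothetical set of charsets an engine is willing to index,
# written as canonical Python codec names.
SUPPORTED = {"utf-8", "iso8859-1", "cp1252"}


def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def is_indexable(content_type, supported=SUPPORTED):
    """Return True if the declared charset maps to a supported codec."""
    charset = declared_charset(content_type)
    if charset is None:
        return False  # nothing declared: this sketch simply rejects the page
    try:
        canonical = codecs.lookup(charset).name  # resolves aliases
    except LookupError:
        return False  # unknown charset label
    return canonical in supported


print(is_indexable("text/html; charset=UTF-8"))
print(is_indexable("text/html; charset=windows-1251"))
```

With this particular `SUPPORTED` set, a Windows-1251 page would be skipped even though its content is perfectly valid, which is the kind of exclusion the paragraph warns about.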
Second, some search engines index and retrieve pages in less common encodings by recoding the character set (i.e. converting it to a different set). This operation, performed back and forth, can also influence search results. This is especially true for languages that use special symbols, for instance accented characters.
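The effect of recoding on accented characters can be illustrated with a short sketch. It assumes a hypothetical engine that stores text in a single internal charset and replaces any character that charset cannot represent; the Czech sample phrase and the helper `round_trip` are illustrative only.

```python
def round_trip(text, charset):
    """Encode text into charset and decode it back.

    Characters the charset cannot represent are replaced with '?',
    which mimics a lossy conversion.
    """
    return text.encode(charset, errors="replace").decode(charset)


page_text = "Příliš žluťoučký kůň"  # Czech sample with many accented letters
query = "kůň"

# ISO-8859-2 (Central European) contains every character of the phrase.
lossless = round_trip(page_text, "iso8859_2")

# ISO-8859-1 (Western European) lacks most of them.
lossy = round_trip(page_text, "iso8859_1")

print(query in lossless)  # the query still matches the stored text
print(query in lossy)     # the query no longer matches
print(lossy)
```

Once the accented letters are gone from the stored copy, a search for the correctly spelled word can no longer match the page, even though the page itself displays properly.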
Third, search engines that allow wildcard symbols and truncation very often do not fully support these functions for non-Latin charsets.
Content Reveals the Language
It is hardly surprising that, when serving requests restricted to pages in a particular language, Google determines the language from the content of the page and from the context in which the search string occurs. How do search engines know so many languages? The answer is simple: they use NLP (Natural Language Processing), i.e. they have some type of database that contains words in different languages, together with grammar and structural rules specific to each language, which allows them to analyze the text and determine the dominant language of a page. More details about the mechanism of NLP and about other factors that influence sites in foreign languages are included in the second part of the article.
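The word-database idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how Google or any other engine actually works: the tiny `STOPWORDS` lists and the function `guess_language` are hypothetical, and a real system would use far larger vocabularies plus the grammar and structural rules mentioned above.

```python
import re

# Hypothetical miniature "database" of very common words per language.
STOPWORDS = {
    "english": {"the", "and", "of", "is", "in", "to"},
    "german": {"der", "die", "und", "ist", "das", "nicht"},
    "french": {"le", "la", "et", "est", "les", "des"},
}


def guess_language(text):
    """Return the language whose common words occur most often, or None."""
    words = re.findall(r"\w+", text.lower())
    scores = {
        language: sum(word in vocabulary for word in words)
        for language, vocabulary in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_language("The cat is in the garden"))
print(guess_language("Der Hund ist nicht in das Haus gegangen"))
print(guess_language("Le chat est dans la maison et les chiens"))
```

Counting matches across the whole text is what lets such a scheme pick the dominant language of a page even when a few words from another language appear in it.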