Multilingual Sites and Search Engines: Part II
As discussed in the first part, multilingual sites face special challenges in search engine optimization and placement, in addition to the general challenges and requirements faced by pages in English.
NLP (Natural Language Processing) Is the Key
The content on the page appears to be the most important factor in identifying its language. The identification is done with NLP (Natural Language Processing) algorithms, which match the text on the page against the grammar and structural rules of a specific language.
Grammar and structural rules are especially useful for telling apart similar languages that share an alphabet. For instance, when I searched Google for "лаптоп" (the Bulgarian for "laptop"), the results varied according to my search preferences. When I limited the search to pages in Bulgarian (without restricting the country), Google returned about 3,500 pages. When I set no restriction on language or country, I got over 16,000 results from Bulgarian, Russian, Macedonian, Ukrainian, and other sites: basically everywhere a Cyrillic alphabet is used and the word is spelled the same way.
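To make the idea of content-based identification concrete, here is a minimal sketch in Python. It is not how Google or any other engine works; real systems rely on far richer statistical models. It only assumes that a few function words and marker letters are enough to separate four Cyrillic-script languages. The word lists, the letter sets, and the function name `guess_cyrillic_language` are illustrative assumptions, not part of any real library.

```python
import re

# Hypothetical, deliberately tiny lists of function words chosen because
# they do not occur in the other three languages considered here.
FUNCTION_WORDS = {
    "bg": {"че", "със", "това", "който", "през", "са", "във"},
    "ru": {"что", "это", "его", "они", "если", "чтобы"},
    "uk": {"що", "це", "як", "або", "його", "вони", "від"},
    "mk": {"што", "ова", "од", "со", "во", "кој"},
}

# Letters that, among these four languages only, belong to one alphabet.
# Written as escapes so that look-alike Latin letters cannot slip in.
MARKER_LETTERS = {
    "bg": set(),
    "ru": set("\u044b\u044d\u0451"),          # ы э ё
    "uk": set("\u0456\u0457\u0454\u0491"),    # і ї є ґ
    "mk": set("\u0453\u045c\u0455\u0458\u0459\u045a\u045f"),  # ѓ ќ ѕ ј љ њ џ
}


def guess_cyrillic_language(text):
    """Return 'bg', 'ru', 'uk', 'mk', or None when there is no evidence."""
    # Split the lowercased text into runs of letters (Unicode-aware).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {}
    for lang, words in FUNCTION_WORDS.items():
        letters = MARKER_LETTERS[lang]
        # A token counts once if it is a known function word or if it
        # contains a letter unique to this language.
        scores[lang] = sum(
            1 for token in tokens if token in words or letters & set(token)
        )
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Note that the single word "лаптоп" gives this sketch nothing to work with, so it returns `None`. That matches the point made above: the surrounding text, not the search term itself, is what reveals the language.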
The order of the retrieved results also varied considerably. The top 10 results from the first search (pages in Bulgarian only) now appeared within the top 30, and new entries from foreign sites occupied places in the top 10. It is important to note that most of the pages in the top 10 of both searches did not explicitly declare their language in the <html lang="xx"> tag, so the search engine must have identified their language in some other way.
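Checking whether a page declares its language is easy to automate. The sketch below uses Python's standard `html.parser` module to read the `lang` attribute of the `<html>` tag; the class and function names are my own, and the sample markup in the usage note is made up for illustration.

```python
from html.parser import HTMLParser


class LangAttributeFinder(HTMLParser):
    """Records the lang attribute of the first <html> start tag, if any."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.seen_html = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling us.
        if tag == "html" and not self.seen_html:
            self.seen_html = True
            # Treat a missing or empty lang attribute as "not declared".
            self.lang = dict(attrs).get("lang") or None


def declared_language(html_text):
    """Return the declared language code, or None if the page has none."""
    finder = LangAttributeFinder()
    finder.feed(html_text)
    return finder.lang
```

For example, `declared_language('<html lang="bg"><body>лаптоп</body></html>')` returns `"bg"`, while the same page without the attribute returns `None`, which is the situation most of the top-ranked pages were in.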
Like Google, AltaVista performed well with the same search string "лаптоп" in Bulgarian, although occasional pages in other languages that use the Cyrillic alphabet appeared. The number of search results (1,470) was about half of what Google found, although Google returned more duplicates (duplicates being results that belong to the same site or domain). It is worth noting that AltaVista supports wildcards and truncation for Cyrillic as well.
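Truncation means matching every word that begins with a given stem. As a rough illustration of why this works for Cyrillic just as it does for Latin text, here is a small Python sketch. The `*` wildcard syntax and the function names are assumptions made for the example; this is not AltaVista's actual query syntax or implementation.

```python
import re


def wildcard_to_regex(pattern):
    """Turn a pattern such as 'лаптоп*' into a compiled regular expression.

    '*' stands for any run of word characters (possibly empty); every
    other character is matched literally.
    """
    parts = [r"\w*" if ch == "*" else re.escape(ch) for ch in pattern]
    return re.compile("".join(parts), re.IGNORECASE)


def matching_words(pattern, text):
    """Return the words in text that match the wildcard pattern in full."""
    regex = wildcard_to_regex(pattern)
    # In Python 3, \w is Unicode-aware, so Cyrillic letters count as
    # word characters without any extra configuration.
    return [word for word in re.findall(r"\w+", text) if regex.fullmatch(word)]
```

With the pattern `"лаптоп*"`, the sentence "Нови лаптопи и един лаптоп, но не таблет" yields `["лаптопи", "лаптоп"]`: both the plural and the base form match, while unrelated words do not.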
In addition to distinguishing similar languages that share an alphabet, Natural Language Processing could, I suspect, be very useful in distinguishing the varieties and dialects of a particular language, provided that they differ explicitly in grammar, structure, and vocabulary.
There are good reasons to believe that when performing a general search (general here meaning without restricting the language), search engines perform only simple text processing of the pages they find. They retrieve every page where the particular string of letters, numbers, and other symbols, encoded as a series of bytes, occurs, matching just the characters and not bothering with details of the surrounding text. By contrast, when the search language is explicitly defined, search engines use NLP to deliver a refined set of results. It is probably the search engines' Natural Language Processing algorithms, together with the grammatical and morphological intricacies of particular languages, that produce so many results in English and (very often) so few in other languages.
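The difference between the two kinds of search can be sketched in a few lines of Python. This is only a toy model of the hypothesis above, not a description of any real engine: the page records, the example.com URLs, and the precomputed `lang` field (standing in for whatever language identification the engine performs) are all invented for illustration.

```python
# Toy index: every entry is made up for this example.
PAGES = [
    {"url": "https://example.com/bg/a", "text": "Евтин лаптоп с гаранция", "lang": "bg"},
    {"url": "https://example.com/ru/b", "text": "Купить лаптоп недорого", "lang": "ru"},
    {"url": "https://example.com/mk/c", "text": "Нов лаптоп по ниска цена", "lang": "mk"},
    {"url": "https://example.com/bg/d", "text": "Нов телефон", "lang": "bg"},
]


def raw_byte_search(query, pages):
    """General search: match the query's bytes, ignoring the language."""
    needle = query.encode("utf-8")
    return [p["url"] for p in pages if needle in p["text"].encode("utf-8")]


def language_restricted_search(query, pages, lang):
    """Restricted search: the same byte match, then a language filter."""
    hits = set(raw_byte_search(query, pages))
    return [p["url"] for p in pages if p["url"] in hits and p["lang"] == lang]
```

Here `raw_byte_search("лаптоп", PAGES)` returns the Bulgarian, Russian, and Macedonian pages alike, because the twelve UTF-8 bytes of the word are identical in all three. `language_restricted_search("лаптоп", PAGES, "bg")` returns only the first page, which mirrors the much smaller result count seen when the search was limited to Bulgarian.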
By Tsvetanka Stoyanova