NLP (Natural Language Processing) Is the Key
Content on the page appears to be the most important factor in identifying its language. Search engines seem to do this with the help of NLP (Natural Language Processing) algorithms that match the text on the page against the grammar and structural rules of a specific language.
The concept of grammar and structural rules is especially useful where similar languages that use the same alphabet are concerned. For instance, when I searched Google for “лаптоп” (the Bulgarian for “laptop”), the results varied according to my search preferences. When I limited the search to pages in Bulgarian only (without restricting the country), Google returned about 3,500 pages. When I set no restriction on language or country, I got over 16,000 results from Bulgarian, Russian, Macedonian, Ukrainian, and other sites, basically from everywhere a Cyrillic alphabet is used and the word “laptop” is spelled the same way.
There were also major variations in the order of the retrieved search results. The top 10 results from the first search (pages in Bulgarian only) now fell within the top 30, and new entries from foreign sites appeared in the top 10. It is important to note that most of the pages in the first 10 in both searches did not explicitly define the language in the <html lang="xx"> tag, so their language must have been identified by the search engine in some other way.
Like Google, AltaVista performed well with the same search string “лаптоп” in Bulgarian, although there were occasional pages in other languages that also use the Cyrillic alphabet. The number of search results (1,470) was less than half of what Google found, although with Google there were more duplicates (duplicates being results that belonged to the same site or domain). It is worth noting that AltaVista supports wildcards and truncation for Cyrillic as well.
I guess that in addition to distinguishing similar languages that use a similar alphabet, Natural Language Processing could be very useful in distinguishing the varieties and dialects of a particular language, provided that there are explicit differences in grammar, structure, and vocabulary.
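One simple way to tell apart languages that share an alphabet is to count very frequent function words. The sketch below is only an illustration of that idea, not a description of how any search engine actually works: the function name `guess_language` and the tiny word lists are assumptions made for this example, and a real system would rely on much larger lists or on character statistics.

```python
import re

# Hypothetical, deliberately tiny lists of frequent function words.
STOPWORDS = {
    "bg": {"и", "на", "в", "за", "се", "от", "да", "е", "че", "това"},
    "ru": {"и", "в", "не", "на", "что", "это", "как", "для", "из", "он"},
    "uk": {"і", "в", "на", "що", "не", "це", "як", "для", "від", "та"},
}

def guess_language(text):
    """Return the language code whose function words occur most often.

    Returns None when no function word is found at all. Ties go to the
    language listed first in STOPWORDS.
    """
    words = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(1 for word in words if word in stopwords)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_language("Това е лаптоп за работа"))  # a Bulgarian sentence
print(guess_language("лаптоп"))                   # a single shared word
```

A single word such as “лаптоп” carries no function words, so the sketch cannot assign it to any language. That matches the observation above that the surrounding text, not the search term itself, is what identifies the language of a page.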
There are good reasons to believe that when performing a general search (general here means without restricting the language), search engines perform only simple text processing of the pages they find. They retrieve every page where the particular string of letters, numbers, and other symbols, encoded as a series of bytes, occurs. They just match the characters and do not bother with details about the surrounding text. By contrast, when the search language is explicitly defined, search engines use NLP to deliver a refined set of results. It is probably the search engines' Natural Language Processing algorithms, together with the grammatical and morphological intricacies of particular languages, that produce so many results in English and (very often) so few in other languages.
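The first, language-blind stage described above can be pictured as plain byte matching. The following is a minimal sketch under that assumption; the function `raw_match` and the sample pages are invented for illustration.

```python
def raw_match(query, pages, encoding="utf-8"):
    """Return the names of pages whose raw bytes contain the encoded query.

    No language knowledge is involved: the query is turned into bytes
    and looked up as a plain substring.
    """
    needle = query.encode(encoding)
    return [name for name, raw in pages.items() if needle in raw]

# Hypothetical pages, stored as the bytes a crawler would have fetched.
pages = {
    "bg.html": "Нов лаптоп на добра цена".encode("utf-8"),
    "ru.html": "Купить лаптоп недорого".encode("utf-8"),
    "en.html": b"Buy a new laptop today",
}

hits = raw_match("лаптоп", pages)
print(hits)
```

Both the Bulgarian and the Russian page match, because the byte sequence for the word is identical in each. Narrowing the hits to one language needs a second step that looks at the rest of the text.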
What About Right-to-left and Non-character Languages?
I believe that right-to-left and non-character based languages (like Chinese) are treated in a similar way as far as text processing and NLP are concerned. The text processing stage probably differs slightly for non-character based languages, but since they also have repeating patterns that carry the meaning, the principle remains. In practice, when a search is performed, search engines first match the string to find suitable results and are not interested in language details unless a search language is specified.
On the other hand, for languages where vowels are generally omitted in writing and where several words are written in the same way (like Hebrew and Arabic), simple matching of the symbols found tends to generate less relevant results. This in turn requires NLP in order to place the words in the right context. In addition, the morphological and grammatical structure of these languages tends to be more complicated than that of English, which also affects the way search engines handle them.
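One concrete reason plain matching struggles here is that the same word can be stored with or without vowel marks, and the two spellings are different character sequences. The sketch below, which uses Python's standard `unicodedata` module, strips combining marks so that both spellings compare equal. The helper name `strip_marks` is an assumption for this example. It does nothing about the harder problem of identical spellings with different meanings, which still needs context.

```python
import unicodedata

def strip_marks(text):
    """Remove combining marks (Unicode category Mn), such as Hebrew vowel points."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# The Hebrew word "shalom", written with escapes to keep the marks explicit.
pointed = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"  # with vowel points
unpointed = "\u05e9\u05dc\u05d5\u05dd"                  # everyday spelling

print(unpointed in pointed)               # plain substring matching
print(strip_marks(pointed) == unpointed)  # matching after removing the marks
```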
Mixed-Language Search Query and Cross-language Searching
It is interesting to examine what happens when the search query contains words and expressions in several languages. Since simple text processing is the first stage in retrieving the results, search engines will find where these strings occur regardless of language. Even if you ask for results in a particular language only, this will not exclude sites that contain both the selected language and others. What seems to matter most in this case is that the sentence or paragraph where the search string is found is in the specified language, not that it is the only language on the page. What also seems to matter (and it is not surprising) is the order in which the search terms are arranged.
Mixed-language search queries are an interesting issue, but they are different from cross-language search. Cross-language search is a technology that takes a search query in one language and returns results from pages where the query occurs in other languages. In other words, the search query gets translated, which makes it possible to retrieve significantly more results. Currently, to the best of my knowledge, cross-language search is used in libraries and similar places where there is enough information, and it is not supported by search engines like Google and MSN. There are toolbars and browser add-ons that let you use cross-language search engines on the Web, or at least within a particular site, but they are still not the standard.
NLP Applications for Search Engines
As I mentioned, Natural Language Processing is the second (and more sophisticated) stage in retrieving search results. In some respects search engines always perform a kind of NLP, in order to retrieve only useful search results and to filter out frequently used words that carry no meaning, like “a”, “the”, etc. But the most interesting applications of NLP for web search engines are the kinds of linguistic analysis that can be performed. Basically, there are several linguistic areas where NLP is useful for search engines: phonetics, morphology, syntax, and discourse.
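Filtering out frequent words that carry no meaning can be sketched in a few lines. The word list and the function name `content_terms` below are assumptions made for this example; real engines use their own, language-specific lists.

```python
# Hypothetical stop-word list for English.
STOP_WORDS = {"a", "an", "the", "of", "in", "for"}

def content_terms(query):
    """Return the query words that are not in the stop-word list."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]

print(content_terms("The price of a laptop"))
```

Even this step is language-dependent: the list above is useless for a query in Bulgarian, which is one reason such filtering has to be built separately for each language.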
It is obvious that NLP abilities vary from language to language. Part of this is due to the differences in the languages themselves and part is due to the effort made to develop NLP algorithms for a particular language. Take phonetics, for instance. When a language is phonetic (meaning there are no substantial deviations in the way the same sound is spelled), it is much easier to build an engine for, say, phonetic search.
Morphology deals with the structure and form of words. There are languages with very complex morphological rules, which makes it more difficult to develop a reliable stemming algorithm for search. Stemming is a vital technique for all search engines: it reduces a word to its root and finds all words that have the same root but different suffixes, prefixes, tenses, or plural/singular forms. Needless to say, the more complex the morphological structure of a language, the harder it is to get error-free results. English is a language with relatively easy morphological rules, and this contributes to the wide availability of stemming in search engines. Just imagine how difficult it is to write a stemming algorithm for a language where plurality is formed not by adding an “s” at the end but by changing the root itself, or where a 20-character word is actually half a sentence, because it carries prefixes and suffixes that elsewhere would be separate (definite or indefinite) articles, prepositions, cases, or tenses! I believe that the tough morphological structure of many languages other than English explains why stemming is not as widely offered as a search option for them.
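To make the difficulty concrete, here is a deliberately naive suffix-stripping stemmer for English. It is a sketch for illustration only, with a made-up function name and suffix list, and it is far cruder than the stemmers real search engines use.

```python
def naive_stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ("laptops", "searching", "searched", "men"):
    print(word, "->", naive_stem(word))
```

Regular forms such as “laptops” and “searching” are reduced correctly, but “men” is left as it is and never connected to “man”, because the plural is formed by changing the root. In a language where such changes are the rule rather than the exception, suffix stripping alone gets nowhere.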
Syntax is also used in NLP for search engines. It is especially important in search strings consisting of several words, because the search engine has to guess how the words relate to each other. Syntax also deals with word order in sentences and is related to typing a search question in a natural language instead of using special complex query syntax.
Since discourse involves analysis above the sentence level, it is useful for real in-depth text searches. For instance, when searching for several words, their proximity on the page matters: the closer they are, the more likely they are to be related. Google uses proximity to judge the relevance of a page to the search query. AltaVista offers the NEAR operator to specify that the search terms must be within 10 words of each other.
When discussing Natural Language Processing applications in search engines, spell-checking and synonyms should also be mentioned. Again, these two features are so tightly bound to the particular language that it is not surprising they are unavailable for many languages.
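A proximity test of the NEAR kind can be sketched as follows. This is an assumption about how such a check might look, not AltaVista's actual implementation; the function name `near` and the exact definition of "within 10 words" (a position difference of at most 10) are choices made for this example.

```python
import re

def near(text, first, second, window=10):
    """Return True if the two terms occur within `window` word positions."""
    words = re.findall(r"\w+", text.lower())
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(
        abs(i - j) <= window
        for i in first_positions
        for j in second_positions
    )

print(near("cheap laptop for sale", "cheap", "sale"))
```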
What About the Standard Practices in SEO?
With all this said about NLP, the good practices in SEO are still valid and should be observed. In addition to identifying the language in the <html lang="xx"> tag, another tip is to write the titles and headings in the target language. Also, to achieve a higher ranking in the target language, there must be enough links on the page, and it is almost mandatory that the anchor text is in the target language. If necessary, include two sentences (the translated one and the original) on the same page, if this is the only way to make links to the page in the foreign language.
To make it easier to identify the language of the target page, you can use the hreflang="bg" attribute. Keep in mind, though, that apart from Mozilla, browsers do not fully support this attribute:
<a hreflang="bg" lang="en" href="bulgarian.html">A page in Bulgarian</a>
In a browser that supports it, this attribute shows the language of the target page in an information window about the properties of the link.
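The language hints discussed above are easy for a program to read back out of a page. The sketch below uses Python's standard `html.parser` module to collect the lang attribute of the html tag and the hreflang attribute of links. The class and function names are invented for this example, and nothing here is meant to describe what a particular search engine does with these values.

```python
from html.parser import HTMLParser

class LangCollector(HTMLParser):
    """Collect the page language and the declared language of each link."""

    def __init__(self):
        super().__init__()
        self.page_lang = None
        self.link_langs = []

    def handle_starttag(self, tag, attrs):
        # The parser passes tag and attribute names in lower case.
        attrs = dict(attrs)
        if tag == "html" and "lang" in attrs:
            self.page_lang = attrs["lang"]
        elif tag == "a" and "hreflang" in attrs:
            self.link_langs.append((attrs.get("href"), attrs["hreflang"]))

def collect_languages(markup):
    parser = LangCollector()
    parser.feed(markup)
    parser.close()
    return parser.page_lang, parser.link_langs

sample = (
    '<html lang="en"><body>'
    '<a hreflang="bg" href="bulgarian.html">A page in Bulgarian</a>'
    "</body></html>"
)
print(collect_languages(sample))
```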
Another very important consideration when designing pages is that if you have tables, it is much better to use different cells for different languages. Don't mix languages in a cell. Mixing text in a right-to-left language with English in one cell might be technically impossible anyway, but in all other cases, where it can be done, avoid doing it.