Multilingual Sites and Search Engines: Part II - NLP Applications and Practices in SEO
NLP Applications for Search Engines
As I mentioned, Natural Language Processing (NLP) is the second (and more sophisticated) stage in retrieving search results. In most cases search engines perform some kind of NLP in order to retrieve only useful search results and to filter out frequently used words that carry little meaning, like "a", "the", etc. But the most interesting applications of NLP for web search engines are the kinds of linguistic analysis that can be performed. Basically, there are several linguistic areas where NLP is useful for search engines: phonetics, morphology, syntax, and discourse.
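The filtering of frequently used words can be illustrated with a minimal Python sketch. The word list and the function name `filter_stop_words` are made up for this example; real engines rely on much larger, language-specific lists.

```python
# Toy stop-word list for English; an assumption for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "to"}


def filter_stop_words(query):
    """Lowercase the query, split it on whitespace, and drop stop words."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]


print(filter_stop_words("The history of a language"))  # ['history', 'language']
```

Note that the list itself is language-dependent: a separate list would be needed for every language the site targets.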
It is obvious that NLP capabilities vary from language to language. Part of this is due to differences in the languages themselves and part is due to the effort put into developing NLP algorithms for each particular language. Take phonetics, for instance. When a language is phonetic (meaning there are no substantial deviations in the way the same sound is spelled), it is much easier to build an engine for, say, phonetic search.
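To give an idea of what phonetic search involves, here is a small sketch of a Soundex-style key. Soundex is my choice for the illustration, not something any particular search engine is known to use, and its letter groups were designed for English names, which is exactly why such an approach does not transfer automatically to other languages.

```python
# Soundex letter groups: consonants that sound alike share a digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return a four-character Soundex key (first letter plus three digits)."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    key = word[0].upper()
    previous = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        # Skip uncoded letters and repeats of the previous code.
        if code and code != previous:
            key += code
        # "h" and "w" do not separate two letters that share a code.
        if ch not in "hw":
            previous = code
    return (key + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Two words match phonetically when their keys are equal, so a search for "Rupert" would also find "Robert" under this scheme.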
Morphology deals with the structure and form of words. There are languages with very complex morphological rules, which makes it more difficult to develop a reliable stemming algorithm. Stemming is a vital technique for all search engines: it reduces a word to its root and finds all words that have the same root but different suffixes, prefixes, tenses, or even plural/singular forms. Needless to say, the more complex the morphological structure of a language, the more difficult it is to get error-free results. English is a language with relatively easy morphological rules, and this contributes to the wide availability of stemming in search engines. Just imagine how difficult it is to make a stemming algorithm for a language where the plural is formed not by adding an "s" at the end but by changing the root itself, or where a 20-character word is actually half a sentence, because it carries prefixes and suffixes that express what would normally be separate (definite or indefinite) articles, prepositions, cases or tenses! I believe that the tough morphological structure of many languages (English excluded) explains why stemming is not as popular a search option for them.
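A deliberately naive suffix stripper shows both why stemming is easy for regular English forms and where it breaks down. The suffix list and the function name `naive_stem` are assumptions for this sketch; production stemmers use far more elaborate rules.

```python
# Suffixes are tried in this order, longest first.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip one common English suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print(naive_stem("searching"), naive_stem("searched"), naive_stem("searches"))
# search search search
print(naive_stem("foot"), naive_stem("feet"))
# foot feet
```

The three forms of "search" collapse to one root, but "foot" and "feet" stay apart because the plural changes the root instead of adding a suffix. In a language where that is the rule rather than the exception, this kind of approach fails most of the time.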
Syntax is also used in NLP for search engines. It is especially important in search strings consisting of several words, because the search engine has to guess how the words relate to each other. Syntax also deals with word order in sentences, which matters when users type a search question in natural language instead of using a special, complex query syntax.
Since discourse involves analysis above the sentence level, it is useful for real in-depth text searches. For instance, when a query contains several words, their proximity on the page is taken as a signal: the closer the words are, the more likely they are related. Google uses proximity to judge the relevance of a page to a search query. AltaVista lets you use the NEAR operator to specify that the search terms must be within 10 words of each other.
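The idea behind a NEAR-style proximity check can be sketched in a few lines of Python. This is not AltaVista's implementation; the function name `within_distance`, the tokenization, and the way distance is counted (difference between word positions) are all assumptions made for the example.

```python
def within_distance(text, term_a, term_b, max_distance=10):
    """Return True if term_a and term_b occur within max_distance words."""
    # Crude tokenization: split on whitespace and trim common punctuation.
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    positions_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    positions_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    return any(abs(i - j) <= max_distance
               for i in positions_a for j in positions_b)


page = "Search engines use proximity to rank pages."
print(within_distance(page, "search", "pages"))                  # True
print(within_distance(page, "search", "pages", max_distance=3))  # False
```

Even this simple check depends on splitting text into words, which is straightforward for English but much harder for languages written without spaces between words.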
When discussing Natural Language Processing applications in search engines, spell-checking and synonyms should also be mentioned. Again, these two features are so tightly tied to the particular language that it is not surprising they are unavailable for many languages.
What About the Standard Practices in SEO?
All this about NLP said, the good practices in SEO are still valid and should be observed. In addition to identifying the language with the lang attribute, as in <html lang="xx">, another tip is to have the titles and headings in the target language. Also, to achieve higher ranking in the target language, there must be enough links on the page and it is almost mandatory that the anchor text is in the target language. If necessary, include two sentences (the translated one and the original) on the same page, if this is the only way to make links to the page in the foreign language.
To make it easier to identify the language of the link target, you can use the hreflang attribute, for example hreflang="bg". But keep in mind that, apart from Mozilla, browsers do not fully support this attribute:
<A hreflang="bg" lang="en" href="bulgarian.html">A page in Bulgarian</A>
This attribute lets the browser show the language of the target page in an information window about the properties of the link.
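If you want to check which links on a page declare a target language, a short script using Python's standard html.parser module can list them. The class name `HreflangCollector` and the helper `collect_hreflang` are made up for this sketch, and it only looks at anchor tags.

```python
from html.parser import HTMLParser


class HreflangCollector(HTMLParser):
    """Collect (href, hreflang) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag == "a":
            attr_map = dict(attrs)
            if "hreflang" in attr_map:
                self.links.append((attr_map.get("href"), attr_map["hreflang"]))


def collect_hreflang(html):
    """Return a list of (href, hreflang) tuples found in the markup."""
    parser = HreflangCollector()
    parser.feed(html)
    parser.close()
    return parser.links


snippet = '<A hreflang="bg" lang="en" href="bulgarian.html">A page in Bulgarian</A>'
print(collect_hreflang(snippet))  # [('bulgarian.html', 'bg')]
```

Links that come back without an entry are candidates for adding the attribute.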
Another very important consideration when designing pages: if you have tables, it is much better to use different cells for different languages. Don't mix languages in a cell. For a right-to-left language and English, mixing content in one cell might be technically impossible anyway, but in all other cases, where it can be done, avoid doing it.