Search Engines and Algorithms: Semantic Search

In the final article of this series, I want to take a more in-depth look at a type of search we mentioned in one of the earlier articles: Semantic Search. A semantics engine is one that attempts to make sense of the results of a search query in context. Currently, the sense engine is being used to deliver advertising, but eventually we’d like to see this concept used in a free search engine.

Crystal Semantics is the developer of Textonomy Advance, the world’s first Sense Engine-based technology. This search tool “…operates by applying human ‘senses’ and concepts that current algorithms, semantic systems and other statistical techniques cannot match. Carefully built by human experts, Crystal Semantics’ unique semantic network provides an understanding of the ‘senses’ of words and terms and the true linguistic relationship between them,” as they say on their website.

“It is the result of 8 years and $8 million investment in research and development, the products were devised by Dr. David Crystal, one of the world’s leading linguists and chair of Crystal Semantics. Among the applications for the products are the context targeting of advertising, categorization of content such as RSS feeds and blogs, and search and navigation.”

Unlike current search technologies, which are based solely on statistical algorithms, Textonomy uses techniques from linguistic science to determine the semantic relationships between words and the contexts in which they occur. What are these semantic relationships, and how are they integrated? The Textonomy Advance engine relates dictionary definitions, drawn from the entire content of a college dictionary, to encyclopedia categorizations derived from multiple reference sources.

The ‘sense engine’ that drives Textonomy is the outcome of a long-term development program in search-linguistics. It started with a program to develop a classification system, created to handle data being compiled for the first edition of The Cambridge Encyclopedia, and was soon extended to deal with the various encyclopedias and fact finders published by Cambridge University Press and later by Penguin Books.

At the time, the database was owned by Cambridge University Press, but in 1997 they sold it to the Dutch electronic publishers, AND, who began to develop it for online use. During the next four years, the classification system was expanded into a ‘global data model’, inspiring several applications in document classification and search-engine technology. When AND went out of business in 2001, the database was acquired by Crystal Reference Systems, founded exclusively to advance the global data model and evolve the central concept of the ‘sense engine’. It is one of the largest semantic networks, growing constantly under the supervision of Professor Crystal and his highly experienced editorial team.

In order to understand the search linguistics that Crystal Semantics employs, and why it is in theory better than current search engines, let’s discuss a few search terms that you may or may not be familiar with.

Boolean Search

A Boolean Search is a combination of terms allowing the inclusion or exclusion from search results of documents containing certain words. This is achieved through the use of operators such as AND, NOT and OR.

The Boolean operators, and how they are used, are as follows:

  • AND or the plus (+) sign – Two or more terms or phrases must all be in the description; AND is the default operator.

  • OR – At least one of the specified terms must be in the description.

  • NOT or the minus (-) sign – The specified term or phrase is excluded from the search.

Boolean Search is probably the simplest of the search matching techniques. A good example is what happens when you use any of the major search engines, like Google or Yahoo, with multiple words: the operator AND is assumed, so the engine searches for all of the terms. For example, if I search for the phrase buy plasma TV online, it is assumed that I want every word to match, and all pages that contain the words buy, plasma, TV, and online will be returned.

Another example would be if I wanted to exclude a term from my search: buy plasma TV online -Sony. Now the search algorithm understands that it should return all the relevant results that include the words buy, plasma, TV, and online, but exclude the pages that contain the word Sony.
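To make the AND and NOT behavior concrete, here is a minimal sketch in Python. It assumes a very simple model in which a page matches when it contains every plain (or +) term and none of the terms prefixed with a minus sign. The boolean_search function and the sample pages are made up for illustration; real search engines do far more than this.

```python
def boolean_search(query, documents):
    """Return documents that contain every required term and no excluded term.

    Plain terms and +terms are treated as AND; -terms are treated as NOT.
    Matching is on whole words, ignoring case.
    """
    required, excluded = set(), set()
    for token in query.lower().split():
        if token.startswith("-") and len(token) > 1:
            excluded.add(token[1:])
        else:
            term = token.lstrip("+")
            if term:
                required.add(term)

    results = []
    for doc in documents:
        words = set(doc.lower().split())
        # AND: all required terms present; NOT: no excluded term present.
        if required <= words and not (excluded & words):
            results.append(doc)
    return results


# Illustrative sample pages (not real data).
pages = [
    "buy a plasma TV online today",
    "buy a Sony plasma TV online",
    "plasma TV reviews",
]

print(boolean_search("buy plasma TV online -Sony", pages))
```

Only the first sample page survives this query: the second contains the excluded word Sony, and the third is missing buy and online.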

It is rare to find a search engine that does not support Boolean search, and most apply it automatically, without your having to enter the Boolean operators. Once in a while, though, you’ll find a search where Boolean matching isn’t performed automatically. One I used recently was a forum search, in which I had to use the plus (+) sign to require all of my terms in the results; the search supported Boolean search, it just wasn’t automatic.

Wildcard Search

Many advanced search engines allow for Wildcard Searching. Wildcards, usually in the form of an asterisk (*) or a question mark (?), are used to substitute for possible letters to make up the spelling of a word.

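A single-character wildcard search looks for terms that match when one character is replaced. For example, to search for text or test, you can use the search: te?t. In many engines the asterisk instead stands for any number of characters, so te*t would match those two words along with longer ones.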
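Here is a minimal sketch of wildcard matching in Python. It assumes the common convention in which ? stands for exactly one character and * for any run of characters, including none; individual search engines may treat the symbols differently. It relies on Python's standard fnmatch module, and the wildcard_search function and word list are made up for illustration.

```python
from fnmatch import fnmatchcase


def wildcard_search(pattern, words):
    """Return the words that match a wildcard pattern, ignoring case.

    '?' matches exactly one character and '*' matches any run of characters.
    Note that fnmatch also treats [...] as a character class.
    """
    pattern = pattern.lower()
    return [word for word in words if fnmatchcase(word.lower(), pattern)]


# Illustrative vocabulary (not real index data).
vocabulary = ["text", "test", "tent", "tempest", "tet"]

print(wildcard_search("te?t", vocabulary))  # ['text', 'test', 'tent']
print(wildcard_search("te*t", vocabulary))  # all five words
```

The difference between the two results shows why the choice of wildcard matters: the question mark keeps the word length fixed, while the asterisk does not.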

Proximity Search

Some search engines support finding words that are within a specific distance of each other. To do a Proximity Search, you use the tilde (~) symbol at the end of a phrase. For example, to search for greenhouse and carbon within 10 words of each other in a description, use the search: greenhouse carbon~10
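The idea behind a proximity check can be sketched in a few lines of Python. This sketch assumes that "distance" means the difference between the positions of the two words in the text; actual engines may count distance differently. The within_proximity function and the sample sentence are made up for illustration.

```python
def within_proximity(text, term_a, term_b, max_distance):
    """Return True if term_a and term_b occur within max_distance words.

    Distance is measured as the difference between word positions,
    after stripping surrounding punctuation and ignoring case.
    """
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    positions_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    positions_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    return any(
        abs(a - b) <= max_distance
        for a in positions_a
        for b in positions_b
    )


# Illustrative description (not real data).
description = "Greenhouse gases such as carbon dioxide trap heat."

print(within_proximity(description, "greenhouse", "carbon", 10))  # True
print(within_proximity(description, "greenhouse", "carbon", 3))   # False
```

In the sample sentence the two words are four positions apart, so they match within 10 words but not within 3.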

Fuzzy Search

You may not be familiar with the concept of “Fuzzy Searches.” A Fuzzy Search is a process that locates web pages that are likely to be relevant to a search argument even when the argument does not exactly correspond to the desired information. A Fuzzy Search is done by means of a Fuzzy Matching program, which returns a list of results based on likely relevance even though search argument words and spellings may not exactly match. Exact and highly relevant matches appear near the top of the list. Subjective relevance ratings, usually as percentages, may be given.

A Fuzzy Matching program can operate like a spell checker and spelling-error corrector. For example, if a user types “Misissippi” into Yahoo or Google (both of which use Fuzzy Matching), a list of hits is returned along with the question, “Did you mean Mississippi?” Alternative spellings, and words that sound the same but are spelled differently, are given. A Fuzzy Matching program can compensate for common input typing errors, as well as errors introduced by optical character recognition (OCR) scanning of printed documents.
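One simple way to produce a "Did you mean" suggestion is to compare the typed word against a list of known words using edit distance, which is the number of single-character insertions, deletions, and substitutions needed to turn one word into another. The sketch below is written under that assumption; it is not how Yahoo or Google actually implement their suggestions, and the did_you_mean function, the max_distance threshold, and the word list are made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def did_you_mean(query, vocabulary, max_distance=2):
    """Suggest the closest known word, or None if the query is already
    known or nothing is within max_distance edits."""
    query = query.lower()
    scored = sorted((edit_distance(query, w.lower()), w) for w in vocabulary)
    if not scored:
        return None
    best_distance, best_word = scored[0]
    if best_distance == 0 or best_distance > max_distance:
        return None
    return best_word


# Illustrative vocabulary (not real index data).
known_words = ["Mississippi", "Missouri", "Minnesota"]

print(did_you_mean("Misissippi", known_words))  # Mississippi
```

The misspelling is one inserted letter away from Mississippi, so it is suggested, while a correctly spelled query returns no suggestion at all.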

Fuzzy Matching programs usually return irrelevant hits as well as relevant ones. Superfluous results are likely to occur for terms with multiple meanings, only one of which is the meaning the user intends. If the user has only a vague or general idea of the topic, or does not know exactly what to look for, the ratio of relevant hits to irrelevant hits tends to be low.

Fuzzy Searching is much more powerful than exact searching when used for research and investigation. Fuzzy Searching is especially useful when researching unfamiliar, foreign-language, or sophisticated terms, the proper spellings of which are not widely known. Fuzzy Searching can also be used to locate individuals based on incomplete or partially inaccurate identifying information.

Historically, to perform a Fuzzy Search, you’d have to include all the variations of a word in the search box manually, including singular and plural variants, as well as misspellings.

However, search engines are becoming better at incorporating plurals, or suggesting variants (like Google does when you misspell a word in a search query), and many perform Fuzzy Searches automatically. eBay’s search engine, Voyager, automatically includes common plurals and known alternate misspellings of words.

In engines that support Fuzzy Search, you can often request variants explicitly by adding a tilde (~) to the end of a word, which includes variations of that keyword in the results.

Contextual Search

When you are in conversation with someone, they understand the slight differences in meaning of the words you are saying, based upon the context of the conversation, or the other words spoken before and after. Contextual Search attempts to mimic human conversation by getting the gist of the words around the particular search term. It is in Contextual Search that Crystal Semantics’ Textonomy engine places its focus. This concept is similar in part to Fuzzy Searching, but differs in that it takes the context of a web page as a whole to deliver relevant content, and not just the word itself.

As I mentioned before, this search engine is currently only being used for advertising purposes. It works by analyzing the content of web pages and returning the subject of the page. “It is simple to integrate and in a matter of minutes, our clients can enjoy the benefits of enhancing their ad click-through rates and delivering relevant content,” managing director Ian Saunders stated in a press release in late September 2005. Saunders added, “It is apparent that while Contextual Advertising is extremely popular, the current solutions are far too simplistic in their approach resulting in irrelevant ads and other content. Crystal Semantics technology enables marketers to deliver any content in true context and without forcing the end user to compromise on their viewing experience or privacy.”

Current Contextual Advertising merely presents ads based on the presence of particular keywords on a page. The obvious problem with this type of system is that keywords could easily be taken out of context, and cause the system to deliver ads that are irrelevant. The consequences of this problem could easily be seen; for example, a company selling a new kitchen knife could have their ad appear next to a news story about a stabbing death, which could be devastating to ad revenue. So with Textonomy Advance, Crystal Semantics goes beyond the prevalent concept of presenting ads based on keywords alone by analyzing the content of a web page, and drawing upon human understanding of language, or linguistics, to determine the relevant ads and the appropriate audience.
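The kitchen knife problem can be illustrated with a toy sketch in Python. This is emphatically not Crystal Semantics' method, which is not described in detail here; it only assumes a tiny, hand-written list of "senses" for one keyword and picks the sense whose context words overlap most with the page. The SENSES table, the function names, and the sample pages are all hypothetical.

```python
# Hypothetical, hand-written sense inventory for illustration only.
# Each sense of a keyword lists context words that tend to appear near it.
SENSES = {
    "knife": {
        "cooking": {"kitchen", "chef", "chop", "recipe", "vegetables"},
        "crime": {"stabbing", "police", "victim", "arrested"},
    }
}


def tokenize(text):
    """Lowercase words with surrounding punctuation removed."""
    return {w.strip(".,;:!?\"'()").lower() for w in text.split()}


def page_sense(keyword, page_text):
    """Guess which sense of keyword a page uses, or None if unclear."""
    words = tokenize(page_text)
    senses = SENSES.get(keyword)
    if not senses or keyword not in words:
        return None
    scores = {name: len(context & words) for name, context in senses.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def should_show_ad(keyword, ad_sense, page_text):
    """Show the ad only when the page uses the keyword in the ad's sense."""
    return page_sense(keyword, page_text) == ad_sense


# Illustrative pages (not real data).
recipe_page = "A sharp kitchen knife helps the chef chop vegetables quickly."
news_page = "Police arrested a man after a stabbing; the victim was attacked with a knife."

print(should_show_ad("knife", "cooking", recipe_page))  # True
print(should_show_ad("knife", "cooking", news_page))    # False
```

Both pages contain the keyword knife, so a keyword-only system would place the ad on each of them; looking at the surrounding words is what separates the recipe from the news story.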

“Search engines, advertisers and online retailers must recognize that most English words have a variety of different meanings and contexts, not to mention various dialects and slang terms, that need to be accounted for in order to compile and deliver accurate results,” said Professor Crystal. “Unfortunately, search techniques in wide use today fail to adequately take these subtleties into account; resulting in clutter for end users and inaccurate targeting by advertisers.”

Delivering relevant results still seems to be a big problem for search engines. Irrelevant results are the number one reason people fail to find the information they need on the internet, and also the top reason pay-per-click advertising can be ineffectual and bring in unqualified traffic. Search engines have made good use of Boolean, Fuzzy, and even Wildcard searches, but there is only so much these technologies can accomplish on their own. I feel that Contextual Search is the next step for search engines to take. I will be watching this closely, in hopes that similar concepts will be incorporated into a public search engine very soon.
