My interest in string similarity stems from a desire for good user interface design. Computers are seen by many as unfriendly, unforgiving beasts that respond unkindly to requests that are unspecific. In this article, I demonstrate how computers can be programmed to be more forgiving of their users’ mistakes, with no additional burden on the user such as learning a special query format. Moreover, the techniques described are very widely applicable and often easy to implement.
Although my interest is in the user-interface, it is not the only place where such techniques can be employed. For example, the Hamming distance (described later) was traditionally used to recover from low-level bit transfer errors in electronic communications. In the future, I believe some of the techniques could be used to aid communication among independently acting computer programs (intelligent agents) as they try to make sense of what another agent ‘said’. But for now, I would like you to think of realigning an ‘unexpected’ input string with an input that is expected, or known to be valid, in the context of a user-interface.
Let me be a little more concrete. When you enter a search string to look for a book at Amazon.com, your input is matched against the descriptions of known products held in a database. It is quite likely that your input does not exactly match any of the ‘expected’ inputs (that is, book titles or authors) in the database. For example, if you enter the string ‘Web Database Applications’, you would like the search to return the book with the title ‘Web Database Applications with PHP and MySQL’, even though it is not an exact match. And you might also expect to see the same book listed if you entered ‘PHP Web Applications’, or even the misspelling ‘Web Aplications’. The task is therefore to find which of the expected strings (in this case product descriptions) are similar, or perhaps most similar, to the user’s input. There are two main classes of algorithms for matching string similarity, equivalence methods and similarity ranking methods.
Word stemming is a technique that reduces closely related words to a basic canonical form or ‘stem’. For example, the user inputs ‘swims’ and ‘swimming’ can be reduced to the basic stem ‘swim’ before performing an exact match against expected inputs. Stemming makes use of a suffix dictionary that contains lists of possible word endings. However, such a list is clearly language-dependent and even regional differences of the same language must be considered (for example, compare British spelling ‘standardise’ with American spelling ‘standardize’). Also, not all languages lend themselves to such treatment, although it has been demonstrated for most languages of the Indo-European family (which includes Latin-based and Germanic languages). Deriving stemming algorithms is a difficult, time-consuming and error-prone activity. Therefore, for application building, I can only recommend using tools such as Snowball, with its suite of existing stemming algorithms for many languages.
In this approach, synonyms of expected inputs are stored explicitly. For example, with the string ‘television’, you might also store ‘TV’ and ‘televisions’; and with the string ‘license’ you might also store ‘licence’. As with word stemming, user inputs are converted to a canonical form before any further processing.
This mechanism can provide a forgiving user-interface, and is also language independent. Unfortunately, it does mean that many synonym strings must be prepared in advance to anticipate every possible user input. You could argue that the approach simply increases the number of expected inputs, rather than providing a better algorithm to find the strings that are of real interest. However, if you reconsider the problem of retrieving product descriptions from a database, you should see that there are other advantages. Firstly, as synonyms are resolved close to the user-interface, you can index your products in the system’s back end using a small, controlled keyword vocabulary. Secondly, the architecture provides a clean separation between these user-interface concerns and the integrity of the data.
Wildcards and Regular Expressions
When you search for a file on your hard disk, you might use a search pattern with a wildcard such as ‘*.txt’ (anything with the suffix ‘.txt’). In a similar way, applications that perform information retrieval can also employ pattern matching to improve the chances of finding the information of interest. One approach is to expose the full power of regular expressions in the user-interface, but the complex functionality and cryptic syntax usually confuses more than it helps. What we need is a way that harnesses the power of pattern matching without exposing it to the user. One idea is to prepend and append the user’s input with the wild card character, and then use regular expression matching instead of exact matching. This has the effect of searching for all strings that contain the user’s input. Another idea is to take each word (that is, space-separated token) of the input and apply the same wild card prepending and appending. In this case, the input ‘go fish’ would become ‘*go* *fish*’, which matches ‘gone fishing’ as well as ‘go fishing’.
The Soundex Algorithm
The Soundex algorithm is an attempt to match strings that sound alike. The idea is that you take the two strings of the comparison, map each of them to a new string that represents their phonetics, and then compare those strings for an exact match. The algorithm is only intended to work with English pronunciation, and there are plenty of counter-examples, even in English, where it doesn’t work. However, it is easy to implement and, even better, is already available as a pre-programmed function in the Oracle Database Management System. There’s also a good chance that you are able to find an implementation in your favorite programming language by a quick web search.
The algorithm works as follows. When mapping the original strings to their phonetic strings, the first letter is always retained, and the rest of the string is processed in a left to right fashion. The subsequent letters of the string are compressed to a three digit code according to the scheme shown in Table 1. Since the first letter is always retained, the algorithm always generates a 4 digit string. The code ‘0’ is used as padding if there are not enough letters in the input string, and any excess letters are disregarded.
Table 1: Phonetic Codes in the Soundex Algorithm
For example, the strings ‘LICENCE’, ‘LICENSE’ and ‘LICENSING’ all map to the same Soundex string, ‘L252’.
- adjacent pairs of the same consonant are treated as one
- adjacent consonants from the same code group are treated as one
- a consonant immediately following an initial letter from the same code group is ignored
- consonants from the same code group separated by W or H are treated as one.
The Soundex algorithm is also very old (it is documented in Donald Knuth’s “The Art of Computer Programming”, from 1973, but attributed to 1918 and 1922 U.S. Patents by Margaret K. Odell and Robert C. Russell). A more recent attempt at the same problem, called MetaPhone, dates from 1990 and allegedly gives better results. There is a description of MetaPhone on the web, and you can also test the algorithm online against databases of names and place names.
Similarity Ranking Methods
Similarity ranking methods compare a given string to a set of strings and rank those strings in order of similarity. To produce a ranking, we need a way of saying that one match is better than another. This is done by returning a numeric measure of similarity as the result of each comparison. Alternatively, you can think of the distance between two strings, instead of their similarity. Strings with a large distance between them have low similarity, and vice versa.
Two very common methods for ranking similarity are the Longest Common Sub-string and Edit Distance.
Longest Common Substring
The longest common substring between two strings is the longest contiguous chain of characters that exists in both strings. The longer the substring, the better the match between the two strings. This simple approach can work very well in practice.
A disadvantage of this approach is that the position of an ‘error’ in the input affects the computed similarity between the two strings. If the error occurs in the middle of the string, then the distance between the two strings will be greater than if the error occurred at one end. For example, suppose we make a simple typing error on the keyboard by pressing the key adjacent to the one intended. With a word such as ‘PINEAPPLE’, typing ‘PINESPPLE’ gives a longest common substring of length 4, whereas ‘OINEAPPLE’ gives a value of 8. The problem is that ‘PINESPPLE’ is deemed to be just as good a match with ‘PINEAPPLE’ as the string ‘PINE’, which is probably not what you want.
This method focuses on the most common typing errors, namely character omissions, insertions, substitutions and reversals. The idea is to compute the minimum number of such operations that it would take to transform one string into another. This number gives an indication of the similarities of the strings. A value of 0 indicates that the two strings are identical
The algorithm can be described more generally by associating a cost with each of the operations, and deriving the distance between two strings as the minimum cost that transforms one string into another. There are two widely recognized variations of the edit distance. The Levenshtein Edit Distance is the most common variation and allows insertion, deletion or substitution of a single character where the cost of each operation is 1. The Damerau Edit Distance is identical to the Levenshtein edit distance, except that it also allows the operation of transposing (swapping) two adjacent characters.
Implementations of the Levenshtein edit distance can be found on the here.
Although I do not recommend the Hamming distance for the majority of string-based information retrieval tasks, I should mention it for completeness. The Hamming distance between two character strings is the number of positions in which the characters of the two strings are different. So, for example, the distance between ‘REPAIR’ and ‘REPOSE’ is 3, whereas the distance between ‘WORK’ and ‘REST’ is 4. (According to the literature, strings of different lengths have an infinite Hamming distance between them, so when comparing strings of different lengths, you may decide to ‘cheat’ by padding the shorter of the two strings on the right with extra spaces before comparing the strings.)
The problem with this metric is that apparently very similar strings can be given a high Hamming distance. Consider that there are no two pairs of characters that match positionally in the following two strings:
• ‘THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG’
• ‘MY QUICK BROWN FOX JUMPED OVER THE LAZY DOGS’
Many information retrieval systems use a string-based query, often just a single string, to find the information of interest to its user. The ability of the system to find relevant information based on the user’s input is key to a successful system. This ability can be significantly enhanced by employing an approximate string matching algorithm (which need not be invoked until it is known that no exact matches exist). Conversely, failure to find relevant information (particularly when the user knows it to be present) serves only to frustrate, and perpetuate the myth of the cantankerous computer.
I described the algorithms in two classes: equivalence methods and similarity ranking methods. Equivalence methods return a Boolean result, whereas the similarity ranking methods return a numeric similarity measure or distance metric. In information retrieval systems, it is possible to mix methods to produce a faster hybrid approach. A typical approach is to employ a two-pass mechanism in which an equivalence method is used by the database as a first pass filter, and a ranked similarity method is applied to the filtered entries for the second pass. Ranked similarity methods tend to be algorithmically more complex than equivalence methods, so are usually implemented as custom code outside of the database.
When choosing an algorithm to use, there are several criteria that will influence your choice of algorithm. For example, what kinds of mismatch are you attempting to recover from? Are you trying to recover from typing errors? Or are you trying to find ‘sound-alike’ or look-alike strings? Do the users of the system all speak the same language, or does the method need to be language independent? Do the results need to be ranked in order of similarity? How many strings will the algorithm have to compare, and how fast must it run? Is a two-pass mechanism appropriate or necessary?
Lastly, you might imagine that this area of computing has been so well explored that the best algorithms have already been found and are well-known. However, it is still a research area and I also wouldn’t be at all surprised if the teams at Google are working on novel approximate string-matching algorithms right now!