Matching Strings and Algorithms - Similarity Ranking Methods
(Page 4 of 5 )
Similarity Ranking Methods
Similarity ranking methods compare a given string to a set of strings and rank those strings in order of similarity. To produce a ranking, we need a way of saying that one match is better than another. This is done by returning a numeric measure of similarity as the result of each comparison. Alternatively, you can think of the distance between two strings, instead of their similarity. Strings with a large distance between them have low similarity, and vice versa.
Two very common methods for ranking similarity are the Longest Common Sub-string and Edit Distance.
Longest Common SubstringThe longest common substring between two strings is the longest contiguous chain of characters that exists in both strings. The longer the substring, the better the match between the two strings. This simple approach can work very well in practice.
A disadvantage of this approach is that the position of an ‘error’ in the input affects the computed similarity between the two strings. If the error occurs in the middle of the string, then the distance between the two strings will be greater than if the error occurred at one end. For example, suppose we make a simple typing error on the keyboard by pressing the key adjacent to the one intended. With a word such as ‘PINEAPPLE’, typing ‘PINESPPLE’ gives a longest common substring of length 4, whereas ‘OINEAPPLE’ gives a value of 8. The problem is that ‘PINESPPLE’ is deemed to be just as good a match with ‘PINEAPPLE’ as the string ‘PINE’, which is probably not what you want.
Edit Distance
This method focuses on the most common typing errors, namely character omissions, insertions, substitutions and reversals. The idea is to compute the minimum number of such operations that it would take to transform one string into another. This number gives an indication of the similarities of the strings. A value of 0 indicates that the two strings are identical
The algorithm can be described more generally by associating a cost with each of the operations, and deriving the distance between two strings as the minimum cost that transforms one string into another. There are two widely recognized variations of the edit distance. The Levenshtein Edit Distance is the most common variation and allows insertion, deletion or substitution of a single character where the cost of each operation is 1. The Damerau Edit Distance is identical to the Levenshtein edit distance, except that it also allows the operation of transposing (swapping) two adjacent characters.
Implementations of the Levenshtein edit distance can be found on the
here.
Hamming DistanceAlthough I do not recommend the Hamming distance for the majority of string-based information retrieval tasks, I should mention it for completeness. The Hamming distance between two character strings is the number of positions in which the characters of the two strings are different. So, for example, the distance between ‘REPAIR’ and ‘REPOSE’ is 3, whereas the distance between ‘WORK’ and ‘REST’ is 4. (According to the literature, strings of different lengths have an infinite Hamming distance between them, so when comparing strings of different lengths, you may decide to ‘cheat’ by padding the shorter of the two strings on the right with extra spaces before comparing the strings.)
The problem with this metric is that apparently very similar strings can be given a high Hamming distance. Consider that there are no two pairs of characters that match positionally in the following two strings:
• ‘THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG’
• ‘MY QUICK BROWN FOX JUMPED OVER THE LAZY DOGS’
Next: Conclusions >>
More Choosing Keywords Articles
More By Simon White