Matching Strings and Algorithms - Equivalence Methods
(Page 2 of 5 )
Equivalence methods compare two strings and return a value of true or false according to whether the method deems those two strings to be, in some sense, equivalent. In terms of user-interface design, your application can be more forgiving of user inputs if it accepts equivalent strings instead of only exact matches. A simple example of equivalence is to treat ‘Tweetle-Beetle Battle’ the same as ‘TWEETLE BEETLE BATTLE’ despite the differences in case, and the replacement of a hyphen with a space in the second string.
Word Stemming
Word stemming is a technique that reduces closely related words to a basic canonical form or ‘stem’. For example, the user inputs ‘swims’ and ‘swimming’ can be reduced to the basic stem ‘swim’ before performing an exact match against expected inputs. Stemming makes use of a suffix dictionary that contains lists of possible word endings. However, such a list is clearly language-dependent and even regional differences of the same language must be considered (for example, compare British spelling ‘standardise’ with American spelling ‘standardize’). Also, not all languages lend themselves to such treatment, although it has been demonstrated for most languages of the Indo-European family (which includes Latin-based and Germanic languages). Deriving stemming algorithms is a difficult, time-consuming and error-prone activity. Therefore, for application building, I can only recommend using tools such as
Snowball, with its suite of existing stemming algorithms for many languages.
Synonyms
In this approach, synonyms of expected inputs are stored explicitly. For example, with the string ‘television’, you might also store ‘TV’ and ‘televisions’; and with the string ‘license’ you might also store ‘licence’. As with word stemming, user inputs are converted to a canonical form before any further processing.
This mechanism can provide a forgiving user-interface, and is also language independent. Unfortunately, it does mean that many synonym strings must be prepared in advance to anticipate every possible user input. You could argue that the approach simply increases the number of expected inputs, rather than providing a better algorithm to find the strings that are of real interest. However, if you reconsider the problem of retrieving product descriptions from a database, you should see that there are other advantages. Firstly, as synonyms are resolved close to the user-interface, you can index your products in the system’s back end using a small, controlled keyword vocabulary. Secondly, the architecture provides a clean separation between these user-interface concerns and the integrity of the data.
Next: Wildcards and Regular Expressions >>
More Choosing Keywords Articles
More By Simon White