Choosing Keywords

Matching Strings and Algorithms
By: Simon White
    2004-03-15

    Table of Contents:
  • Matching Strings and Algorithms
  • Equivalence Methods
  • Wildcards and Regular Expressions
  • Similarity Ranking Methods
  • Conclusions


    Wildcards and Regular Expressions
    When you search for a file on your hard disk, you might use a search pattern with a wildcard such as ‘*.txt’ (anything with the suffix ‘.txt’). In a similar way, applications that perform information retrieval can employ pattern matching to improve the chances of finding the information of interest. One approach is to expose the full power of regular expressions in the user interface, but the complex functionality and cryptic syntax usually confuse more than they help. What we need is a way to harness the power of pattern matching without exposing it to the user.

    One idea is to add the wildcard character before and after the user’s input, and then use regular expression matching instead of exact matching. This has the effect of searching for all strings that contain the user’s input. Another idea is to take each word (that is, each space-separated token) of the input and wrap it in wildcards in the same way. In this case, the input ‘go fish’ becomes ‘*go* *fish*’, which matches ‘gone fishing’ as well as ‘go fishing’.
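The two wildcard ideas above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names `contains_pattern` and `per_word_pattern` are invented for this example, the wildcard ‘*’ is translated to the regular expression ‘.*’, and matching is made case-insensitive, which the article does not specify.

```python
import re


def contains_pattern(user_input: str) -> re.Pattern:
    """Wrap the whole input in wildcards: matches any string containing it."""
    # re.escape keeps characters such as '.' or '*' typed by the user literal,
    # so the user never has to know regular expression syntax.
    return re.compile(".*" + re.escape(user_input) + ".*", re.IGNORECASE)


def per_word_pattern(user_input: str) -> re.Pattern:
    """Wrap each space-separated token in wildcards: 'go fish' -> '*go* *fish*'."""
    tokens = user_input.split()
    parts = [".*" + re.escape(token) + ".*" for token in tokens]
    return re.compile(" ".join(parts), re.IGNORECASE)


if __name__ == "__main__":
    whole = contains_pattern("go fish")
    words = per_word_pattern("go fish")
    for candidate in ("go fishing", "gone fishing", "fishing gone"):
        print(candidate,
              bool(whole.fullmatch(candidate)),
              bool(words.fullmatch(candidate)))
```

With these definitions, the whole-input pattern accepts ‘go fishing’ but not ‘gone fishing’, while the per-word pattern accepts both. Neither accepts ‘fishing gone’, because the tokens must still appear in the original order.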

    The Soundex Algorithm
    The Soundex algorithm is an attempt to match strings that sound alike. The idea is to take the two strings being compared, map each of them to a new string that represents its phonetics, and then compare those phonetic strings for an exact match. The algorithm is intended to work only with English pronunciation, and there are plenty of counter-examples, even in English, where it doesn’t work. However, it is easy to implement and, even better, is already available as a built-in function (SOUNDEX) in the Oracle database management system. You can probably also find an implementation in your favorite programming language with a quick web search.

    The algorithm works as follows. When mapping an original string to its phonetic string, the first letter is always retained, and the rest of the string is processed from left to right. The subsequent letters are compressed to a three-digit code according to the scheme shown in Table 1. Since the first letter is always retained, the algorithm always generates a four-character string: the initial letter followed by three digits. The code ‘0’ is used as padding if there are not enough letters in the input string, and any excess letters are disregarded.


    Letter                     Phonetic Code
    B, F, P, V                 1
    C, G, J, K, Q, S, X, Z     2
    D, T                       3
    L                          4
    M, N                       5
    R                          6
    A, E, I, O, U, Y, H, W     not coded

    Table 1: Phonetic Codes in the Soundex Algorithm

    For example, the strings ‘LICENCE’, ‘LICENSE’ and ‘LICENSING’ all map to the same Soundex string, ‘L252’.

    The following additional rules apply:

    • adjacent pairs of the same consonant are treated as one
    • adjacent consonants from the same code group are treated as one
    • a consonant immediately following an initial letter from the same code group is ignored
    • consonants from the same code group separated by W or H are treated as one.
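
The mapping in Table 1 and the rules listed above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names `CODES` and `soundex` are chosen for this example, and published Soundex implementations differ in small details, so results may not agree with every library or database in every case.

```python
# Table 1 as a lookup: each coded consonant maps to its digit.
# Vowels, Y, H and W are deliberately absent ("not coded").
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(text: str) -> str:
    """Return a four-character Soundex code, or '' if text has no letters A-Z."""
    # Anything outside 'A'-'Z' is ignored.
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    result = letters[0]                    # the first letter is always retained
    previous = CODES.get(letters[0], "")   # so a same-group follower is skipped
    for ch in letters[1:]:
        if ch in "HW":
            continue                       # H and W do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code                 # new consonant group: emit its digit
        previous = code                    # a vowel resets this to ""
    return (result + "000")[:4]            # pad with '0', drop excess digits


if __name__ == "__main__":
    for word in ("LICENCE", "LICENSE", "LICENSING", "Robert", "Rupert"):
        print(word, soundex(word))
```

Here ‘LICENCE’, ‘LICENSE’ and ‘LICENSING’ all produce ‘L252’, and ‘Robert’ and ‘Rupert’ both produce ‘R163’. Comparing two strings for phonetic equivalence is then a matter of testing `soundex(a) == soundex(b)`.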

    The Soundex algorithm is interesting because it addresses the pronunciation of words rather than raw lexical similarity. Its main drawbacks are that it is language dependent and that many similar-sounding strings nevertheless produce different Soundex codes. It also compares only alphabetic characters: anything outside the range ‘A’-‘Z’ is simply ignored.

    The Soundex algorithm is also very old: it is documented in Donald Knuth’s “The Art of Computer Programming” (1973), where it is attributed to 1918 and 1922 U.S. patents by Margaret K. Odell and Robert C. Russell. A more recent attempt at the same problem, called Metaphone, dates from 1990 and reportedly gives better results. There is a description of Metaphone on the web, and you can also test the algorithm online against databases of names and place names.

    © 2003-2009 by Developer Shed. All rights reserved.