a couple of days ago i was wondering how the soundex() function identifies words which are phonetically similar. i also wondered whether the function takes the actual pronunciation of english words into account. after all, english is a funny language at times: there are plenty of words where the "t" is silent, and yet you can find words where you pronounce an invisible "t".

this curiosity made me dig into the return values of the soundex() function, so i tried it on different words by trial and error to see what it returned.

my conclusions are here for your examination:

(a) the soundex() function always returns a four-character string.

(b) the soundex() function does not take dictionary pronunciation into account. it is the spelling of the word that is evaluated.

(c) the entire alphabet is classified into 7 sets of letters, numbered 0 to 6 and listed below.

0 > a e h i o u w y and all non-alphabetic characters (space, comma, digits etc.)

1 > b f p v

2 > c g j k q s x z

3 > d t

4 > l

5 > m n

6 > r
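the table above can be written down as a small lookup. this is only a python sketch of the grouping as i have listed it, not the internal code of soundex(); the names GROUPS, CODE_OF and code_of are made up for illustration.

```python
# letter sets as listed above; anything not listed here falls into set "0"
GROUPS = {
    "1": "bfpv",
    "2": "cgjkqsxz",
    "3": "dt",
    "4": "l",
    "5": "mn",
    "6": "r",
}

# invert the table: one entry per letter, pointing at its set number
CODE_OF = {letter: digit for digit, letters in GROUPS.items() for letter in letters}


def code_of(ch):
    """return the set number of one character as a string.

    vowels, h, w, y and all non-alphabetic characters give "0".
    """
    return CODE_OF.get(ch.lower(), "0")


# the letters of "foxite" after the first one
print("".join(code_of(ch) for ch in "oxite"))  # prints 02030
```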

(d) to find the soundex value of an input character string, follow the steps below.

as an example, let's assume that "foxite" is our input string.

  • (i) the first letter of the input string is kept as it is in the return value. so the return value will begin with "f"
  • (ii) every character from the second letter onwards is replaced by the number of its set. so we have "f02030"
  • (iii) now keep the first letter and take the first 3 non-zero, non-repeating values after it. so we have "f23"

    the non-repeating rule: when two or more consecutive characters carry the same numeric code, with no different value in between, the code counts only once. for example, the word "foxxite" becomes f022030 after step (ii), so we take only one of the 2's and the value after step (iii) is "f23". however, if the word were "foxoxite", the conversion would be f0202030 after step (ii) and "f223" after step (iii), because the 0 between the two 2's separates them.

  • (iv) the final step is to pad the return string on the right (padr) with 0 to make it four characters wide.

    thus the soundex() return value of "foxite" is "f230"

    thus the soundex() return value of "foxide" is also "f230"

    thus the soundex() return value of "focsite" is also "f230"

    thus the soundex() return value of "foczite" is also "f230"

    thus the soundex() return value of "fauccjite" is also "f230"

    some more examples:

    "visual" -> v02004 -> "v240"

    "foxpro" -> f02160 -> "f216"

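steps (i) to (iv) can be put together in a few lines. what follows is a python sketch of the rules exactly as i have described them, not the actual implementation inside soundex(); the name soundex_sketch is hypothetical, the output is kept lowercase only to match the notation used here, and the handling of an empty string is my own assumption since i did not test it.

```python
# set numbers from (c); characters missing from this table count as "0"
CODES = {}
for digit, letters in (("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                       ("4", "l"), ("5", "mn"), ("6", "r")):
    for letter in letters:
        CODES[letter] = digit


def soundex_sketch(word):
    """apply steps (i) to (iv) as described in the post."""
    word = word.lower()
    first = word[:1]            # step (i): keep the first letter as it is
    kept = []
    previous = "0"
    for ch in word[1:]:         # step (ii): code every later character
        digit = CODES.get(ch, "0")
        # step (iii): skip zeros and immediate repeats of the same code
        if digit != "0" and digit != previous:
            kept.append(digit)
        previous = digit
    # step (iv): letter plus at most 3 digits, padded on the right with "0"
    # (assumption: an empty input simply comes out as "0000")
    return (first + "".join(kept[:3])).ljust(4, "0")


for w in ("foxite", "foxxite", "foxoxite", "visual", "foxpro"):
    print(w, soundex_sketch(w))
# foxite f230
# foxxite f230
# foxoxite f223
# visual v240
# foxpro f216
```

note that `previous` is updated even for a zero, which is what lets the two 2's in "foxoxite" both survive.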
    (e) the whole idea appears to be the division of characters into vowel sounds (number 0) and different consonant sounds (numbers 1 to 6). if one has ever studied shorthand, one can see that the grouping of consonant sounds like (p,b) (t,d) (m,n) is very similar to that in shorthand, as the idea is the same in both cases.

    (f) finally, to conclude, it's a great function for detecting spelling mistakes, identifying similar-sounding words and tracing duplicate records in a table by indexing on the soundex() of the field.
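the duplicate-tracing idea in (f) can be illustrated outside a table too. the python sketch below groups a list of values by their code and reports the groups with more than one member; soundex_sketch and group_by_code are hypothetical names, and the coding rules are the ones described in this post, so a real table indexed on soundex() may behave differently at the edges.

```python
from collections import defaultdict

# set numbers as listed in (c); everything else counts as "0"
_CODES = {ch: digit
          for digit, chars in (("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                               ("4", "l"), ("5", "mn"), ("6", "r"))
          for ch in chars}


def soundex_sketch(word):
    """four-character code following steps (i) to (iv) of the post."""
    word = word.lower()
    kept, previous = [], "0"
    for ch in word[1:]:
        digit = _CODES.get(ch, "0")
        if digit != "0" and digit != previous:
            kept.append(digit)
        previous = digit
    return (word[:1] + "".join(kept[:3])).ljust(4, "0")


def group_by_code(values):
    """return only the codes shared by two or more values."""
    groups = defaultdict(list)
    for value in values:
        groups[soundex_sketch(value)].append(value)
    return {code: members for code, members in groups.items() if len(members) > 1}


print(group_by_code(["foxite", "visual", "foxide", "foxpro", "focsite"]))
# {'f230': ['foxite', 'foxide', 'focsite']}
```

values that land in the same group are only candidates for duplicates; they still need a human look, since quite different words can share a code.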

    v500 p260 !!

  • One Response to The Vocal chords of Soundex()

    • Craig Boyd says:

      Excellent dissection of SoundEx(). I’ve worked on a few different sounds-like algorithms for Visual FoxPro (improved soundex, metaphone, double-metaphone) which can be found out on the Fox Wikis. So, this entry was of special interest to me. Keep up the great work and thanks for the Visual FoxPro related blog… I’m subscribing.
