Crude Google-like "Did you mean?" in JavaScript

Published on 23rd of September 2008. Copyright Tavs Dokkedahl. Displayed 1096 time(s)

As I was writing some basic string functions for my library I came across a PHP funtion called levenshtein. This function, named after its russian author, will compute the levenshtein distance (LD) between two words. The LD is a value indicating how many changes is needed to morph string 1 into string 2. A change is either a letter changed, inserted or deleted. Basically its a value telling how similar two strings are. A value of 0 means the strings are identical, in case and spelling, while a value of say 4 means you have to change, remove or insert 4 letters in the later string.

I thought this could be used for a crude implementation of the "Did you mean?" functionality from Google. I know that Google provide an API for this as a web service but it might be useful to save a server roundtrip and anyway its just plain old programming fun.

A quick search turned up the algorithm which I found at http://www.merriampark.com/ld.htm (courtesy of Michael Gilleland). Its very simple to implement and the worst case running time is O(n*m) where n and m is the lengths of the two strings being compared.

In any practical situation you need a dictionary to match the string against. A dictionary need not be the entire Websters Dictionary of English - it could just be the 500 different products you have for sale at your website.

Let an array of strings represent your products. A user types in a keyword in a search box. Before the request is submitted you determine that the keyword from the search box does not match any products. No need to search but you might as well show what products were close to the keyword.

We need two functions for this application. A method related which finds the strings within a given LD from a dictionary and the LD method itself. The implementation of the levenshtein algorithm can be found at http://www.jslab.dk/library.string.levenshtein.php. For finding those strings in a dictionary d with LD less than or equal to threshold t we define

 1 String.prototype.related =
 2   function(t,d) {
 3     var ld;
 4     // Return this array
 5     var a = [];
 6     // Length of dictionary
 7     var l = d.length;
 8     // for each entry in the dictionary
 9     for(var i=0; i<l; i++) {
10       // If LD of calling string and string at a[i]
11       // is less than t then include a[i] in result
12       ld = this.levenshtein(d[i]);
13       if (ld <= t) {
14         // Save LD and string as we need LD to sort later
15         a.push( {ld:ld, s:d[i]} );
16       }
17     }
18     // Sort by LD ascending
19     a.sort(function(a,b){return a.ld - b.ld});
20     return a;
21   };

The related method creates an array which is sorted like the dictionary is sorted but we probably want to sort it by LD to list the matching strings in the resultset by descending relevance - that is ascending LD. The sort function is just the standard array sort method with a custom function saying that element a is less than, equal to or greater than b if LD of a is less than, equal to or greater the LD of b.

Say your dictionary is implemented as (the words are taken from an open source dictionary)

 1 var dictionary =
 2   [
 3     'coke',
 4     'cokeman',
 5     'cokeney',
 6     'coker',
 7     'coker',
 8     'cokers',
 9     'cokery',
10     'cokeses',
11     'cokewold',
12     'cokey',
13     'cokie',
14     'cokier',
15     'cokiest',
16     'cokneyfy',
17     'cokuloris',
18     'cokuloris'
19   ];

and a keyword of coek is inputted along with a max. LD of 3. No matches are found but the user was likely making a typo and really meant coke. Using the Levensthein approach we can find the following related words

 1 coke   (LD 2)
 2 coker  (LD 2)
 3 coker  (LD 2)
 4 cokey  (LD 2)
 5 cokery (LD 3)
 6 cokers (LD 3)
 7 cokie  (LD 3)
 8 cokier (LD 3)

This is just a minor example of the usefulness of finding similar strings. I put together a small demo with a dictionary of 1843 words for anyone to try out. Hopefully this can come in handy in a variety of situations and should be able to provide better help on mismatching searches.

Enjoy the code
Tavs


Leave a comment

Name

Email (if you want a response)

Comment (no HTML)

Spam challenge
Sorry to bother you but spam is a royal pain, so please answer this simple question to verify that you are in fact human(oid)

Question: "What is the 3 letter acronym for Cascading Style Sheets?"

Answer:

1 comment(s)