I'm about to start writing a small search engine within my site and I'm wanting to get more into assisting the user in finding the results they want.
I have considered soundex() operations, but heard that they aren't always reliable, since longer more complex soundex() results may be truncated.
I've also looked into Levenshtein Distance, but again have problems. Seems like to get a search-engine to find suggestions based upon their levenshtein-distance from the original search-term, you need to have a list of terms waiting to be compared to everybody's
search term, and that could be nasty on the server forcing a levenshtein equation 1,000+ times everytime somebody submits a search.
Even if you do the tiresome lookups, you'll get dozens of alternatives for each word - how do you decide which to suggest? Seems almost like I need to store search terms, and the number of their returned results in the database, and do levenshtein-comparisons
against those, and sort the close ones by their number of last returned results...make sense?
So I humble myself and come before you all today anybody know how to do this in an efficient way?
I had to implement something similar not too long ago. Had to also do % matching as well. In my case though, I couldn't change the underlying database and I was lucky enough to be able to narrow results down to around 100 or so before having to calculate
it. I ended up settling on a combination of Levenshtein and Double-Metaphone.
If you can add a column to the underlying table, then you might want to look at just calculating the double-metaphone and storing the hashes it creates in a couple columns. Then you can compare on the double-metaphone hashes in SQL vs. calculating on the fly.
levenshtein is not only useful on words, and there are improved soundex ideas about. You will have a lot less sounds codes than words I'd write more but on phone at the minute.
Use a mixture of Soundex (or a variant) and the levenstein function. Essentially the Soundex function tries to combat a person not knowing how a word is spelt (but knowing how it sounds) and a levenstein distance algorithm tries to combat typos.
Googling for Double metaphone will probably find you a better soundex - find the matching codes (primary and secondary for double metaphone, primary for metaphone), weight them according to popularity (if you have the info) and then order according to
Don't forget to stem the words (look for an alternative to the porter stemmer if you have time), and try and add more weight to proper nouns and nouns, a bit of weight to verbs and a negative discriminating weight to determiners.
Then add a neural net so that you can track the best matches (based on which links the user clicked) and feedback to improve accuracy in future searches.
Comments have been closed since this content was published more than 30 days ago, but if you'd like to continue the conversation, please create a new thread in our Forums, or Contact Us and let us know.