Tech Off Thread

5 posts

Forum Read Only

This forum has been made read only by the site admins. No new threads or comments can be added.

"Searching 'Samson'......did you mean 'Sampson'?"

Back to Forum: Tech Off
  • User profile image

    I'm about to start writing a small search engine within my site and I'm wanting to get more into assisting the user in finding the results they want.

    I have considered soundex() operations, but heard that they aren't always reliable, since longer more complex soundex() results may be truncated.

    I've also looked into Levenshtein Distance, but again have problems. Seems like to get a search-engine to find suggestions based upon their levenshtein-distance from the original search-term, you need to have a list of terms waiting to be compared to everybody's search term, and that could be nasty on the server forcing a levenshtein equation 1,000+ times everytime somebody submits a search.

    Even if you do the tiresome lookups, you'll get dozens of alternatives for each word - how do you decide which to suggest? Seems almost like I need to store search terms, and the number of their returned results in the database, and do levenshtein-comparisons against those, and sort the close ones by their number of last returned results...make sense?

    So I humble myself and come before you all today Wink anybody know how to do this in an efficient way?

  • User profile image

    I had to implement something similar not too long ago. Had to also do % matching as well. In my case though, I couldn't change the underlying database and I was lucky enough to be able to narrow results down to around 100 or so before having to calculate it. I ended up settling on a combination of Levenshtein and Double-Metaphone.

    If you can add a column to the underlying table, then you might want to look at just calculating the double-metaphone and storing the hashes it creates in a couple columns. Then you can compare on the double-metaphone hashes in SQL vs. calculating on the fly.

    Here's an article where a guy did it and stored the hash as a UDT.

  • User profile image

    levenshtein is not only useful on words, and there are improved soundex ideas about. You will have a lot less sounds codes than words Smiley I'd write more but on phone at the minute.

  • User profile image

    Use a mixture of Soundex (or a variant) and the levenstein function. Essentially the Soundex function tries to combat a person not knowing how a word is spelt (but knowing how it sounds) and a levenstein distance algorithm tries to combat typos.

  • User profile image

    Googling for Double metaphone will probably find you a better soundex - find the matching codes (primary and secondary for double metaphone, primary for metaphone), weight them according to popularity (if you have the info) and then order according to edit distance.

    Don't forget to stem the words (look for an alternative to the porter stemmer if you have time), and try and add more weight to proper nouns and nouns, a bit of weight to verbs and a negative discriminating weight to determiners.

    Then add a neural net so that you can track the best matches (based on which links the user clicked) and feedback to improve accuracy in future searches.

Conversation locked

This conversation has been locked by the site admins. No new comments can be made.