First the results….
Fuzzy search accuracy for commonly misspelled drug names using sphinx and solr with different stemmers
A little background:
We are about to add full text searching to our rails project, and we thought we would take a look at a few different solutions for search. We have been using solr on an older project and were fairly happy with the results, just a little disappointed in the indexing speed. Also, solr plugins for rails seem to be a little fragmented at this point. acts_as_solr has no real home other than github, where you have to play “pick a fork”. There are a few other libraries for solr (e.g. Sunspot) in rails, but none of them seem to be mature or very active.
We decided to try sphinx with the thinking_sphinx plugin. Our first impressions seemed to be very good. Extremely fast indexing, nice support for indexing associated models, and all around very easy to work with as a developer. But….
We are mostly searching non-English words (e.g. last names). After using sphinx for a little while, we began to “feel” the search results were not as good as they had been in solr. With open source search tools, the quality of your results basically boils down to which stemmer (algorithm) the tool uses to find the relevant results.
In order to get a quantifiable “score” for search results, I decided to write a little test script that would test the different configurations against a known set of drug names and their common misspellings found here. The script indexed the properly spelled drug names and then searched using the misspelled name. If the correct name was found in the top 10 results it was considered a match.
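The original script is not reproduced here, but the scoring idea can be sketched roughly as follows. Everything in this sketch is an assumption made for illustration: the method name top_k_hit_rate, the sample pairs (made up, not the list used for the results), and the toy prefix-matching backend that stands in for a real sphinx or solr query.

```ruby
# Hypothetical sketch: score a search backend by its top-k hit rate.
# `pairs` maps each misspelling to the correctly spelled name that was indexed.
# The block stands in for a real search call and must return an ordered
# list of result names for the given query.
def top_k_hit_rate(pairs, k = 10)
  return 0.0 if pairs.empty?

  hits = pairs.count { |misspelled, correct| yield(misspelled).first(k).include?(correct) }
  hits.to_f / pairs.size
end

# Illustrative pairs only; these are NOT the data behind the results above.
PAIRS = {
  "amoxicilin" => "amoxicillin",
  "ibuprofin"  => "ibuprofen",
  "xyzzy"      => "warfarin"
}

# Toy backend (assumption): "index" the correct names and match on the
# first four characters. A real test would query sphinx or solr here.
INDEX = PAIRS.values
toy_search = lambda do |query|
  INDEX.select { |name| name[0, 4] == query[0, 4] }
end

rate = top_k_hit_rate(PAIRS) { |query| toy_search.call(query) }
puts format("%.1f%% of misspellings matched", rate * 100)
```

Swapping the block for a call into each search configuration gives one comparable number per stemmer, which is all the comparison needs.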
Above are the results using different stemmers (sphinx calls these “morphology” options). Solr clearly had the best results.
Solr uses the Levenshtein distance by default if you use the “~” fuzzy search operator. Unfortunately, sphinx does not have the option to use the Levenshtein distance, only morphology options such as soundex and metaphone.
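For readers who have not met it, the Levenshtein distance is the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. Here is a minimal sketch of the textbook dynamic-programming version. It is not Solr's actual implementation, and the method name levenshtein and the sample names are just illustrative.

```ruby
# Textbook Levenshtein distance, keeping only two rows of the table.
# Illustrative sketch, not the code Solr or Lucene runs internally.
def levenshtein(a, b)
  # prev[j] = distance between the empty prefix of a and the first j chars of b
  prev = (0..b.length).to_a

  a.each_char.with_index(1) do |char_a, i|
    curr = [i] # turning the first i chars of a into "" takes i deletions
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      curr << [
        prev[j] + 1,        # delete char_a
        curr[j - 1] + 1,    # insert char_b
        prev[j - 1] + cost  # substitute (free when the chars match)
      ].min
    end
    prev = curr
  end

  prev.last
end

# Rank some hypothetical indexed names by distance to a misspelled query.
names = ["ibuprofen", "amoxicillin", "warfarin"]
query = "ibuprofin"
ranked = names.sort_by { |name| levenshtein(query, name) }
puts ranked.first
```

Because the comparison is on characters rather than sounds or word stems, a one-letter typo stays close to the correct spelling even for names that no English stemmer knows about.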
So for now we have decided to move back to Solr and most likely a fork of acts_as_solr. Any other options out there for rails and solr?