Hi Paul,

I don't have much expertise in this area, so hopefully others will answer, but maybe this is better than nothing.
I don't know many out-of-the-box solutions for this problem, but I'm sure they exist. Mahout and Carrot2 might be worth investigating.

Similarity metrics:

- Jaccard index. Measures similarity between two sets, so word order is not important.
- Levenshtein distance. If you are dealing with user-inputted typos, the Damerau-Levenshtein distance can perform a bit better because it also accounts for swapped adjacent letters (e.g. teh -> the).

I worked with some code that did this for author names, e.g. to merge "Barack Obama", "Obama B." and "B. H. Obama". It used a combination of Damerau-Levenshtein distance and the Jaccard index. It worked very well for that problem, but unfortunately the code was sparse on documentation and full of magic numbers, so I don't know the details. The approach was similar to the one described in this answer: http://stackoverflow.com/a/11920867/281469

This is an O(n^2) pairwise comparison problem, so as your data grows you have to work around that limitation. It is known in the research literature as the "all-pairs" similarity problem. The paper linked from this repository is a good read on the subject: https://code.google.com/p/google-all-pairs-similarity-search/

One way to work around it is to use Lucene to limit the number of comparisons you need to do:

- Index your documents.
- For each song title, do a fuzzy search on its words.
- Take the top N results and calculate their similarity to the title using the metrics above (or just use the Lucene score).
- Cluster the similar titles together.

This basically builds a sparse inverted index over your document matrix, so that you only compare pairs that can produce non-zero similarities. It is the most effective way of optimizing performance that I have encountered.

Again, I'm sure there are out-of-the-box solutions for this problem, but I don't know what they are. I've put a few rough sketches of the ideas above below my sign-off, in case they are a useful starting point.

Hope that helps,
Barry
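First, the Jaccard index. This is a minimal, untested sketch in plain Java; the class and method names are mine, and splitting on whitespace is the crudest possible tokenization (you'd probably want a Lucene analyzer for real data):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public final class JaccardSketch {

    /** Jaccard index of two titles, treated as sets of lower-cased words. */
    public static double jaccard(String a, String b) {
        Set<String> wordsA = words(a);
        Set<String> wordsB = words(b);
        Set<String> intersection = new HashSet<>(wordsA);
        intersection.retainAll(wordsB);
        Set<String> union = new HashSet<>(wordsA);
        union.addAll(wordsB);
        return union.isEmpty() ? 1.0 : (double) intersection.size() / union.size();
    }

    private static Set<String> words(String s) {
        return new HashSet<>(Arrays.asList(s.toLowerCase().split("\\s+")));
    }

    public static void main(String[] args) {
        System.out.println(jaccard("My Song Title", "Title My Song"));  // 1.0, word order ignored
        System.out.println(jaccard("My Song Title", "My Song (Live)")); // 0.5, partial overlap
    }
}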
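Second, the edit-distance side, using the "optimal string alignment" variant of Damerau-Levenshtein, plus one way of blending it with the Jaccard score. To be clear, the 50/50 weighting at the end is a number I made up purely for illustration; I don't know what weights the author-name code used, and the SimMetrics library you linked ships ready-made implementations of metrics like these if you'd rather not hand-roll them:

public final class EditDistanceSketch {

    /** Optimal-string-alignment variant of the Damerau-Levenshtein distance. */
    public static int damerauLevenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(
                        d[i - 1][j] + 1,          // deletion
                        d[i][j - 1] + 1),         // insertion
                        d[i - 1][j - 1] + cost);  // substitution
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // transposition: teh -> the
                }
            }
        }
        return d[a.length()][b.length()];
    }

    /** Edit distance rescaled to a similarity in [0, 1]. */
    public static double editSimilarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) damerauLevenshtein(a, b) / longest;
    }

    /** Blend of the two metrics; the equal weights are arbitrary. */
    public static double combined(String a, String b) {
        return 0.5 * editSimilarity(a.toLowerCase(), b.toLowerCase())
             + 0.5 * JaccardSketch.jaccard(a, b);
    }
}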
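Finally, a rough sketch of the Lucene candidate-generation idea: one shared index for all titles instead of a fresh RAMDirectory per pair, with fuzzy term queries pulling back a handful of candidates to compare properly. I've written this from memory against a recent Lucene API, so constructor signatures may differ in your version, and the maxEdits of 2 and top-10 cutoff are arbitrary knobs to tune:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class TitleCandidateSearch {

    public static void main(String[] args) throws IOException {
        String[] titles = { "My Song Title", "My Song Titel", "Another Song" }; // dummy data

        // 1. Index every title once, up front.
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
        for (String title : titles) {
            Document doc = new Document();
            doc.add(new TextField("title", title, Field.Store.YES));
            writer.addDocument(doc);
        }
        writer.close();

        // 2. For each title, fuzzy-search its words to find candidate matches.
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        for (String title : titles) {
            BooleanQuery.Builder query = new BooleanQuery.Builder();
            for (String word : title.toLowerCase().split("\\s+")) {
                query.add(new FuzzyQuery(new Term("title", word), 2), BooleanClause.Occur.SHOULD);
            }
            // 3. Only the top N candidates need the expensive pairwise comparison.
            for (ScoreDoc hit : searcher.search(query.build(), 10).scoreDocs) {
                String candidate = searcher.doc(hit.doc).get("title");
                // ...score (title, candidate) with the metrics above, then cluster...
            }
        }
    }
}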
On Tue, Dec 2, 2014 at 10:38 AM, Paul Taylor <paul_t...@fastmail.fm> wrote:

> I'm trying to compare two song titles (usually Latin script) for
> similarity, i.e. looking for when the two titles seem to be the same song,
> accounting for spelling mistakes, additional words, et cetera.
>
> For a number of years I've been doing this by creating a RAMDirectory,
> creating a document for one of the sentences and then doing a search using
> the other sentence and seeing if we get a good match. This has worked
> reasonably well, but since improving the performance of other parts of the
> application this part has become a performance bottleneck. That's not
> surprising, as I'm creating all these objects just for a one-off search,
> and I have to do this for many sentence pairs.
>
> So I'm now looking at the SimMetrics package
> (https://github.com/nickmancol/simmetrics), which has many algorithms for
> matching two strings.
>
> But I'm not clear on which is best. I understand Levenshtein distance, but
> I'm sure there are better things than this now; I think Lucene uses cosine
> similarity in some form.
>
> And the missing bit for me is that these algorithms make no distinction
> between comparing two words and comparing two sentences. This seems
> important for getting good matches, so do I need to build something around
> them? I can't simply match word1 with word1, word2 with word2, because one
> sentence may have additional words and still be a good match.
>
> Maybe sticking with Lucene is best, but using it in a more efficient way.
>
> Looking for some general advice/direction from the Lucene experts on how
> to proceed!
>
> Paul
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org