Paul, for a pair-wise comparison Cosine Similarity does pretty good for most purposes.
On Wed, Dec 3, 2014 at 10:45 AM, Paul Taylor <paul_t...@fastmail.fm> wrote: > On 03/12/2014 15:14, Barry Coughlan wrote: >> >> Hi Paul, >> >> I don't have much expertise in this area so hopefully others will answer, >> but maybe this is better than nothing. >> >> I don't know many out-of-the-box solutions for this problem, but I'm sure >> they exist. Mahout and Carrot2 might be worth investigating. >> >> Similarity Metrics: >> - Jaccard Index. Measures similarity between two sets, so word order is >> not important. >> - Levenshtein distance. If you are dealing with user-inputted typos, the >> Damerau–Levenshtein distance can perform a bit better because it takes into >> account swapping adjacent letters (e.g. teh -> the). >> >> I worked with some code that did this for author names e.g. merge "Barack >> Obama", "Obama B." and "B. H. Obama". It used a combination of >> Damerau–Levenshtein distance and Jaccard index. It worked very well for this >> problem, but unfortunately the code was sparse on documentation and full of >> magic numbers so I don't know the details. The approach was similar to the >> approach described in this answer: >> http://stackoverflow.com/a/11920867/281469 >> >> This is an O(n^2) pairwise comparison problem. As your data gets bigger >> you have to work around this limitation. This problem is known in research >> literature as the "all-pairs" similarity problem. The paper linked from this >> repository is a good read on the subject: >> https://code.google.com/p/google-all-pairs-similarity-search/ >> >> One of the ways you can work around this is by using Lucene to limit the >> amount of comparisons you need to do: >> - Index your documents. >> - For each song title do a fuzzy search of the words. >> - Take the top N results, calculate their similarity with the song title >> using the metrics (or just use the Lucene score). >> - Cluster similar titles by song title. >> >> This is basically creating a sparse inverted index of your document >> matrix, so that you can find results that will produce non-zero >> similarities. This is the most effective way of optimizing performance that >> I have encountered. >> >> Again, I'm sure there are out-of-the-box solutions for this problem, but I >> don't know what they are. >> >> Hope that helps, >> Barry >> > Thankyou barry I wil spend some time going through your suggestions, in the > library Im looking at I dont seem to have Damerau–Levenshtein but I do have > jaccardSimilarity so if that understands words Ill will give it a try. > > |BlockDistance > ChapmanLengthDeviation > ChapmanMatchingSoundex > ChapmanMeanLength > ChapmanOrderedNameCompoundSimilarity > CosineSimilarity > DiceSimilarity > EuclideanDistance > InterfaceStringMetric > JaccardSimilarity > Jaro > JaroWinkler > Levenshtein > MatchingCoefficient > MongeElkan > NeedlemanWunch > OverlapCoefficient > QGramsDistance > SmithWaterman > SmithWatermanGotoh > SmithWatermanGotohWindowedAffine > Soundex > | > One things, regaridng your lucene based solution I think you have missed an > important point. I am only comparing TWO strings at any time, I dont have a > dataset of thousands of sentences that I want to compare over time I > literally have string a and string b and I just want to compare those, at a > later date Ill have string c and d, but at no point do I have strings > a,b,c,d. I'm not trying to find the best matching string for a single title > just is this String a good match for this song title. > > Paul -- sk...@alum.mit.edu (617) 595-5946 --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org