I'm trying to compare two song titles (usually in Latin script) for similarity. I want to detect when two titles appear to be the same song, allowing for spelling mistakes, additional words, etc.

For a number of years I've been doing this by creating a RAMDirectory, creating a document for one of the sentences, then searching with the other sentence and seeing whether we get a good match. This has worked reasonably well, but since I improved the performance of other parts of the application, this part has become the bottleneck. That's not surprising, as I'm creating all these objects just for a one-off search, and I have to do this for many sentence pairs.

So I'm now looking at the simmetrics package (https://github.com/nickmancol/simmetrics), which has many algorithms for matching two strings.

But I'm not clear on which is best. I understand Levenshtein distance, but I'm sure there are better approaches now; I think Lucene uses cosine similarity in some form.
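
To make that concrete, here is a rough sketch of what I understand plain cosine similarity over word counts to be. This is only my illustration, not Lucene's actual scoring and not the simmetrics API; the class and method names (CosineTitleSimilarity, termCounts, cosine, norm) are made up for this example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: cosine similarity between word-count vectors of two titles.
class CosineTitleSimilarity {

    // Count how often each lower-cased word occurs in the title.
    static Map<String, Integer> termCounts(String title) {
        Map<String, Integer> counts = new HashMap<>();
        // Split on anything that is not a letter or digit.
        for (String token : title.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Euclidean length of a word-count vector.
    static double norm(Map<String, Integer> counts) {
        double sum = 0;
        for (int c : counts.values()) {
            sum += (double) c * c;
        }
        return Math.sqrt(sum);
    }

    // Cosine of the angle between the two word-count vectors, in [0, 1].
    static double cosine(String a, String b) {
        Map<String, Integer> ca = termCounts(a);
        Map<String, Integer> cb = termCounts(b);
        if (ca.isEmpty() || cb.isEmpty()) {
            return 0.0;
        }
        double dot = 0;
        for (Map.Entry<String, Integer> e : ca.entrySet()) {
            Integer other = cb.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        return dot / (norm(ca) * norm(cb));
    }

    public static void main(String[] args) {
        System.out.println(cosine("Hey Jude", "Hey Jude Remastered"));
        System.out.println(cosine("Yesterday", "Yesterdya"));
    }
}
```

As written, a misspelled word contributes nothing to the score because only exact word matches count, which is part of why I don't think this alone is enough.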

The missing piece for me is that these algorithms make no distinction between comparing two words and comparing two sentences. That seems important for good matching, so do I need to build something around them? I can't simply match the first word of one sentence with the first word of the other, the second with the second, and so on, because one sentence may have additional words and still be a good match.
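
The kind of wrapper I have in mind would look roughly like the sketch below: split both titles into words, then for each word of the shorter title take its best normalized Levenshtein match among the words of the longer title, and average those scores. This is just an assumption about how one might do it, not something taken from simmetrics or Lucene, and all names here (TokenTitleSimilarity, wordSimilarity, titleSimilarity) are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper: word-by-word fuzzy matching of two titles.
class TokenTitleSimilarity {

    // Classic dynamic-programming Levenshtein distance using two rows.
    static int levenshtein(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    // Edit distance scaled to [0, 1], where 1.0 means identical words.
    static double wordSimilarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / maxLen;
    }

    // Lower-case the title and split it on anything that is not a letter or digit.
    static String[] tokens(String title) {
        String cleaned = title.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N}]+", " ").trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split(" ");
    }

    // Average, over the words of the shorter title, of the best match in the longer title.
    static double titleSimilarity(String a, String b) {
        String[] ta = tokens(a);
        String[] tb = tokens(b);
        if (ta.length == 0 || tb.length == 0) {
            return 0.0;
        }
        String[] shorter = ta.length <= tb.length ? ta : tb;
        String[] longer = ta.length <= tb.length ? tb : ta;
        double total = 0;
        for (String word : shorter) {
            double best = 0;
            for (String candidate : longer) {
                best = Math.max(best, wordSimilarity(word, candidate));
            }
            total += best;
        }
        return total / shorter.length;
    }

    public static void main(String[] args) {
        System.out.println(titleSimilarity("Hey Jude", "Hey Jude (Remastered 2015)"));
        System.out.println(titleSimilarity("Yesterday", "Yesterdya"));
    }
}
```

One obvious weakness is that extra words in the longer title are ignored completely, so a one-word title would match any longer title containing that word; some penalty for unmatched words would probably be needed.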

Maybe sticking with Lucene is best, but using it in a more efficient way.

I'm looking for some general advice/direction from the Lucene experts on how to proceed!

Paul

