How best to compare two sentences

Paul Taylor Tue, 02 Dec 2014 02:43:58 -0800

I'm trying to compare two song titles (usually in Latin script) for similarity. I'm looking to detect when the two titles appear to be the same song, accounting for spelling mistakes, additional words, et cetera.

For a number of years I've been doing this by creating a RAMDirectory, creating a document for one of the sentences, and then doing a search using the other sentence and seeing if we get a good match. This has worked reasonably well, but since I improved the performance of other parts of the application, this part has become a performance bottleneck. That's not surprising, as I'm creating all these objects just for a one-off search, and I have to do this for many sentence pairs.

So I'm now looking at the simmetrics package (https://github.com/nickmancol/simmetrics), which has many algorithms for matching two strings.

But I'm not clear on which is best. I understand Levenshtein distance, but I'm sure there are better things than this now; I think Lucene uses cosine similarity in some form.
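
To make the comparison concrete, here is a minimal sketch of cosine similarity over word counts in plain Java. It is only an illustration of the general idea, not Lucene's actual scoring; the names CosineSketch, termCounts, norm and cosine are made up for this example, and the tokenisation (lower-casing, splitting on anything that is not a letter or digit) is an assumption.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class, not part of Lucene or simmetrics.
final class CosineSketch {

    // Count lower-cased tokens, splitting on runs of non-letter, non-digit characters.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Euclidean length of a term-count vector.
    static double norm(Map<String, Integer> vector) {
        double sum = 0;
        for (int count : vector.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    // Cosine of the angle between the two term-count vectors; 0.0 if either is empty.
    static double cosine(String a, String b) {
        Map<String, Integer> va = termCounts(a);
        Map<String, Integer> vb = termCounts(b);
        double dot = 0;
        for (Map.Entry<String, Integer> entry : va.entrySet()) {
            Integer other = vb.get(entry.getKey());
            if (other != null) {
                dot += (double) entry.getValue() * other;
            }
        }
        double na = norm(va);
        double nb = norm(vb);
        return (na == 0 || nb == 0) ? 0.0 : dot / (na * nb);
    }
}
```

Because this variant compares whole words, an extra word only lowers the score moderately, but a misspelt word counts as a completely different term, so on its own it does not cover spelling mistakes.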

The missing bit for me is that these algorithms make no distinction between comparing two words and comparing two sentences. This seems important for getting good matches, so do I need to build something around them? I can't simply match word 1 with word 1 and word 2 with word 2, because one sentence may have additional words and still be a good match.
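
One way to "build something around" a word-level measure, sketched here under stated assumptions, is to split both titles into words, score each word of the shorter title against its best-matching word in the longer one using a normalised Levenshtein similarity, and average those best scores (similar in spirit to the Monge-Elkan scheme). The class TitleMatchSketch and all of its method names are hypothetical, and the tokenisation rule is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of Lucene or simmetrics.
final class TitleMatchSketch {

    // Classic Levenshtein edit distance using two rolling rows.
    static int levenshtein(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    // Edit distance scaled to [0, 1], where 1.0 means identical words.
    static double wordSimilarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / longest;
    }

    // Lower-case and split on runs of non-letter, non-digit characters.
    static List<String> tokens(String title) {
        List<String> result = new ArrayList<>();
        for (String token : title.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Average, over the words of the shorter title, of the best match in the longer title.
    static double titleSimilarity(String a, String b) {
        List<String> ta = tokens(a);
        List<String> tb = tokens(b);
        if (ta.isEmpty() || tb.isEmpty()) {
            return 0.0;
        }
        List<String> shorter = ta.size() <= tb.size() ? ta : tb;
        List<String> longer = shorter == ta ? tb : ta;
        double total = 0;
        for (String word : shorter) {
            double best = 0;
            for (String candidate : longer) {
                best = Math.max(best, wordSimilarity(word, candidate));
            }
            total += best;
        }
        return total / shorter.size();
    }
}
```

With this scheme, additional words in the longer title do not lower the score, and a misspelt word still scores highly against its intended counterpart. The flip side is that a one-word title matches any longer title containing that word, so a length penalty or a minimum threshold would probably be needed in practice.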


Maybe sticking with Lucene is best but using it in a more efficient way.

Looking for some general advice/direction from the Lucene experts on how to proceed!


Paul


