I'm trying to compare two song titles (usually Latin script) for
similarity. I want to detect when two titles appear to be the same
song, allowing for spelling mistakes, additional words, et cetera.
For a number of years I've been doing this by creating a
RAMDirectory, adding a document for one of the sentences, and then
searching with the other sentence to see if we get a good
match. This has worked reasonably well, but since I improved the
performance of other parts of the application, this part has become a
performance bottleneck. That's not surprising, as I'm creating all
these objects just for a one-off search, and I have to do this for
many sentence pairs.
So I'm now looking at the simmetrics package
(https://github.com/nickmancol/simmetrics), which has many
algorithms for matching two strings.
But I'm not clear on which is best. I understand Levenshtein
distance, but I'm sure there are better options now; I think
Lucene uses cosine similarity in some form.
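For reference, this is roughly what I mean by Levenshtein distance,
as a minimal plain-Java sketch. It is not the simmetrics or Lucene
implementation; the class and method names (LevenshteinSketch,
distance, similarity) are made up for illustration, and normalising
by the longer string's length is just one common convention.

```java
import java.util.Locale;

// Hypothetical helper, not part of simmetrics or Lucene.
final class LevenshteinSketch {

    // Edit distance (insert, delete, substitute each cost 1),
    // computed with two rolling rows of the DP table.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1], prev[j]) + 1;
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalisation: 1.0 means identical (ignoring case and
    // surrounding whitespace), 0.0 means nothing in common.
    static double similarity(String a, String b) {
        String x = a.toLowerCase(Locale.ROOT).trim();
        String y = b.toLowerCase(Locale.ROOT).trim();
        int longest = Math.max(x.length(), y.length());
        if (longest == 0) {
            return 1.0;
        }
        return 1.0 - (double) distance(x, y) / longest;
    }
}
```

Applied to whole titles this copes with typos, but an extra word
such as "(Remastered)" costs one edit per character, which is the
problem I describe next.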
The missing piece for me is that these algorithms make no distinction
between comparing two words and comparing two sentences. That seems
important for getting good matches, so do I need to build something
around them? I can't simply match word1 with word1, word2 with word2,
because one sentence may have additional words and still be a good
match.
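To make the question concrete, here is a rough sketch of the kind of
wrapper I imagine building: split each title into words, and for each
word of the shorter title take its best Levenshtein-based match among
the words of the longer title, then average. All names here
(TitleMatchSketch, tokens, wordSimilarity, titleSimilarity) are
hypothetical, and the scoring rule is an assumption of mine, not
something taken from simmetrics or Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of word-by-word best-match scoring.
final class TitleMatchSketch {

    // Lower-case the title and split on anything that is not a
    // letter or digit, dropping empty tokens.
    static List<String> tokens(String title) {
        List<String> out = new ArrayList<>();
        for (String t : title.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Plain Levenshtein distance with two rolling rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1], prev[j]) + 1;
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Word similarity in [0, 1], normalised by the longer word.
    static double wordSimilarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) distance(a, b) / longest;
    }

    // Average, over the words of the shorter title, of each word's
    // best match in the longer title. Word order is ignored.
    static double titleSimilarity(String t1, String t2) {
        List<String> a = tokens(t1);
        List<String> b = tokens(t2);
        if (a.isEmpty() || b.isEmpty()) {
            return a.isEmpty() && b.isEmpty() ? 1.0 : 0.0;
        }
        List<String> shorter = a.size() <= b.size() ? a : b;
        List<String> longer = a.size() <= b.size() ? b : a;
        double sum = 0.0;
        for (String s : shorter) {
            double best = 0.0;
            for (String l : longer) {
                best = Math.max(best, wordSimilarity(s, l));
            }
            sum += best;
        }
        return sum / shorter.size();
    }
}
```

With this rule, "Yellow Submarine" against "Yelow Submarine
(Remastered)" scores high because the extra word is simply ignored.
The obvious weakness is that a short title fully contained in a much
longer one also scores 1.0, so some penalty for unmatched words would
probably be needed.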
Maybe sticking with Lucene is best but using it in a more efficient way.
I'm looking for some general advice/direction from the Lucene experts
on how to proceed!
Paul