interpreting scores

Nate Wed, 06 May 2009 17:58:01 -0700

Hi all,

First, the problem I'm trying to solve: I have two folders, each
containing files. I need to match files in one folder with files in
the other. For example:


notes/Michael Jackson - Don't Stop 'till You Get Enough.notes
songs/Michael Jackson Don't stop until you get enough.mp3

I provide the note files, but the song files come from a user's music
library, so they are often not named well. I am attempting to use
Lucene to find the most likely note file for each song file.

I index the note files, then I use the StandardAnalyzer with carefully
chosen stop words to search the index. The query uses each word in the
song file name (without the extension) as a term. Fuzzy matching is
used for words longer than 4 characters, and the fuzzy percentage is
set to 1 / termlength. This works OK so far, though I would love to
hear opinions on any improvements I could make. This is my first use
of Lucene, so I'm not sure I've chosen the best approach.
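
To make the query construction concrete, here is a rough sketch of the
rule described above, in plain Java with no Lucene dependency. The
class name, the buildQuery helper, the tokenizing regex and the single
stop word are all made up for illustration. Writing the fuzzy value
after "~" follows Lucene's classic query syntax, but that is an
assumption about how the value is passed, not a description of the
actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: names and tokenizing rules are assumptions.
class SongQuerySketch {

    // Turns a song file name into a space-separated query string.
    static String buildQuery(String fileName, Set<String> stopWords) {
        // Drop the extension, if there is one.
        int dot = fileName.lastIndexOf('.');
        String base = dot > 0 ? fileName.substring(0, dot) : fileName;

        List<String> clauses = new ArrayList<>();
        // Split on anything that is not a letter, digit or apostrophe.
        for (String raw : base.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}']+")) {
            // Trim stray leading/trailing apostrophes ("'till" -> "till").
            String word = raw.replaceAll("^'+|'+$", "");
            if (word.isEmpty() || stopWords.contains(word)) {
                continue;
            }
            if (word.length() > 4) {
                // Fuzzy value of 1 / termlength, as stated in the post.
                double fuzzy = 1.0 / word.length();
                clauses.add(String.format(Locale.ROOT, "%s~%.2f", word, fuzzy));
            } else {
                // Short words are matched exactly.
                clauses.add(word);
            }
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        String name = "Michael Jackson Don't stop until you get enough.mp3";
        System.out.println(buildQuery(name, Set.of("you")));
    }
}
```

Under these assumptions a 7-letter word such as "michael" gets a fuzzy
value of about 0.14, while a 4-letter word such as "stop" is matched
exactly.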

The problem I'm having is: Sometimes there is a song file that has no
matching note file. In this case I get back results with "low" scores,
such as 0.2 or 0.05. A "really good" match gives me 7 or 8. I don't
really understand what the scoring means, so I don't know what would
be a reasonable threshold below which to ignore matches.

I understand scores are not relevance percentages. I think the scores
are only useful relative to other scores. Is this right? Are they only
relative to scores from the same search, or from any search against
the same index? How can I know if a score is "low", so I can ignore
matches that aren't very good?

Sorry if this has been discussed before. I have searched around a
great deal and was unable to find a straight answer.

Thanks!
-Nate
