Re: interpreting scores

Karl Wettin Fri, 08 May 2009 10:09:20 -0700


On 8 May 2009, at 13:13, Nate wrote:

Is it possible to get a count for how many terms a result matched?

Currently I think you can only do that by using Searcher.explain(). But that is not a very nice solution. A better solution is being worked on and might be available in a few months or so.



   karl

Googling, it doesn't appear to be done easily. I tried it out by
breaking my query into words myself, then doing a search for each one
and keeping track of the results and counts. This way I know if 4 out
of 5 terms matched a document, it is probably a pretty good match. If
1 out of 5 matched then it probably isn't a great match.
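
The per-term counting described above can be sketched outside Lucene in plain Java. This is only an illustration under stated assumptions: the class and method names are hypothetical, the tokenization (lowercase, drop apostrophes, split on non-alphanumerics) is a stand-in for whatever analyzer the index uses, and only exact term matches are counted, with no fuzzy matching.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene: counts how many distinct
// query terms also occur in a candidate file name.
final class TermMatchCount {

    // Lowercase, drop apostrophes, then split on anything that is not a
    // letter or digit. This tokenization is an assumption, not StandardAnalyzer.
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        String cleaned = text.toLowerCase(Locale.ROOT).replace("'", "");
        for (String token : cleaned.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Number of distinct query terms that also appear in the candidate.
    static int matchedTerms(String query, String candidate) {
        Set<String> matched = tokens(query);
        matched.retainAll(tokens(candidate));
        return matched.size();
    }

    public static void main(String[] args) {
        String song = "Michael Jackson Don't stop until you get enough";
        String note = "Michael Jackson - Don't Stop 'till You Get Enough";
        System.out.println(matchedTerms(song, note) + " of "
                + tokens(song).size() + " terms matched");
    }
}
```

For the two file names from this thread, 7 of 8 terms match ("until" and "till" differ). A ratio like this is comparable across searches, unlike the raw score, but it ignores term rarity and near-misses.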

1) Is this approach reasonable?
2) What, if anything, do I lose by doing it this way?
3) How could I incorporate ngrams?

Thanks!
-Nate


On Thu, May 7, 2009 at 9:57 PM, Nate <n...@n4te.com> wrote:
Hi Karl,

No, sometimes there will not be a matching MP3 for a note file. When
this happens, the results I get are very poor. For example, if a song
with a common song word like "love" in the name does not have a
matching note file, then I get a handful of results that contain the
word "love" but are otherwise obviously not a good match. I need some
way to judge the quality of the matches, or possibly some other
approach to doing the search that helps avoid false positives.

On your clue, I have been reading about ngrams. Very interesting! I
see it is very useful for spell checking. However, how would I
leverage ngrams for my needs? Would the Lucene SpellChecker classes be
of any use?

I really feel like I'm floundering here. I am more than willing to put
in the work, I just need a push or two in the right directions. :)
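
One possible way to use ngrams for judging match quality is to compare the sets of character n-grams of two file names directly. The sketch below uses the Dice coefficient, which is a choice made for this illustration and not something proposed in the thread; it is not Lucene's scoring and does not use the SpellChecker classes. All names are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene: similarity of two strings
// based on the character n-grams they share.
final class NGramSimilarity {

    // Distinct character n-grams of the lowercased text.
    static Set<String> grams(String text, int n) {
        String s = text.toLowerCase(Locale.ROOT);
        Set<String> result = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            result.add(s.substring(i, i + n));
        }
        return result;
    }

    // Dice coefficient: 2 * |shared grams| / (|grams of a| + |grams of b|).
    // The result lies in [0, 1]; 1 means identical n-gram sets.
    static double dice(String a, String b, int n) {
        Set<String> ga = grams(a, n);
        Set<String> gb = grams(b, n);
        if (ga.isEmpty() && gb.isEmpty()) {
            return 0.0; // both inputs shorter than n: nothing to compare
        }
        Set<String> shared = new HashSet<>(ga);
        shared.retainAll(gb);
        return 2.0 * shared.size() / (ga.size() + gb.size());
    }

    public static void main(String[] args) {
        String song = "Michael Jackson Don't stop until you get enough";
        System.out.println(dice(song, "Michael Jackson - Don't Stop 'till You Get Enough", 3));
        System.out.println(dice(song, "Love Me Tender", 3));
    }
}
```

Because the value is bounded between 0 and 1 regardless of the query, a fixed cutoff becomes meaningful in a way it is not for raw scores. Where to put that cutoff would still have to be tuned on real file names.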

Thanks!
-Nate
On Thu, May 7, 2009 at 7:50 AM, Karl Wettin <karl.wet...@gmail.com> wrote:
Nate,

will there always be a corresponding mp3 for any given note sheet?

As for analysis, I'd try using ngrams of the complete untokenized file name
if I were you.

"Michael Jackson Don't Stop 'till You Get Enough" ->
"^mic", "mich", "icha", "chae", "hael", "ael ", "el j", "l ja", and so on.
See
http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/analysis/ngram/package-summary.html
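
The n-gram expansion in the example above can be reproduced with a few lines of plain Java. This is a minimal sketch that does not use the Lucene ngram package linked above; the class and method names are hypothetical, and the gram size of 4 and the "^" start marker are read off the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: character n-grams of a
// whole, untokenized file name, with "^" marking the start.
final class CharNGrams {

    static List<String> charNGrams(String text, int n) {
        // Lowercase and prefix a start marker, as in the example above.
        String s = "^" + text.toLowerCase(Locale.ROOT);
        List<String> grams = new ArrayList<>();
        // Slide a window of n characters over the string, spaces included.
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(charNGrams("Michael Jackson Don't Stop 'till You Get Enough", 4));
    }
}
```

Keeping the spaces in the grams means word boundaries and word order contribute to the match, which is the point of working on the untokenized name.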


   karl

On 7 May 2009, at 08:28, Nate wrote:
Thanks Anshum.
What happens if a search returns only one match, and that match is not
very "good"? If scores are only comparable to the scores of other
matches in the same search, then the score is effectively meaningless
if there is only one match.

It seems like a very common need to want to provide a "relevance"
metric along with search results. I somewhat understand the
complexities after reading this thread and the threads it links...
http://www.gossamer-threads.com/lists/lucene/java-user/75002
My case is slightly better since I don't care to show users the
metric. My queries are simple term and boolean queries.
This thread talks about "theoretical maximum score" but quickly loses
me. Does this seem like the road to go down, given my needs?
http://www.gossamer-threads.com/lists/lucene/java-user/61075#61075

Say I do a search like:
Michael Jackson Don't stop until you get enough
And this is the top match:
Michael Jackson Don't Stop 'till You Get Enough
Would it make any sense to do a query with the exact contents of the
top match to get a maximum score for that document? Would the
resulting percentage be meaningful?

-Nate


On Wed, May 6, 2009 at 10:08 PM, Anshum <ansh...@gmail.com> wrote:
Hi Nate,
The scores are only comparable within the same search and not across
different searches, as the scores are affected by the query as well as
the docs.
About the threshold, I guess you could have a count cutoff to get the 'x'
best matches. I say so because I can't recall anything that could use the
score as a metric to absolutely cluster 'good' and 'not good' matches.

--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com
The facts expressed here belong to everybody, the opinions to me. The
distinction is yours to draw............


On Thu, May 7, 2009 at 6:27 AM, Nate <n...@n4te.com> wrote:
Hi all,

First, the problem I'm trying to solve: I have two folders, each
containing files. I need to match files in one folder with files in
the other. Eg:

notes/Michael Jackson - Don't Stop 'till You Get Enough.notes
songs/Michael Jackson Don't stop until you get enough.mp3

I provide the notes files, but the song files come from a user's music
library, so often are not named well. I am attempting to use Lucene to
find the most likely note file for each song file.

I index the note files, then I use the StandardAnalyzer with carefully
chosen stop words to search the index. The query uses each word in the
song file name (w/o extension) as a term. Fuzzy matching is used for
words with > 4 characters, and the fuzzy percentage is set to be 1 /
term length. This works ok so far, though I would love to hear opinions
on any improvements I could make. This is my first use of Lucene, so
I'm not sure I've chosen the best approach.

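
The query construction described above might look roughly like the following sketch, which only builds a query string in the classic Lucene query parser syntax (`term~similarity` for a fuzzy term) and does not call Lucene itself. The class and method names are hypothetical. It takes "1 / term length" literally as the value after `~`, which is an assumption: the message does not say exactly how the number is passed to Lucene. Stop-word removal and escaping of query parser special characters are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: turns the words of a song
// file name into a query string with fuzzy terms for longer words.
final class FuzzyQueryString {

    static String build(String fileNameWithoutExtension) {
        List<String> parts = new ArrayList<>();
        for (String word : fileNameWithoutExtension.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            if (word.length() > 4) {
                // Assumption: the "fuzzy percentage" is 1 / term length and is
                // written directly after the tilde.
                double similarity = 1.0 / word.length();
                parts.add(String.format(Locale.ROOT, "%s~%.2f", word, similarity));
            } else {
                // Short words are searched as plain terms.
                parts.add(word);
            }
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        System.out.println(build("michael jackson stop enough"));
    }
}
```

For the input shown, the sketch produces `michael~0.14 jackson~0.14 stop enough~0.17`. Note that with this reading, longer words get a lower value after the tilde.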
The problem I'm having is: Sometimes there is a song file that has no
matching note file. In this case I get back results with "low" scores,
such as 0.2 or 0.05. A "really good" match gives me 7 or 8. I don't
really understand what the scoring means, so I don't know what would
be a reasonable threshold to ignore scores.

I understand scores are not relevance percentages. I think the scores
are only useful relative to other scores. Is this right? Are they only
relative to scores from the same search, or from any search against
the same index? How can I know if a score is "low", so I can ignore
matches that aren't very good?

Sorry if this has been discussed before. I have searched around a
great deal and was unable to find a straight answer.

Thanks!
-Nate

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org