Re: interpreting scores

Nate Fri, 08 May 2009 04:14:17 -0700

Is it possible to get a count for how many terms a result matched?
From Googling around, it doesn't appear to be easy to do. I tried it out by
breaking my query into words myself, then doing a search for each one
and keeping track of the results and counts. This way I know if 4 out
of 5 terms matched a document, it is probably a pretty good match. If
1 out of 5 matched then it probably isn't a great match.
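
A minimal sketch of this per-term counting idea, using plain Java string handling rather than Lucene itself. The class name, the method names, and the simple tokenizer are assumptions made for illustration; a real setup would run both names through the same analyzer used for indexing.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

class TermMatchCount {
    // Hypothetical tokenizer: lowercase, then split on anything that is not
    // a letter, digit, or apostrophe. Duplicates collapse into one term.
    static Set<String> terms(String text) {
        Set<String> out = new LinkedHashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Number of distinct query terms that also occur in the candidate name.
    static int matchedTerms(String query, String candidate) {
        Set<String> q = terms(query);
        q.retainAll(terms(candidate));
        return q.size();
    }

    public static void main(String[] args) {
        String song = "Michael Jackson Don't stop until you get enough";
        String note = "Michael Jackson - Don't Stop 'till You Get Enough";
        int total = terms(song).size();
        int hits = matchedTerms(song, note);
        // Prints "7 of 8 terms matched": "until" and "'till" differ exactly.
        System.out.println(hits + " of " + total + " terms matched");
    }
}
```

Exact term comparison like this ignores the fuzzy matching and the term weighting that a Lucene query applies, so the ratio is only a rough signal of match quality.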


1) Is this approach reasonable?
2) What, if anything, do I lose by doing it this way?
3) How could I incorporate ngrams?

Thanks!
-Nate


On Thu, May 7, 2009 at 9:57 PM, Nate <n...@n4te.com> wrote:
> Hi Karl,
>
> No, sometimes there will not be a matching MP3 for a note file. When
> this happens, the results I get are very poor. For example, if a song
> with a common song word like "love" in the name does not have a
> matching note file, then I get a handful of results that contain the
> word "love" but are otherwise obviously not a good match. I need some
> way to judge the quality of the matches, or possibly some other
> approach to doing the search that helps avoid false positives.
>
> On your clue, I have been reading about ngrams. Very interesting! I
> see it is very useful for spell checking. However, how would I
> leverage ngrams for my needs? Would the Lucene SpellChecker classes be
> of any use?
>
> I really feel like I'm floundering here. I am more than willing to put
> in the work, I just need a push or two in the right directions. :)
>
> Thanks!
> -Nate
>
>
> On Thu, May 7, 2009 at 7:50 AM, Karl Wettin <karl.wet...@gmail.com> wrote:
>> Nate,
>>
>> will there always be a corresponding mp3 for any given note sheet?
>>
>>
>> As for analysis, I'd try using ngrams of the complete untokenized file name
>> if I were you.
>>
>> "Michael Jackson Don't Stop 'till You Get Enough" ->
>> "^mic", "mich", "icha", "chae", "hael", "ael ", "el j", "l ja", and so on.
>>
>> See
>> http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/analysis/ngram/package-summary.html
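
The n-gram splitting in the example above can be sketched in plain Java without Lucene's ngram package. The class name, the `^` start marker handling, and the Jaccard overlap measure are assumptions for illustration; the overlap measure is one possible way to compare two file names and is not something proposed in this thread.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class CharNGrams {
    // All overlapping character n-grams of the lowercased, untokenized name,
    // with '^' prepended to mark the start of the string.
    static List<String> ngrams(String name, int n) {
        String s = "^" + name.toLowerCase(Locale.ROOT);
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    // Assumed similarity measure: shared distinct n-grams divided by all
    // distinct n-grams of both names (Jaccard index), between 0.0 and 1.0.
    static double jaccard(String a, String b, int n) {
        Set<String> ga = new HashSet<>(ngrams(a, n));
        Set<String> gb = new HashSet<>(ngrams(b, n));
        Set<String> union = new HashSet<>(ga);
        union.addAll(gb);
        if (union.isEmpty()) {
            return 0.0;
        }
        ga.retainAll(gb);
        return (double) ga.size() / union.size();
    }

    public static void main(String[] args) {
        // Prints [^mic, mich, icha, chae, hael]
        System.out.println(ngrams("Michael Jackson", 4).subList(0, 5));
        System.out.println(jaccard(
                "Michael Jackson Don't stop until you get enough",
                "Michael Jackson Don't Stop 'till You Get Enough", 4));
    }
}
```

Because the n-grams overlap, a small spelling difference such as "until" versus "'till" only changes the few n-grams that touch it, while the rest of the name still matches.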
>>
>>
>>    karl
>>
>> 7 maj 2009 kl. 08.28 skrev Nate:
>>
>>> Thanks Anshum.
>>>
>>> What happens if a search returns only one match, and that match is not
>>> very "good"? If scores are only comparable to the scores of other
>>> matches in the same search, then the score is effectively meaningless
>>> if there is only one match.
>>>
>>> It seems like a very common need to want to provide a "relevance"
>>> metric along with search results. I somewhat understand the
>>> complexities after reading this thread and the threads it links...
>>> http://www.gossamer-threads.com/lists/lucene/java-user/75002
>>> My case is slightly better since I don't care to show users the
>>> metric. My queries are simple term and boolean queries.
>>>
>>> This thread talks about "theoretical maximum score" but quickly loses
>>> me. Does this seem like the road to go down, given my needs?
>>> http://www.gossamer-threads.com/lists/lucene/java-user/61075#61075
>>>
>>> Say I do a search like:
>>> Michael Jackson Don't stop until you get enough
>>> And this is the top match:
>>> Michael Jackson Don't Stop 'till You Get Enough
>>> Would it make any sense to do a query with the exact contents of the
>>> top match to get a maximum score for that document? Would the
>>> resulting percentage be meaningful?
>>>
>>> -Nate
>>>
>>>
>>> On Wed, May 6, 2009 at 10:08 PM, Anshum <ansh...@gmail.com> wrote:
>>>>
>>>> Hi Nate,
>>>> The scores are only comparable within the same search and not across
>>>> different searches, as the scores are affected by the query as well as
>>>> the docs.
>>>> About the threshold, I guess you could have a count cutoff to get the 'x'
>>>> best matches. I say so because I can't recall anything that could use the
>>>> score as a metric to cluster 'good' and 'not good' matches in absolute
>>>> terms.
>>>>
>>>> --
>>>> Anshum Gupta
>>>> Naukri Labs!
>>>> http://ai-cafe.blogspot.com
>>>>
>>>> The facts expressed here belong to everybody, the opinions to me. The
>>>> distinction is yours to draw............
>>>>
>>>>
>>>> On Thu, May 7, 2009 at 6:27 AM, Nate <n...@n4te.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> First, the problem I'm trying to solve: I have two folders, each
>>>>> containing files. I need to match files in one folder with files in
>>>>> the other. Eg:
>>>>>
>>>>> notes/Michael Jackson - Don't Stop 'till You Get Enough.notes
>>>>> songs/Michael Jackson Don't stop until you get enough.mp3
>>>>>
>>>>> I provide the notes files, but the song files come from a user's music
>>>>> library, so often are not named well. I am attempting to use Lucene to
>>>>> find the most likely note file for each song file.
>>>>>
>>>>> I index the note files, then I use the StandardAnalyzer with carefully
>>>>> chosen stop words to search the index. The query uses each word in the
>>>>> song file name (w/o extension) as a term. Fuzzy matching is used for
>>>>> words with > 4 characters, and the fuzzy percentage is set to be 1 /
>>>>> termlength. This works ok so far, though I would love to hear opinions
>>>>> on any improvements I could make. This is my first use of Lucene, so
>>>>> I'm not sure I've chosen the best approach.
>>>>>
>>>>> The problem I'm having is: Sometimes there is a song file that has no
>>>>> matching note file. In this case I get back results with "low" scores,
>>>>> such as 0.2 or 0.05. A "really good" match gives me 7 or 8. I don't
>>>>> really understand what the scoring means, so I don't know what would
>>>>> be a reasonable threshold to ignore scores.
>>>>>
>>>>> I understand scores are not relevance percentages. I think the scores
>>>>> are only useful relative to other scores. Is this right? Are they only
>>>>> relative to scores from the same search, or from any search against
>>>>> the same index? How can I know if a score is "low", so I can ignore
>>>>> matches that aren't very good?
>>>>>
>>>>> Sorry if this has been discussed before. I have searched around a
>>>>> great deal and was unable to find a straight answer.
>>>>>
>>>>> Thanks!
>>>>> -Nate
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>

