Re: interpreting scores

Nate Thu, 07 May 2009 21:58:28 -0700

Hi Karl,

No, sometimes there will not be a matching MP3 for a note file. When
this happens, the results I get are very poor. For example, if a song
with a common song word like "love" in the name does not have a
matching note file, then I get a handful of results that contain the
word "love" but are otherwise obviously not a good match. I need some
way to judge the quality of the matches, or possibly some other
approach to doing the search that helps avoid false positives.
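
One rough way to picture such a quality check, sketched in plain Python
rather than with the Lucene API: measure what fraction of the song's words
actually show up in the candidate note file name, and reject candidates
below a cutoff. The stop word list, the 0.6 cutoff, the helper names and the
"Love Me Do" / "Love Story" file names are all assumptions made up for
illustration, not part of the actual setup.

```python
import re

# Assumed stop words, for illustration only; the real list would be tuned.
STOP_WORDS = {"the", "a", "an", "and", "of"}


def words(name):
    """Lowercase a file name and split it into a set of words, minus stop words."""
    tokens = re.findall(r"[a-z0-9']+", name.lower())
    return {t for t in tokens if t not in STOP_WORDS}


def coverage(song_name, note_name):
    """Fraction of the song's words that also occur in the note file name."""
    song = words(song_name)
    if not song:
        return 0.0
    return len(song & words(note_name)) / len(song)


def accept(song_name, note_name, min_coverage=0.6):
    """Keep a candidate only if enough of the song's words were matched."""
    return coverage(song_name, note_name) >= min_coverage


if __name__ == "__main__":
    note = "Michael Jackson - Don't Stop 'till You Get Enough"
    print(coverage("Michael Jackson Don't stop until you get enough", note))
    print(coverage("Love Me Do", "Love Story"))
```

Unlike a raw Lucene score, this ratio always falls between 0 and 1, so a
single shared word such as "love" cannot look like a good match on its own.
It does exact word matching only, so misspelled words count as misses.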


On your clue, I have been reading about ngrams. Very interesting! I
see it is very useful for spell checking. However, how would I
leverage ngrams for my needs? Would the Lucene SpellChecker classes be
of any use?
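
To make the n-gram idea concrete, here is a minimal sketch in plain Python
(not the Lucene ngram package Karl linked): cut each whole file name into
4-character grams, as in his example, and compare the two sets of grams with
the Dice coefficient. The "$" end marker, the choice of Dice and the function
names are assumptions for illustration; inside Lucene the grams would instead
be produced by an analyzer at index and query time.

```python
def ngrams(text, n=4):
    """Character n-grams of the lowercased text; '^' and '$' mark the two ends."""
    padded = "^" + text.lower() + "$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice(a, b, n=4):
    """Dice coefficient of the two n-gram sets, from 0.0 (disjoint) to 1.0 (equal)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


if __name__ == "__main__":
    note = "Michael Jackson - Don't Stop 'till You Get Enough"
    print(dice("Michael Jackson Don't stop until you get enough", note))
    print(dice("All You Need Is Love", note))
```

Because the result is bounded between 0 and 1 regardless of the query, a
fixed cutoff is at least meaningful here, though where to put it would still
have to be found by trying it on real file names.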

I really feel like I'm floundering here. I am more than willing to put
in the work, I just need a push or two in the right directions. :)

Thanks!
-Nate


On Thu, May 7, 2009 at 7:50 AM, Karl Wettin <[email protected]> wrote:
> Nate,
>
> will there always be a corresponding mp3 for any given note sheet?
>
>
> As for analysis, I'd try using ngrams of the complete untokenized file name
> if I were you.
>
> "Michael Jackson Don't Stop 'till You Get Enough" ->
> "^mic", "mich", "icha", "chae", "hael", "ael ", "el j", "l ja", and so on.
>
> See
> http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/analysis/ngram/package-summary.html
>
>
>    karl
>
> On 7 May 2009, at 08:28, Nate wrote:
>
>> Thanks Anshum.
>>
>> What happens if a search returns only one match, and that match is not
>> very "good"? If scores are only comparable to the scores of other
>> matches in the same search, then the score is effectively meaningless
>> if there is only one match.
>>
>> It seems like a very common need to want to provide a "relevance"
>> metric along with search results. I somewhat understand the
>> complexities after reading this thread and the threads it links...
>> http://www.gossamer-threads.com/lists/lucene/java-user/75002
>> My case is slightly better since I don't care to show users the
>> metric. My queries are simple term and boolean queries.
>>
>> This thread talks about "theoretical maximum score" but quickly loses
>> me. Does this seem like the road to go down, given my needs?
>> http://www.gossamer-threads.com/lists/lucene/java-user/61075#61075
>>
>> Say I do a search like:
>> Michael Jackson Don't stop until you get enough
>> And this is the top match:
>> Michael Jackson Don't Stop 'till You Get Enough
>> Would it make any sense to do a query with the exact contents of the
>> top match to get a maximum score for that document? Would the
>> resulting percentage be meaningful?
>>
>> -Nate
>>
>>
>> On Wed, May 6, 2009 at 10:08 PM, Anshum <[email protected]> wrote:
>>>
>>> Hi Nate,
>>> The scores are only comparable within the same search and not across
>>> different searches, as the scores are affected by the query as well as
>>> the docs.
>>> About the threshold, I guess you could use a count cutoff to get the 'x'
>>> best matches. I say so because I can't recall anything that could use
>>> the score as an absolute metric to separate 'good' from 'not good'
>>> matches.
>>>
>>> --
>>> Anshum Gupta
>>> Naukri Labs!
>>> http://ai-cafe.blogspot.com
>>>
>>> The facts expressed here belong to everybody, the opinions to me. The
>>> distinction is yours to draw............
>>>
>>>
>>> On Thu, May 7, 2009 at 6:27 AM, Nate <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> First, the problem I'm trying to solve: I have two folders, each
>>>> containing files. I need to match files in one folder with files in
>>>> the other. Eg:
>>>>
>>>> notes/Michael Jackson - Don't Stop 'till You Get Enough.notes
>>>> songs/Michael Jackson Don't stop until you get enough.mp3
>>>>
>>>> I provide the notes files, but the song files come from a user's music
>>>> library, so often are not named well. I am attempting to use Lucene to
>>>> find the most likely note file for each song file.
>>>>
>>>> I index the note files, then I use the StandardAnalyzer with carefully
>>>> chosen stop words to search the index. The query uses each word in the
>>>> song file name (w/o extension) as a term. Fuzzy matching is used for
>>>> words with > 4 characters, and the fuzzy percentage is set to be 1 /
>>>> termlength. This works ok so far, though I would love to hear opinions
>>>> on any improvements I could make. This is my first use of Lucene, so
>>>> I'm not sure I've chosen the best approach.
>>>>
>>>> The problem I'm having is: Sometimes there is a song file that has no
>>>> matching note file. In this case I get back results with "low" scores,
>>>> such as 0.2 or 0.05. A "really good" match gives me 7 or 8. I don't
>>>> really understand what the scoring means, so I don't know what would
>>>> be a reasonable threshold to ignore scores.
>>>>
>>>> I understand scores are not relevance percentages. I think the scores
>>>> are only useful relative to other scores. Is this right? Are they only
>>>> relative to scores from the same search, or from any search against
>>>> the same index? How can I know if a score is "low", so I can ignore
>>>> matches that aren't very good?
>>>>
>>>> Sorry if this has been discussed before. I have searched around a
>>>> great deal and was unable to find a straight answer.
>>>>
>>>> Thanks!
>>>> -Nate
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>>
>>>>
>>>
>>
>
>
>
