interpreting scores
Hi all, First, the problem I'm trying to solve: I have two folders, each containing files. I need to match files in one folder with files in the other. Eg: notes/Michael Jackson - Don't Stop 'till You Get Enough.notes songs/Michael Jackson Don't stop until you get enough.mp3 I provide the notes files, but the song files come from a user's music library, so often are not named well. I am attempting to use Lucene to find the most likely note file for each song file. I index the note files, then I use the StandardAnalyzer with carefully chosen stop words to search the index. The query uses each word in the song file name (w/o extension) as a term. Fuzzy matching is used for words with > 4 characters, and the fuzzy percentage is set to be 1 / termlength. This works ok so far, though I would love to hear opinions on any improvements I could make. This is my first use of Lucene, so I'm not sure I've chosen the best approach. The problem I'm having is: Sometimes there is a song file that has no matching note file. In this case I get back results with "low" scores, such as 0.2 or 0.05. A "really good" match gives me 7 or 8. I don't really understand what the scoring means, so I don't know what would be a reasonable threshold to ignore scores. I understand scores are not relevance percentages. I think the scores are only useful relative to other scores. Is this right? Are they only relative to scores from the same search, or from any search against the same index? How can I know if a score is "low", so I can ignore matches that aren't very good? Sorry if this has been discussed before. I have searched around a great deal and was unable to find a straight answer. Thanks! -Nate - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: interpreting scores
Thanks Anshum. What happens if a search returns only one match, and that match is not very "good"? If scores are only comparable to the scores of other matches in the same search, then the score is effectively meaningless if there is only one match. It seems like a very common need to want to provide a "relevance" metric along with search results. I somewhat understand the complexities after reading this thread and the threads it links... http://www.gossamer-threads.com/lists/lucene/java-user/75002 My case is slightly better since I don't care to show users the metric. My queries are simple term and boolean queries. This thread talks about "theoretical maximum score" but quickly loses me. Does this seem like the road to go down, given my needs? http://www.gossamer-threads.com/lists/lucene/java-user/61075#61075 Say I do a search like: Michael Jackson Don't stop until you get enough And this is the top match: Michael Jackson Don't Stop 'till You Get Enough Would it make any sense to do a query with the exact contents of the top match to get a maximum score for that document? Would the resulting percentage be meaningful? -Nate On Wed, May 6, 2009 at 10:08 PM, Anshum wrote: > Hi Nate, > The scores are only comparable within the same search and not over different > searches as the scores are affected by query as well as docs. > About the threshold, I guess you could have count cutoff to get 'x' best > matches. Said so coz I'm not really able to recollect anything which could > use score as a metric to absolutely cluster 'good' and 'not good' matches. > > -- > Anshum Gupta > Naukri Labs! > http://ai-cafe.blogspot.com > > The facts expressed here belong to everybody, the opinions to me. The > distinction is yours to draw > > > On Thu, May 7, 2009 at 6:27 AM, Nate wrote: > >> Hi all, >> >> First, the problem I'm trying to solve: I have two folders, each >> containing files. I need to match files in one folder with files in >> the other. Eg: >> >> notes/Michael Jackson - Don't Stop 'till You Get Enough.notes >> songs/Michael Jackson Don't stop until you get enough.mp3 >> >> I provide the notes files, but the song files come from a user's music >> library, so often are not named well. I am attempting to use Lucene to >> find the most likely note file for each song file. >> >> I index the note files, then I use the StandardAnalyzer with carefully >> chosen stop words to search the index. The query uses each word in the >> song file name (w/o extension) as a term. Fuzzy matching is used for >> words with > 4 characters, and the fuzzy percentage is set to be 1 / >> termlength. This works ok so far, though I would love to hear opinions >> on any improvements I could make. This is my first use of Lucene, so >> I'm not sure I've chosen the best approach. >> >> The problem I'm having is: Sometimes there is a song file that has no >> matching note file. In this case I get back results with "low" scores, >> such as 0.2 or 0.05. A "really good" match gives me 7 or 8. I don't >> really understand what the scoring means, so I don't know what would >> be a reasonable threshold to ignore scores. >> >> I understand scores are not relevance percentages. I think the scores >> are only useful relative to other scores. Is this right? Are they only >> relative to scores from the same search, or from any search against >> the same index? How can I know if a score is "low", so I can ignore >> matches that aren't very good? >> >> Sorry if this has been discussed before. I have searched around a >> great deal and was unable to find a straight answer. >> >> Thanks! >> -Nate >> >> - >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> > - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: interpreting scores
Hi Karl, No, sometimes there will not be a matching MP3 for a note file. When this happens, the results I get are very poor. For example, if a song with a common song word like "love" in the name does not have a matching note file, then I get a handful of results that contain the word "love" but are otherwise obviously not a good match. I need some way to judge the quality of the matches, or possible some other approach to doing the search that helps avoid false positives. On your clue, I have been reading about ngrams. Very interesting! I see it is very useful for spell checking. However, how would I leverage ngrams for my needs? Would the Lucene SpellChecker classes be of any use? I really feel like I'm floundering here. I am more than willing to put in the work, I just need a push or two in the right directions. :) Thanks! -Nate On Thu, May 7, 2009 at 7:50 AM, Karl Wettin wrote: > Nate, > > will there always be a correspodning mp3 for any given note sheet? > > > As for analysis, I'd try using ngrams of the complete untokenized file name > if I was you. > > "Michael Jackson Don't Stop 'till You Get Enough" -> > "^mic", "mich", "icha", "chae", "hael", "ael ", "el j", "l ja", and so on. > > See > http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/analysis/ngram/package-summary.html > > > karl > > 7 maj 2009 kl. 08.28 skrev Nate: > >> Thanks Anshum. >> >> What happens if a search returns only one match, and that match is not >> very "good"? If scores are only comparable to the scores of other >> matches in the same search, then the score is effectively meaningless >> if there is only one match. >> >> It seems like a very common need to want to provide a "relevance" >> metric along with search results. I somewhat understand the >> complexities after reading this thread and the threads it links... >> http://www.gossamer-threads.com/lists/lucene/java-user/75002 >> My case is slightly better since I don't care to show users the >> metric. My queries are simple term and boolean queries. >> >> This thread talks about "theoretical maximum score" but quickly loses >> me. Does this seem like the road to go down, given my needs? >> http://www.gossamer-threads.com/lists/lucene/java-user/61075#61075 >> >> Say I do a search like: >> Michael Jackson Don't stop until you get enough >> And this is the top match: >> Michael Jackson Don't Stop 'till You Get Enough >> Would it make any sense to do a query with the exact contents of the >> top match to get a maximum score for that document? Would the >> resulting percentage be meaningful? >> >> -Nate >> >> >> On Wed, May 6, 2009 at 10:08 PM, Anshum wrote: >>> >>> Hi Nate, >>> The scores are only comparable within the same search and not over >>> different >>> searches as the scores are affected by query as well as docs. >>> About the threshold, I guess you could have count cutoff to get 'x' best >>> matches. Said so coz I'm not really able to recollect anything which >>> could >>> use score as a metric to absolutely cluster 'good' and 'not good' >>> matches. >>> >>> -- >>> Anshum Gupta >>> Naukri Labs! >>> http://ai-cafe.blogspot.com >>> >>> The facts expressed here belong to everybody, the opinions to me. The >>> distinction is yours to draw >>> >>> >>> On Thu, May 7, 2009 at 6:27 AM, Nate wrote: >>> >>>> Hi all, >>>> >>>> First, the problem I'm trying to solve: I have two folders, each >>>> containing files. I need to match files in one folder with files in >>>> the other. Eg: >>>> >>>> notes/Michael Jackson - Don't Stop 'till You Get Enough.notes >>>> songs/Michael Jackson Don't stop until you get enough.mp3 >>>> >>>> I provide the notes files, but the song files come from a user's music >>>> library, so often are not named well. I am attempting to use Lucene to >>>> find the most likely note file for each song file. >>>> >>>> I index the note files, then I use the StandardAnalyzer with carefully >>>> chosen stop words to search the index. The query uses each word in the >>>> song file name (w/o extension) as a term. Fuzzy matching is used for >>>> words with > 4 characters
Re: interpreting scores
Is it possible to get a count for how many terms a result matched? Googling, it doesn't appear to be done easily. I tried it out by breaking my query into words myself, then doing a search for each one and keeping track of the results and counts. This way I know if 4 out of 5 terms matched a document, it is probably a pretty good match. If 1 out of 5 matched then it probably isn't a great match. 1) Is this approach reasonable? 2) What, if anything, do I lose by doing it this way? 3) How could I incorporate ngrams? Thanks! -Nate On Thu, May 7, 2009 at 9:57 PM, Nate wrote: > Hi Karl, > > No, sometimes there will not be a matching MP3 for a note file. When > this happens, the results I get are very poor. For example, if a song > with a common song word like "love" in the name does not have a > matching note file, then I get a handful of results that contain the > word "love" but are otherwise obviously not a good match. I need some > way to judge the quality of the matches, or possible some other > approach to doing the search that helps avoid false positives. > > On your clue, I have been reading about ngrams. Very interesting! I > see it is very useful for spell checking. However, how would I > leverage ngrams for my needs? Would the Lucene SpellChecker classes be > of any use? > > I really feel like I'm floundering here. I am more than willing to put > in the work, I just need a push or two in the right directions. :) > > Thanks! > -Nate > > > On Thu, May 7, 2009 at 7:50 AM, Karl Wettin wrote: >> Nate, >> >> will there always be a correspodning mp3 for any given note sheet? >> >> >> As for analysis, I'd try using ngrams of the complete untokenized file name >> if I was you. >> >> "Michael Jackson Don't Stop 'till You Get Enough" -> >> "^mic", "mich", "icha", "chae", "hael", "ael ", "el j", "l ja", and so on. >> >> See >> http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/analysis/ngram/package-summary.html >> >> >> karl >> >> 7 maj 2009 kl. 08.28 skrev Nate: >> >>> Thanks Anshum. >>> >>> What happens if a search returns only one match, and that match is not >>> very "good"? If scores are only comparable to the scores of other >>> matches in the same search, then the score is effectively meaningless >>> if there is only one match. >>> >>> It seems like a very common need to want to provide a "relevance" >>> metric along with search results. I somewhat understand the >>> complexities after reading this thread and the threads it links... >>> http://www.gossamer-threads.com/lists/lucene/java-user/75002 >>> My case is slightly better since I don't care to show users the >>> metric. My queries are simple term and boolean queries. >>> >>> This thread talks about "theoretical maximum score" but quickly loses >>> me. Does this seem like the road to go down, given my needs? >>> http://www.gossamer-threads.com/lists/lucene/java-user/61075#61075 >>> >>> Say I do a search like: >>> Michael Jackson Don't stop until you get enough >>> And this is the top match: >>> Michael Jackson Don't Stop 'till You Get Enough >>> Would it make any sense to do a query with the exact contents of the >>> top match to get a maximum score for that document? Would the >>> resulting percentage be meaningful? >>> >>> -Nate >>> >>> >>> On Wed, May 6, 2009 at 10:08 PM, Anshum wrote: >>>> >>>> Hi Nate, >>>> The scores are only comparable within the same search and not over >>>> different >>>> searches as the scores are affected by query as well as docs. >>>> About the threshold, I guess you could have count cutoff to get 'x' best >>>> matches. Said so coz I'm not really able to recollect anything which >>>> could >>>> use score as a metric to absolutely cluster 'good' and 'not good' >>>> matches. >>>> >>>> -- >>>> Anshum Gupta >>>> Naukri Labs! >>>> http://ai-cafe.blogspot.com >>>> >>>> The facts expressed here belong to everybody, the opinions to me. The >>>> distinction is yours to draw >>>> >>>> >>>> On Thu, May 7, 2009 at 6:27 AM, Nate wrote: >>>> >>>>> Hi all, >
Re: interpreting scores
Wow Karl, thank you so much for writing this up! It was a great help! I have the ngram tokenizing working as you described. Searches are very good! In order to verify the hits are of high quality, I use the Smith-Waterman algorithm. Other approximate string comparisons I evaluated didn't work well because the scores change with the string length. Eg... Song file: red hot chili pepper under bridge Lucene match: red hot chili peppers under bridge Song name of hit for quality verification: under bridge The song file we searched with is compared with the song name of the hit. Most algorithms give something like 0.6. I don't know which parts of the song file are the artist, if any, but if it is there then artists with longer names score lower with these algorithms. However, Smith-Waterman gives 1.0 (perfect match). If bridge were misspelled, it would still get a high score. In most cases, if there is no good match, I can detect it and show no results. There is one problem though, take this example where there is no "good" match: Song file: u2 beautiful day Lucene match: christina aguilera beautiful Song name of hit for quality verification: beautiful When I plug "u2 beautiful day" and "beautiful" into Smith-Waterman, I get 1.0 and so I allow this incorrect hit to be first place. I don't think there is really a way around this problem though. I guess maybe the artist name could be required to be somewhere in the file path. Overall I am pretty happy with my matching! I still need to do more testing on some friends' dirty (the organization, not the content ;)) music libraries, but from my initial testing I think it works very nicely. Thanks again! Any further advice or comments is very much welcomed. :) -Nate On Fri, May 8, 2009 at 10:06 AM, Karl Wettin wrote: > Ngrams can be use for lots of stuff. In your case it has nothing to do with > spellchecking, it was the "until" vs. "'till" that made me think of them as > they would allow you to get at least partial matching of the text. Also, > ngrams gives you a bit of phrase functionallity. > > Create the grams by passing a token containing the complete text to > NgramTokenFilter. Something like this: > > TokenStream ts = new SingleTokenTokenStream(new Token("Michael Jackson Don't > Stop 'till You Get Enough")); > ts = new LowerCaseFilter(ts); > ts = new NgramTokenFilter(ts, 4, 4); > document.add(new Field("ngrams", ts)); > > At query time you create a query the same way, perhaps something like this: > > TokenStream ts = new SingleTokenTokenStream(new Token("Michael Jackson Don't > Stop 'till You Get Enough")); > ts = new LowerCaseFilter(ts); > ts = new NgramTokenFilter(ts, 4, 4); > BooleanQuery bq = new BooleanQuery(); > Token token; > while ((token = ts.next(new Token()) != null) { > bq.add(new BooleanClause(new TermQuery("ngrams", token.text()), > BooleanClause.SHOULD); > } > > You might want to fiddle around with the gram sizes. > > In order to detect a match you might want to take a look at the distance > between scores. > > 0.98 : Michael Jackson, Don't stop 'till you get enough > 0.96 : Michael Jackson, Don't stop until you get enough > 0.07 : Barry White, Can't get enough of your love babe > 0.06 : Depeche Mode, Just can't get enough > > Manual inspection of a bunch of queries that match the correct document vs. > queries that looks for something that does not exist in your index will give > you an indication of what distance from the top score is intersting. It is > very probable that if you search for something that does not exist then the > top hits will all have very similar score, and if you search for something > that does exist then there will be a rather large gap between the correct > hit and the first non correct hit. > > In order to fortify the response with knowledge that it really was a good > hit in the top you might want to use some edit distance measure such as > Levenstein or perhaps even something like the Jaccard index. > > I've actually done something similar to this. It worked almost flawless but > as my data required 100% certainty someone had to manually check the data to > avoid errors. I think that if you can settle with something like 95% > certainty then it should be possible to have this automated. > > > > karl > > 8 maj 2009 kl. 06.57 skrev Nate: > >> Hi Karl, >> >> No, sometimes there will not be a matching MP3 for a note file. When >> this happens, the results I get are very poor. For example, if a song >> with a common song word like "love" in the name does not have a >> matching note file, then I get a handful of results tha
Re: pagination search results
No. http://tinyurl.com/qwhhrp -Nate On Wed, May 13, 2009 at 8:32 PM, Daniel Susanto wrote: > Do you have any example for that?? I'm using JSP, what i need is paging just > like in google search... > > :D > > thx > > Daniel Susanto > http://susantodaniel.wordpress.com > > --- On Tue, 5/12/09, Ian Lea wrote: > > From: Ian Lea > Subject: Re: pagination search results > To: java-user@lucene.apache.org > Date: Tuesday, May 12, 2009, 3:10 PM > > See the FAQ entry "How do I implement paging, i.e. showing result from > 1-10, 11-20 etc?". > > > -- > Ian. > > > On Tue, May 12, 2009 at 6:37 AM, Daniel Susanto > wrote: >> Hi, >> >> Please help, how to do pagination search results? >> >> note: i'm using Java Server Pages. >> >> thx. >> >> Daniel Susanto >> http://susantodaniel.wordpress.com >> >> >> > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org