Is it possible to get a count for how many terms a result matched? Googling, it doesn't appear to be done easily. I tried it out by breaking my query into words myself, then doing a search for each one and keeping track of the results and counts. This way I know if 4 out of 5 terms matched a document, it is probably a pretty good match. If 1 out of 5 matched then it probably isn't a great match.
1) Is this approach reasonable? 2) What, if anything, do I lose by doing it this way? 3) How could I incorporate ngrams? Thanks! -Nate On Thu, May 7, 2009 at 9:57 PM, Nate <n...@n4te.com> wrote: > Hi Karl, > > No, sometimes there will not be a matching MP3 for a note file. When > this happens, the results I get are very poor. For example, if a song > with a common song word like "love" in the name does not have a > matching note file, then I get a handful of results that contain the > word "love" but are otherwise obviously not a good match. I need some > way to judge the quality of the matches, or possible some other > approach to doing the search that helps avoid false positives. > > On your clue, I have been reading about ngrams. Very interesting! I > see it is very useful for spell checking. However, how would I > leverage ngrams for my needs? Would the Lucene SpellChecker classes be > of any use? > > I really feel like I'm floundering here. I am more than willing to put > in the work, I just need a push or two in the right directions. :) > > Thanks! > -Nate > > > On Thu, May 7, 2009 at 7:50 AM, Karl Wettin <karl.wet...@gmail.com> wrote: >> Nate, >> >> will there always be a correspodning mp3 for any given note sheet? >> >> >> As for analysis, I'd try using ngrams of the complete untokenized file name >> if I was you. >> >> "Michael Jackson Don't Stop 'till You Get Enough" -> >> "^mic", "mich", "icha", "chae", "hael", "ael ", "el j", "l ja", and so on. >> >> See >> http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/analysis/ngram/package-summary.html >> >> >> karl >> >> 7 maj 2009 kl. 08.28 skrev Nate: >> >>> Thanks Anshum. >>> >>> What happens if a search returns only one match, and that match is not >>> very "good"? If scores are only comparable to the scores of other >>> matches in the same search, then the score is effectively meaningless >>> if there is only one match. >>> >>> It seems like a very common need to want to provide a "relevance" >>> metric along with search results. I somewhat understand the >>> complexities after reading this thread and the threads it links... >>> http://www.gossamer-threads.com/lists/lucene/java-user/75002 >>> My case is slightly better since I don't care to show users the >>> metric. My queries are simple term and boolean queries. >>> >>> This thread talks about "theoretical maximum score" but quickly loses >>> me. Does this seem like the road to go down, given my needs? >>> http://www.gossamer-threads.com/lists/lucene/java-user/61075#61075 >>> >>> Say I do a search like: >>> Michael Jackson Don't stop until you get enough >>> And this is the top match: >>> Michael Jackson Don't Stop 'till You Get Enough >>> Would it make any sense to do a query with the exact contents of the >>> top match to get a maximum score for that document? Would the >>> resulting percentage be meaningful? >>> >>> -Nate >>> >>> >>> On Wed, May 6, 2009 at 10:08 PM, Anshum <ansh...@gmail.com> wrote: >>>> >>>> Hi Nate, >>>> The scores are only comparable within the same search and not over >>>> different >>>> searches as the scores are affected by query as well as docs. >>>> About the threshold, I guess you could have count cutoff to get 'x' best >>>> matches. Said so coz I'm not really able to recollect anything which >>>> could >>>> use score as a metric to absolutely cluster 'good' and 'not good' >>>> matches. >>>> >>>> -- >>>> Anshum Gupta >>>> Naukri Labs! >>>> http://ai-cafe.blogspot.com >>>> >>>> The facts expressed here belong to everybody, the opinions to me. The >>>> distinction is yours to draw............ >>>> >>>> >>>> On Thu, May 7, 2009 at 6:27 AM, Nate <n...@n4te.com> wrote: >>>> >>>>> Hi all, >>>>> >>>>> First, the problem I'm trying to solve: I have two folders, each >>>>> containing files. I need to match files in one folder with files in >>>>> the other. Eg: >>>>> >>>>> notes/Michael Jackson - Don't Stop 'till You Get Enough.notes >>>>> songs/Michael Jackson Don't stop until you get enough.mp3 >>>>> >>>>> I provide the notes files, but the song files come from a user's music >>>>> library, so often are not named well. I am attempting to use Lucene to >>>>> find the most likely note file for each song file. >>>>> >>>>> I index the note files, then I use the StandardAnalyzer with carefully >>>>> chosen stop words to search the index. The query uses each word in the >>>>> song file name (w/o extension) as a term. Fuzzy matching is used for >>>>> words with > 4 characters, and the fuzzy percentage is set to be 1 / >>>>> termlength. This works ok so far, though I would love to hear opinions >>>>> on any improvements I could make. This is my first use of Lucene, so >>>>> I'm not sure I've chosen the best approach. >>>>> >>>>> The problem I'm having is: Sometimes there is a song file that has no >>>>> matching note file. In this case I get back results with "low" scores, >>>>> such as 0.2 or 0.05. A "really good" match gives me 7 or 8. I don't >>>>> really understand what the scoring means, so I don't know what would >>>>> be a reasonable threshold to ignore scores. >>>>> >>>>> I understand scores are not relevance percentages. I think the scores >>>>> are only useful relative to other scores. Is this right? Are they only >>>>> relative to scores from the same search, or from any search against >>>>> the same index? How can I know if a score is "low", so I can ignore >>>>> matches that aren't very good? >>>>> >>>>> Sorry if this has been discussed before. I have searched around a >>>>> great deal and was unable to find a straight answer. >>>>> >>>>> Thanks! >>>>> -Nate >>>>> >>>>> --------------------------------------------------------------------- >>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>>> >>>>> >>>> >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>> >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org