Wow Karl, thank you so much for writing this up! It was a great help! I have the ngram tokenizing working as you described. Searches are very good!
In order to verify the hits are of high quality, I use the Smith-Waterman algorithm. Other approximate string comparisons I evaluated didn't work well because the scores change with the string length. Eg... Song file: red hot chili pepper under bridge Lucene match: red hot chili peppers under bridge Song name of hit for quality verification: under bridge The song file we searched with is compared with the song name of the hit. Most algorithms give something like 0.6. I don't know which parts of the song file are the artist, if any, but if it is there then artists with longer names score lower with these algorithms. However, Smith-Waterman gives 1.0 (perfect match). If bridge were misspelled, it would still get a high score. In most cases, if there is no good match, I can detect it and show no results. There is one problem though, take this example where there is no "good" match: Song file: u2 beautiful day Lucene match: christina aguilera beautiful Song name of hit for quality verification: beautiful When I plug "u2 beautiful day" and "beautiful" into Smith-Waterman, I get 1.0 and so I allow this incorrect hit to be first place. I don't think there is really a way around this problem though. I guess maybe the artist name could be required to be somewhere in the file path. Overall I am pretty happy with my matching! I still need to do more testing on some friends' dirty (the organization, not the content ;)) music libraries, but from my initial testing I think it works very nicely. Thanks again! Any further advice or comments is very much welcomed. :) -Nate On Fri, May 8, 2009 at 10:06 AM, Karl Wettin <karl.wet...@gmail.com> wrote: > Ngrams can be use for lots of stuff. In your case it has nothing to do with > spellchecking, it was the "until" vs. "'till" that made me think of them as > they would allow you to get at least partial matching of the text. Also, > ngrams gives you a bit of phrase functionallity. > > Create the grams by passing a token containing the complete text to > NgramTokenFilter. Something like this: > > TokenStream ts = new SingleTokenTokenStream(new Token("Michael Jackson Don't > Stop 'till You Get Enough")); > ts = new LowerCaseFilter(ts); > ts = new NgramTokenFilter(ts, 4, 4); > document.add(new Field("ngrams", ts)); > > At query time you create a query the same way, perhaps something like this: > > TokenStream ts = new SingleTokenTokenStream(new Token("Michael Jackson Don't > Stop 'till You Get Enough")); > ts = new LowerCaseFilter(ts); > ts = new NgramTokenFilter(ts, 4, 4); > BooleanQuery bq = new BooleanQuery(); > Token token; > while ((token = ts.next(new Token()) != null) { > bq.add(new BooleanClause(new TermQuery("ngrams", token.text()), > BooleanClause.SHOULD); > } > > You might want to fiddle around with the gram sizes. > > In order to detect a match you might want to take a look at the distance > between scores. > > 0.98 : Michael Jackson, Don't stop 'till you get enough > 0.96 : Michael Jackson, Don't stop until you get enough > 0.07 : Barry White, Can't get enough of your love babe > 0.06 : Depeche Mode, Just can't get enough > > Manual inspection of a bunch of queries that match the correct document vs. > queries that looks for something that does not exist in your index will give > you an indication of what distance from the top score is intersting. It is > very probable that if you search for something that does not exist then the > top hits will all have very similar score, and if you search for something > that does exist then there will be a rather large gap between the correct > hit and the first non correct hit. > > In order to fortify the response with knowledge that it really was a good > hit in the top you might want to use some edit distance measure such as > Levenstein or perhaps even something like the Jaccard index. > > I've actually done something similar to this. It worked almost flawless but > as my data required 100% certainty someone had to manually check the data to > avoid errors. I think that if you can settle with something like 95% > certainty then it should be possible to have this automated. > > > > karl > > 8 maj 2009 kl. 06.57 skrev Nate: > >> Hi Karl, >> >> No, sometimes there will not be a matching MP3 for a note file. When >> this happens, the results I get are very poor. For example, if a song >> with a common song word like "love" in the name does not have a >> matching note file, then I get a handful of results that contain the >> word "love" but are otherwise obviously not a good match. I need some >> way to judge the quality of the matches, or possible some other >> approach to doing the search that helps avoid false positives. >> >> On your clue, I have been reading about ngrams. Very interesting! I >> see it is very useful for spell checking. However, how would I >> leverage ngrams for my needs? Would the Lucene SpellChecker classes be >> of any use? >> >> I really feel like I'm floundering here. I am more than willing to put >> in the work, I just need a push or two in the right directions. :) >> >> Thanks! >> -Nate >> >> >> On Thu, May 7, 2009 at 7:50 AM, Karl Wettin <karl.wet...@gmail.com> wrote: >>> >>> Nate, >>> >>> will there always be a correspodning mp3 for any given note sheet? >>> >>> >>> As for analysis, I'd try using ngrams of the complete untokenized file >>> name >>> if I was you. >>> >>> "Michael Jackson Don't Stop 'till You Get Enough" -> >>> "^mic", "mich", "icha", "chae", "hael", "ael ", "el j", "l ja", and so >>> on. >>> >>> See >>> >>> http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/analysis/ngram/package-summary.html >>> >>> >>> karl >>> >>> 7 maj 2009 kl. 08.28 skrev Nate: >>> >>>> Thanks Anshum. >>>> >>>> What happens if a search returns only one match, and that match is not >>>> very "good"? If scores are only comparable to the scores of other >>>> matches in the same search, then the score is effectively meaningless >>>> if there is only one match. >>>> >>>> It seems like a very common need to want to provide a "relevance" >>>> metric along with search results. I somewhat understand the >>>> complexities after reading this thread and the threads it links... >>>> http://www.gossamer-threads.com/lists/lucene/java-user/75002 >>>> My case is slightly better since I don't care to show users the >>>> metric. My queries are simple term and boolean queries. >>>> >>>> This thread talks about "theoretical maximum score" but quickly loses >>>> me. Does this seem like the road to go down, given my needs? >>>> http://www.gossamer-threads.com/lists/lucene/java-user/61075#61075 >>>> >>>> Say I do a search like: >>>> Michael Jackson Don't stop until you get enough >>>> And this is the top match: >>>> Michael Jackson Don't Stop 'till You Get Enough >>>> Would it make any sense to do a query with the exact contents of the >>>> top match to get a maximum score for that document? Would the >>>> resulting percentage be meaningful? >>>> >>>> -Nate >>>> >>>> >>>> On Wed, May 6, 2009 at 10:08 PM, Anshum <ansh...@gmail.com> wrote: >>>>> >>>>> Hi Nate, >>>>> The scores are only comparable within the same search and not over >>>>> different >>>>> searches as the scores are affected by query as well as docs. >>>>> About the threshold, I guess you could have count cutoff to get 'x' >>>>> best >>>>> matches. Said so coz I'm not really able to recollect anything which >>>>> could >>>>> use score as a metric to absolutely cluster 'good' and 'not good' >>>>> matches. >>>>> >>>>> -- >>>>> Anshum Gupta >>>>> Naukri Labs! >>>>> http://ai-cafe.blogspot.com >>>>> >>>>> The facts expressed here belong to everybody, the opinions to me. The >>>>> distinction is yours to draw............ >>>>> >>>>> >>>>> On Thu, May 7, 2009 at 6:27 AM, Nate <n...@n4te.com> wrote: >>>>> >>>>>> Hi all, >>>>>> >>>>>> First, the problem I'm trying to solve: I have two folders, each >>>>>> containing files. I need to match files in one folder with files in >>>>>> the other. Eg: >>>>>> >>>>>> notes/Michael Jackson - Don't Stop 'till You Get Enough.notes >>>>>> songs/Michael Jackson Don't stop until you get enough.mp3 >>>>>> >>>>>> I provide the notes files, but the song files come from a user's music >>>>>> library, so often are not named well. I am attempting to use Lucene to >>>>>> find the most likely note file for each song file. >>>>>> >>>>>> I index the note files, then I use the StandardAnalyzer with carefully >>>>>> chosen stop words to search the index. The query uses each word in the >>>>>> song file name (w/o extension) as a term. Fuzzy matching is used for >>>>>> words with > 4 characters, and the fuzzy percentage is set to be 1 / >>>>>> termlength. This works ok so far, though I would love to hear opinions >>>>>> on any improvements I could make. This is my first use of Lucene, so >>>>>> I'm not sure I've chosen the best approach. >>>>>> >>>>>> The problem I'm having is: Sometimes there is a song file that has no >>>>>> matching note file. In this case I get back results with "low" scores, >>>>>> such as 0.2 or 0.05. A "really good" match gives me 7 or 8. I don't >>>>>> really understand what the scoring means, so I don't know what would >>>>>> be a reasonable threshold to ignore scores. >>>>>> >>>>>> I understand scores are not relevance percentages. I think the scores >>>>>> are only useful relative to other scores. Is this right? Are they only >>>>>> relative to scores from the same search, or from any search against >>>>>> the same index? How can I know if a score is "low", so I can ignore >>>>>> matches that aren't very good? >>>>>> >>>>>> Sorry if this has been discussed before. I have searched around a >>>>>> great deal and was unable to find a straight answer. >>>>>> >>>>>> Thanks! >>>>>> -Nate >>>>>> >>>>>> --------------------------------------------------------------------- >>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>>>> >>>>>> >>>>> >>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>> >>> >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>> >>> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org