Hi Otis, The original message text is:
> Hi, > > I'd appreciate if someone could explain the results I'm getting. > > I've written a simple custom analyzer that applies the NGramTokenFilter to > the token stream during indexing. It's never applied during searching. The > purpose of this is to match sub-words. > > Without the ngram filter, if I searched on the word 'postcode' it returns > 2 documents. If I searched on 'code' it returns 6 documents (with no > overlap on the postcode results). > > If I apply the ngram filter with min 1 and max 10, searching 'postcode' > returns the same 2 docs, while searching 'code' returns 9 docs. This > sort-of feels right. > > The problem comes when I set the min ngram size to 3, and the max to 5. > Searching 'postcode' returns no results (as expected), but searching > 'code' only returns 2 docs (the 2 normally returned by a 'postcode' > search). > > This last result for 'code' just doesn't seem correct - it should be > returning at least the 6 docs from the original search. > > I'd really appreciate some advice on what is going on with the ngram > filter. > > Thanks > Otis Gospodnetic wrote: > > This actually sounds bugish to me, but you removed the text from your > original email, so I don't know what context this was in. > > Otis > -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > > ----- Original Message ---- >> From: gaz77 <[EMAIL PROTECTED]> >> To: java-user@lucene.apache.org >> Sent: Friday, August 29, 2008 12:50:46 AM >> Subject: Re: Confused with NGRAM results >> >> >> Thanks for the pointer. >> >> I've gone into this in some depth, using the AnalyzerUtils class from the >> lucene in action book. >> >> It seems that the NGramTokenFilter is only processing part of the string >> that goes in. It stops tokenising the words part way through. That's why >> the >> documents weren't found in results. >> >> I've had a look at the source code, and I think it's because the next() >> function returns null when it hits a token smaller than the min ngram >> size. >> For example, if I set the minimum to 3, then a 2-character token will >> cause >> it to return null. >> >> I'm not sure if this is by design or a bug. either way, at least I know >> what's causing it now. >> >> Cheers >> >> >> >> -- >> View this message in context: >> http://www.nabble.com/Confused-with-NGRAM-results-tp19202310p19210665.html >> Sent from the Lucene - Java Users mailing list archive at Nabble.com. >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [EMAIL PROTECTED] >> For additional commands, e-mail: [EMAIL PROTECTED] > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > -- View this message in context: http://www.nabble.com/Confused-with-NGRAM-results-tp19202310p19253386.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]