Thanks for the pointer. I've gone into this in some depth, using the AnalyzerUtils class from the Lucene in Action book.
It seems that the NGramTokenFilter only processes part of the string that goes in: it stops tokenising the words partway through, which is why those documents weren't showing up in the results. I've had a look at the source code, and I think it's because the next() function returns null as soon as it hits a token smaller than the minimum n-gram size. For example, if I set the minimum to 3, then a 2-character token will cause it to return null, and nothing after that token gets emitted. I'm not sure if this is by design or a bug; either way, at least I know what's causing it now.

Cheers
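P.S. In case anyone wants to reproduce it, this is roughly what I'm running (the class name and sample text are just made up for illustration; it's against the old no-arg TokenStream.next() API, with NGramTokenFilter from contrib/analyzers):

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;

public class NGramStopDemo {
  public static void main(String[] args) throws Exception {
    // "of" is only 2 characters, below minGram = 3
    TokenStream ts = new NGramTokenFilter(
        new WhitespaceTokenizer(new StringReader("lucene of indexing")),
        3, 3); // minGram = 3, maxGram = 3
    Token t;
    while ((t = ts.next()) != null) {
      System.out.println(t.termText());
    }
    // On the version I'm testing, this prints the 3-grams of "lucene"
    // (luc, uce, cen, ene) and then stops: next() returns null at "of",
    // so "indexing" is never tokenised at all.
  }
}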