Hi Rajeev, I wrote a filter for generating n-grams a while back; I intended to use it for statistics, but I guess you can also use it for search. I also thought of the "boosting effect" you describe when I implemented it, though I never actually tried whether it works that way.
It's in the Lucene bugzilla section: http://issues.apache.org/bugzilla/show_bug.cgi?id=35456 Let me know if you have any problems with it. If the lucene phrase queries honour position increments correctly, they should work out of the box with my code. (My code adds the unigram first with a position increment of 1, and then all n-grams starting with that unigram with an increment of 0.) Just make sure you use the same Analyzer for the queries as you do for the indexing, so that n-grams are added to the query as well. As regards the suggestion to rather use a very sloppy phrase or span query: I expect this approach to be faster, as it would only use TermQuery/BooleanQuery. You basically trade index size for search speed. If you get this to work with phrase queries, please add a test case demonstrating it. I saw something about n-gram queries in nutch, as far as I remember; does anyone know whether nutch uses n-grams to speed up phrase queries? Regards, Sebastian On Mon, Jul 18, 2005 at 02:27:28PM -0700, Rajesh Munavalli wrote: > At what point do I add n-grams? Does the order in which I add n-grams > affect exact phrase queries later? My questions are > > (1) Should I add all the 1-grams followed by 2-grams followed by > 3-grams..etc sentence by sentence OR > > (2) Add all the 1 grams of entire document first before starting 2-grams > for the entire document? -- Sebastian Kirsch <[EMAIL PROTECTED]> [http://www.sebastian-kirsch.org/] NOTE: New email address! Please update your address book. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]