Re: n-gram indexing

Sebastian Marius Kirsch Sun, 24 Jul 2005 10:44:18 -0700

Hi Rajeev,

I wrote a filter for generating n-grams a while back; I intended to
use it for statistics, but I guess you can also use it for search. I
also thought of the "boosting effect" you describe when I implemented
it, though I never actually tried whether it works that way.

It's in the Lucene bugzilla section:

http://issues.apache.org/bugzilla/show_bug.cgi?id=35456

Let me know if you have any problems with it.

If the lucene phrase queries honour position increments correctly,
they should work out of the box with my code. (My code adds the
unigram first with a position increment of 1, and then all n-grams
starting with that unigram with an increment of 0.) Just make sure you
use the same Analyzer for the queries as you do for the indexing, so
that n-grams are added to the query as well.

As regards the suggestion to rather use a very sloppy phrase or span
query: I expect this approach to be faster, as it would only use
TermQuery/BooleanQuery. You basically trade index size for search
speed.

If you get this to work with phrase queries, please add a test case
demonstrating it.

I saw something about n-gram queries in nutch, as far as I remember;
does anyone know whether nutch uses n-grams to speed up phrase
queries?

Regards, Sebastian

On Mon, Jul 18, 2005 at 02:27:28PM -0700, Rajesh Munavalli wrote:
> At what point do I add n-grams? Does the order in which I add n-grams
> affect exact phrase queries later? My questions are
> 
> (1) Should I add all the 1-grams followed by 2-grams followed by
> 3-grams..etc sentence by sentence OR
> 
> (2) Add all the 1 grams of entire document first before starting 2-grams
> for the entire document?

-- 
Sebastian Kirsch <[EMAIL PROTECTED]> [http://www.sebastian-kirsch.org/]

NOTE: New email address! Please update your address book.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: n-gram indexing

Reply via email to