Let me explain a scenario where I would need to add the n-grams at indexing time.
Consider two documents:

Document 1: "united states is .... United airlines operates in 50 states. United states government....."
Document 2: "united states is .... United airlines operates in 50 states. United some other word states"

If you consider the tf-idf weights of the individual terms "united" and "states", they would have exactly the same score in both documents. But the phrase "united states" should have a higher weight in Document 1. That higher weight can be achieved through bi-grams, which would include "united states" as a term of its own. My guess is that Lucene will retrieve both documents with equal score.

Thanks,
Rajesh

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Chris Hostetter
Sent: Monday, July 18, 2005 5:11 PM
To: java-user@lucene.apache.org
Subject: RE: n-gram indexing

Your intuition is right, but I can't think of any reason why you need to add the n-grams at indexing time -- or why using phrase queries would be a bad thing in this case.

When you get a multiword query, construct the n-grams of the query words as multiple phrase queries and search for a BooleanQuery of all those phrases (with bigger boosts given to longer phrases).

Using your example of "united states of america", generate a query that looks something like...

   united states america "united states"^10 "states america"^10 "united states america"^100

: Intuition behind adding n-grams is to boost naturally occurring larger
: phrases versus using phrase queries.
: For example, if I am searching for
: "united states of america", I want the search results to return the
: documents ordered as follows
:
: Rank 1 - Documents containing all the words occurring together
: Rank 2 - Documents containing the maximum number of words in the same
: sentence
: Rank 3 - Documents containing all the words, where some may appear in the
: same sentence and some may not
: Rank 4 - Documents containing at least one or two words
:
: If we have an n-gram index, most probably a document talking about "united
: states" gets preference over a document containing "united" and "states"
: separately. If I am correct, this can be achieved without using phrase
: queries. I am not sure if there is a better way to achieve the same
: effect.
:
: Thanks,
:
: Rajesh
:
:
: -----Original Message-----
: From: Andy Roberts [mailto:[EMAIL PROTECTED]
: Sent: Monday, July 18, 2005 5:56 PM
: To: java-user@lucene.apache.org
: Subject: Re: n-gram indexing
:
: On Monday 18 Jul 2005 21:27, Rajesh Munavalli wrote:
: > At what point do I add n-grams? Does the order in which I add n-grams
: > affect exact phrase queries later? My questions are
: >
: > (1) Should I add all the 1-grams followed by 2-grams followed by
: > 3-grams..etc sentence by sentence OR
: >
: > (2) Add all the 1-grams of the entire document first before starting
: > 2-grams for the entire document?
: >
: > What is the generally accepted notion of adding n-grams of a document?
: >
: > thanks,
: >
: > Rajesh
:
: I can't see any real advantage of storing n-grams explicitly. Just index
: the document and use phrase queries. Order is significant with phrase
: queries if I recall correctly, although you can use SpanNearQueries to
: look for unordered n-grams, although I don't know why you would want to!
:
: Perhaps if you explain a little more about what you are trying to
: achieve more generally, we can confirm that you don't need to mess with
: explicit indexing of n-grams.
:
: Andy
:
: ---------------------------------------------------------------------
: To unsubscribe, e-mail: [EMAIL PROTECTED]
: For additional commands, e-mail: [EMAIL PROTECTED]

-Hoss
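
For anyone who wants to try the two ideas in this thread, here is a minimal sketch in plain Python (no Lucene involved). It builds the kind of boosted query string Hoss describes, and it counts adjacent phrase occurrences in Rajesh's two example documents. The function names, the stopword list, and the 10**(n-1) boost rule are assumptions made for illustration; they were chosen only so that the output matches the example query above, and they say nothing about how Lucene itself tokenizes or scores.

```python
# Hypothetical helpers for illustration only. The stopword list and the
# 10 ** (n - 1) boost rule are assumptions picked to reproduce the example
# query in this thread; they are not Lucene defaults.
STOPWORDS = {"of", "the", "a", "an"}


def ngrams(tokens, n):
    """Return every contiguous window of n tokens, in document order."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


def build_boosted_query(text, max_n=3):
    """Build a query string of single terms plus quoted phrases.

    Each phrase is an n-gram (n >= 2) of the query words, and longer
    phrases receive larger boosts.
    """
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    clauses = list(tokens)  # plain, unboosted single terms
    for n in range(2, min(max_n, len(tokens)) + 1):
        boost = 10 ** (n - 1)
        for gram in ngrams(tokens, n):
            clauses.append('"%s"^%d' % (" ".join(gram), boost))
    return " ".join(clauses)


def phrase_count(text, phrase):
    """Count how often the words of phrase occur adjacently in text.

    Tokenization is a naive lowercase whitespace split with periods
    treated as spaces, which is enough for the two example documents.
    """
    tokens = text.lower().replace(".", " ").split()
    words = phrase.lower().split()
    return sum(1 for gram in ngrams(tokens, len(words)) if gram == words)


if __name__ == "__main__":
    print(build_boosted_query("united states of america"))
```

With Rajesh's two documents as input, phrase_count reports the same counts for "united" and for "states" in both, but a higher count for "united states" in Document 1. That adjacency difference is what the boosted phrase clauses in the generated query are meant to reward at search time, without storing n-grams in the index.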