I think you can achieve what you want using Span queries with slop of 1 for ? or MAX_INT for * -- but you'll probably need to make some nested queries in order to have 0 slop between "of" and "america"
: Date: Mon, 1 Aug 2005 09:15:02 -0700 : From: Rajesh Munavalli <[EMAIL PROTECTED]> : Reply-To: java-user@lucene.apache.org : To: java-user@lucene.apache.org : Subject: RE: n-gram indexing : : Hi Chris, : The method you suggested is definitely a good solution. However there is one more reason I would like to do n-gram generation at indexing time. The examples below are a text search equivalent of what I am trying to do for a different kind of data. Anyway the example should convey the message. : : Example 1: : Phrase query: "united ? of america" : Aim: "?" should match a single word. : Possible Outcome: "?" should match "states" in the phrase "united states of america". : : Example 2: : Phrase query: "united * america" : Aim: "*" should match multiple words. : Possible Outcome: "*" can match one or more words in the phrase "united states of america". : : If I am correct, Lucene provides wild card searches only in the terms indexed. I want to extend this capability to phrase level. : : There might be other ways to do which I am not aware of. Let me know what your thoughts on this. I would really appreciate any suggestions you might have. : : thanks, : : Rajesh Munavalli : : -----Original Message----- : From: [EMAIL PROTECTED] on behalf of Chris Hostetter : Sent: Fri 7/29/2005 6:11 PM : To: java-user@lucene.apache.org : Subject: RE: n-gram indexing : : : Document 1: : : "united states is .... United airlines operates in 50 states. United : : states government....." : : : : Document 2: : : "united states is .... United airlines operates in 50 states. United : : some other word states" : : : : If you consider the tf-idf weight of individual terms "united" and : : "states" they would have exact score in both the documents. But the term : : "united states" should have higher weight in Document 1. The higher : : weight can be achieved through bi-grams which would include the word : : "united states". My guess is, lucene will retrieve both documents with : : equal score. : : I believe yyour intuition is correct, but did you try what i suggested : below? .. i don't believe you have to index all of hte possible n-grams, : just construct a boolean query containing the n-grams, and boost the : larger n-grams (by some factor only experimenting will tell you). : : you should also try out SpanNearQuery ... I believe it scores thingsw : higher if they appear closer together -- but i'm not confident of my : memory in htat case. : : Seriously, try what i suggest below ... experimentation is the best way to : answer questions like this... : : : -----Original Message----- : : Sent: Monday, July 18, 2005 5:11 PM : : To: java-user@lucene.apache.org : : Subject: RE: n-gram indexing : : : : : : Your intuition is right, but i can't think of any reason why you need to : : add the n-grams at indexing time -- or why using phrase queries would be : : a bad thing in this case. When you get a multiword query, construct the : : n-grams of the query words as multiple phrase queries and search for a : : BooleanQuery of all those phrases (with bigger boosts given to longer : : phrases) : : : : Using your example of "united states of america" generate a query that : : looks something like... : : : : united states america : : "united states"^10 "states america"^10 : : "united states america"^100 : : : -Hoss : : : --------------------------------------------------------------------- : To unsubscribe, e-mail: [EMAIL PROTECTED] : For additional commands, e-mail: [EMAIL PROTECTED] : : : -Hoss --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]