RE: n-gram indexing

Chris Hostetter Mon, 01 Aug 2005 09:24:20 -0700

I think you can achieve what you want using span queries: a slop of 1
for "?" and Integer.MAX_VALUE for "*". You'll probably need to nest the
queries, though, so that "of" and "america" keep a slop of 0 between them.
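
To make the nesting concrete, here is a toy model in plain Java. It does
not call Lucene; the names SpanSketch, term and near are made up for this
sketch, and the slop rule (at most N tokens between two ordered parts) is my
assumption about how the span queries behave. In Lucene itself the inner and
outer pieces would be SpanNearQuery instances built over SpanTermQuery clauses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SpanSketch {
    /** Returns every [start, end) token range that this matcher accepts. */
    interface Spans {
        List<int[]> find(List<String> tokens);
    }

    /** Matches one token exactly (toy stand-in for a span term query). */
    static Spans term(String word) {
        return tokens -> {
            List<int[]> out = new ArrayList<>();
            for (int i = 0; i < tokens.size(); i++) {
                if (tokens.get(i).equals(word)) out.add(new int[] {i, i + 1});
            }
            return out;
        };
    }

    /** Ordered match: second starts after first ends, with at most slop tokens between. */
    static Spans near(Spans first, Spans second, int slop) {
        return tokens -> {
            List<int[]> out = new ArrayList<>();
            for (int[] a : first.find(tokens)) {
                for (int[] b : second.find(tokens)) {
                    int gap = b[0] - a[1];
                    if (gap >= 0 && gap <= slop) out.add(new int[] {a[0], b[1]});
                }
            }
            return out;
        };
    }

    /** Lower-cases and whitespace-splits the text, then checks for any match. */
    static boolean matches(Spans query, String text) {
        return !query.find(Arrays.asList(text.toLowerCase().split("\\s+"))).isEmpty();
    }

    public static void main(String[] args) {
        // Inner query: "of" immediately followed by "america" (slop 0).
        Spans ofAmerica = near(term("of"), term("america"), 0);
        // "united ? of america": at most one token in the gap.
        Spans single = near(term("united"), ofAmerica, 1);
        // "united * of america": any number of tokens in the gap.
        Spans multi = near(term("united"), ofAmerica, Integer.MAX_VALUE);

        System.out.println(matches(single, "united states of america"));      // true
        System.out.println(matches(single, "united arab states of america")); // false
        System.out.println(matches(multi, "united arab states of america"));  // true
    }
}
```

One thing the toy makes visible: slop is an upper bound, so the "?" query
also accepts "united of america" with nothing in the gap. If "?" has to match
exactly one word, the slop alone won't enforce that.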


: Date: Mon, 1 Aug 2005 09:15:02 -0700
: From: Rajesh Munavalli <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: RE: n-gram indexing
:
: Hi Chris,
:          The method you suggested is definitely a good solution. However
: there is one more reason I would like to do n-gram generation at indexing time.
: The examples below are a text search equivalent of what I am trying to do for a
: different kind of data. Anyway the example should convey the message.
:
: Example 1:
: Phrase query:  "united ? of america"
: Aim: "?" should match a single word.
: Possible Outcome: "?" should match "states" in the phrase "united states of
: america".
:
: Example 2:
: Phrase query: "united * america"
: Aim: "*" should match multiple words.
: Possible Outcome: "*" can match one or more words in the phrase "united
: states of america".
:
:          If I am correct, Lucene provides wildcard searches only within
: individual indexed terms. I want to extend this capability to the phrase level.
:
:          There might be other ways to do this that I am not aware of. Let me
: know your thoughts on this. I would really appreciate any suggestions you might
: have.
:
: thanks,
:
: Rajesh Munavalli
:
: -----Original Message-----
: From: [EMAIL PROTECTED] on behalf of Chris Hostetter
: Sent: Fri 7/29/2005 6:11 PM
: To: java-user@lucene.apache.org
: Subject: RE: n-gram indexing
:
: : Document 1:
: : "united states is .... United airlines operates in 50 states. United
: : states government....."
: :
: : Document 2:
: : "united states is .... United airlines operates in 50 states. United
: : some other word states"
: :
: : If you consider the tf-idf weight of individual terms "united" and
: : "states" they would have exactly the same score in both documents. But the
: : phrase "united states" should have higher weight in Document 1. The higher
: : weight can be achieved through bi-grams, which would include the phrase
: : "united states". My guess is that Lucene will retrieve both documents with
: : equal score.
:
: I believe your intuition is correct, but did you try what i suggested
: below? .. i don't believe you have to index all of the possible n-grams,
: just construct a boolean query containing the n-grams, and boost the
: larger n-grams (by some factor only experimenting will tell you).
:
: you should also try out SpanNearQuery ... I believe it scores things
: higher if they appear closer together -- but i'm not confident of my
: memory in that case.
:
: Seriously, try what i suggest below ... experimentation is the best way to
: answer questions like this...
:
: : -----Original Message-----
: : Sent: Monday, July 18, 2005 5:11 PM
: : To: java-user@lucene.apache.org
: : Subject: RE: n-gram indexing
: :
: :
: : Your intuition is right, but i can't think of any reason why you need to
: : add the n-grams at indexing time -- or why using phrase queries would be
: : a bad thing in this case.  When you get a multiword query, construct the
: : n-grams of the query words as multiple phrase queries and search for a
: : BooleanQuery of all those phrases (with bigger boosts given to longer
: : phrases)
: :
: : Using your example of "united states of america" generate a query that
: : looks something like...
: :
: :      united states america
: :      "united states"^10 "states america"^10
: :      "united states america"^100
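
The query above can be generated mechanically. Here is a rough sketch that
builds the same string from a list of query terms. NGramQuerySketch and build
are hypothetical names, the terms are assumed to be already tokenized with
stop words such as "of" removed, and the boost of 10^(n-1) per n-gram length
is only the pattern from the example, so the real factor needs
experimenting. The output uses the classic query parser syntax (quoted
phrases, ^boost).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NGramQuerySketch {
    /**
     * Emits every n-gram of the terms for n = 1..maxN. Single terms are left
     * bare; longer n-grams become quoted phrases boosted by boostBase^(n-1).
     */
    static String build(List<String> terms, int maxN, int boostBase) {
        List<String> clauses = new ArrayList<>();
        long boost = 1;
        for (int n = 1; n <= Math.min(maxN, terms.size()); n++) {
            for (int i = 0; i + n <= terms.size(); i++) {
                String gram = String.join(" ", terms.subList(i, i + n));
                clauses.add(n == 1 ? gram : "\"" + gram + "\"^" + boost);
            }
            boost *= boostBase; // longer phrases get a larger boost
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(build(Arrays.asList("united", "states", "america"), 3, 10));
        // united states america "united states"^10 "states america"^10 "united states america"^100
    }
}
```

The resulting string could be handed to a query parser, or the same loops
could add PhraseQuery clauses to a BooleanQuery directly.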
:
:
: -Hoss
:
:
: ---------------------------------------------------------------------
: To unsubscribe, e-mail: [EMAIL PROTECTED]
: For additional commands, e-mail: [EMAIL PROTECTED]
:
:
:



-Hoss


