Well, here is the thing - I don't necessarily want to get results per paragraph - which your code will do just fine for. I want to have my article titles and sub-headers in the main text field, after I have duplicated them to give the words they contain more weight. So I don't want to return a higher PositionIncrement for every instance of a field, just for those I'm interested in (title/headers). Can this be done somehow without injecting a "magic string", as Chris called it?

Just so I'm being as clear as possible, here is pseudo-code of my case:
doc.add("field", "my title", blah, blah)          /// I want to create a proximity gap here
doc.add("field", "word1 word2 word3", blah, blah)
doc.add("field", "word4 word5 word6", blah, blah)
doc.add("field", "my sub-header", blah, blah)     /// here as well
doc.add("field", "word7 word8 word9", blah, blah)
IndexWriter.add(doc)

>>> You can simply subclass whichever one you choose and override
>>> getPositionIncrementGap

getPositionIncrementGap is a member function of StandardAnalyzer, not StandardTokenizer. Since my use case is a bit different from what you initially thought, I think I will wait for your thoughts on this. So far I have concluded that I will have to perform a check in StandardTokenizer::next for the "magic string", and if it is found, set the current Token there to have a PositionIncrement of about 500. Please let me know if there is a better way to do that (ideally without magic...).

You have pretty much understood my use case for position increment 0 - but I thought this was possible to do by customizing a Scorer? I haven't gotten that deep into Lucene myself (yet)... I'm not entirely sure I understand the consequences of storing more than one Term in the same position. What I understood from your explanation is that if I store both "b" and "c" at the same position x, Lucene will get to x for both "b" and "c", meaning this could save me query inflation, or, as I first suggested, auto-apply synonyms. The only question is, I guess, are there any drawbacks to using this?

Thanks,

Itamar.

-----Original Message-----
From: Erick Erickson [mailto:[EMAIL PROTECTED]
Sent: Monday, March 31, 2008 4:25 PM
To: java-user@lucene.apache.org
Subject: Re: setPositionIncrement questions

See below...

On Mon, Mar 31, 2008 at 7:02 AM, Itamar Syn-Hershko <[EMAIL PROTECTED]> wrote:
>
> Chris,
>
> Thanks for your input.
>
> Please let me make sure that I get this right: while iterating through
> the words in a document, I can use my tokenizer to
> setPositionIncrement(150) on a specific token, which would make it
> more distant from the previous token than it would otherwise be. The
> next token will already have a position increment of 1 and therefore
> will immediately follow that token, with no extra handling. If I get
> this right, the best way to achieve that is by appending a predefined
> string like $$$, one that will not occur accidentally in my documents,
> and have my tokenizer set the position increment as well instead of
> just tokenizing upon it.

Not really. Somewhere in the indexing code is something that behaves like
this... say you have the following lines...

doc.add("field", "word1 word2 word3", blah, blah)
doc.add("field", "word4 word5 word6", blah, blah)
doc.add("field", "word7 word8 word9", blah, blah)
IndexWriter.add(doc)

Now say your analyzer returns 100 for getPositionIncrementGap. The words
will have the following offsets:

word1 - 0
word2 - 1
word3 - 2
word4 - 103 (perhaps 102, but you get the idea)
word5 - 104
word6 - 105
word7 - 206
word8 - 207
word9 - 208

There's no need to have any special tokens for this to occur.

>
> >>> Lucene will call the "getPositionIncrementGap" method on your
> >>> Analyzer to determine how much positionIncrement to put in between
> >>> the last token of the first Field and the first token of the second
> >>> Field -- so you could just pass each paragraph as a separate Field
> >>> instance
>
> This sounds good, but is risky, since I will have to concatenate the
> paragraphs that I DO want to have proximity data in between, and if I
> forget to, or accidentally don't do that, this will corrupt
> proximity-based searches.
> My documents can become very big as well. I guess what I was looking
> for was a simpler way - say, tell Lucene when I do doc.add(new Field)
> to set the position increment for the last token.
> The "magic char
> sequence" will do, but I was wondering if there is a way to do that
> without amending my Tokenizer?
>

No, you must deal with your tokenizer, but this is pretty trivial. You can
simply subclass whichever one you choose and override
getPositionIncrementGap. This seems no riskier than adding your special
token, since you have to deal with differentiating between paragraphs you
*do* want to be adjacent and ones you *don't* in that case as well. Or am I
missing something?

As to the size of documents, somewhere you do need to worry about exceeding
a position of 2^31, but if that's really an issue you have other problems
<G>. Although this somewhat depends upon how far apart you need the
paragraphs to be. Are you going to allow proximity searches of 10,000,000?
Or 10?

>
> >>> it means the words appear at the same position
>
> ... And what does this mean exactly? How can this affect standard
> searches?
> What I might do with this is store stems side-by-side with the
> original word. From what I've heard so far this is NOT how you do this
> for English texts - you rather store them in a different field; why is
> that? I thought if you store them side-by-side you could write a
> Scorer (or similar) that will return all relevant results for the stem
> of a given word, boosting words with the exact same syntax more than
> others. Any ideas on that?
>

I don't really understand what you're trying to accomplish; a use case
would help, so this may be totally off base.... The words "in the same
position" mean that if you store, say, blivet and blort at the same
position, and the next token is bonkers, then the following two matches
will be found:

"blivet bonkers"
"blort bonkers"

(these as exact phrases). You can answer much of this by getting a copy of
Luke and examining test indexes you build.

To boost exact matches, you have to do some fancy dancing.
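[Aside for the archives: the "same position" behavior Erick describes can be checked with a little standalone code. The following is a toy model of (term, position) postings - my own illustration, not Lucene's internals - showing why two terms stored at the same position both satisfy an exact-phrase match with the following token:]

```java
import java.util.List;

// Toy model (an illustration, not Lucene internals) of terms indexed with
// explicit positions, to show why two terms sharing a position both
// phrase-match against the token that follows them.
public class SamePositionDemo {

    // One posting: a term plus the absolute position it was indexed at.
    static class Posting {
        final String term;
        final int position;
        Posting(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    // An exact two-word phrase matches when t1 sits at some position p
    // and t2 sits at p + 1 -- that is all "exact phrase" means here.
    static boolean phraseMatch(List<Posting> postings, String t1, String t2) {
        for (Posting a : postings) {
            for (Posting b : postings) {
                if (a.term.equals(t1) && b.term.equals(t2)
                        && b.position == a.position + 1) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "blivet" at position 0, "blort" added with increment 0 (same
        // slot), then "bonkers" at position 1.
        List<Posting> index = List.of(
                new Posting("blivet", 0),
                new Posting("blort", 0),
                new Posting("bonkers", 1));
        System.out.println(phraseMatch(index, "blivet", "bonkers")); // true
        System.out.println(phraseMatch(index, "blort", "bonkers"));  // true
    }
}
```

Both phrases match because "blort" was written into the same position slot as "blivet", exactly as a position increment of 0 would do at index time.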
For instance, you could store the original word with a special token (say $)
at the end, and *also* the stemmed version at the same position. Then you
have to mangle your queries to produce something like
(word$^10 OR <stemmed version of word>) for each search term.

Best
Erick

>
> Itamar.
>
> -----Original Message-----
> From: Chris Hostetter [mailto:[EMAIL PROTECTED]
> Sent: Sunday, March 30, 2008 8:56 AM
> To: Lucene Users
> Subject: Re: setPositionIncrement questions
>
>
> : Breaking proximity data has been discussed several times before, and
> : the conclusion was that setPositionIncrement is the way to go. In
> : regard to it:
> :
> : 1. Where should it be called exactly to create the gap properly?
>
> any part of your Analyzer can set the position increment on any token
> to indicate how far after the previous token it should be.
>
> : 2. Is there a way to call it directly somehow while indexing (e.g.
> : after adding a new paragraph to an existing field) instead of
> : appending $$$ for example after the new string I'm indexing, and
> : having to update my tokenizer and filters so they will retain the
> : $$$ chars, indicating the gap request?
>
> if you add multiple Fields with the same name, Lucene will call the
> "getPositionIncrementGap" method on your Analyzer to determine how
> much positionIncrement to put in between the last token of the first
> Field and the first token of the second Field -- so you could just
> pass each paragraph as a separate Field instance .. alternately you
> can have a single Field instance, and your Analyzer can use whatever
> mechanism it wants to decide to set the position increment to
> something high (a line break, a magic char sequence you put in the
> string, ... whatever you want)
>
> : 3. What is the recommended value to pass setPositionIncrement to
> : create a reasonable gap, and not risk large documents being indexed
> : improperly (I mean, is there some sort of high bound for the
> : position value?).
>
> MAX_INT ..
> pick gaps based on your data and the queries you expect (if
> you want gaps between paragraphs, and your paragraphs tend to be under
> 200 words long, make the gap 500 so "lucene java"~300 can find those
> words in the same paragraph, but can never span multiple paragraphs)
>
> : 4. What are the consequences of setting PositionIncrement to 0? Does
> : this mean I can index synonyms or stems aside of the "real" words
> : without risking data corruption?
>
> it means the words appear at the same position - synonyms is a great
> example of this use case.
>
>
> -Hoss
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
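[P.S. Erick's offset arithmetic earlier in the thread can be reproduced with a few lines of plain Java - no Lucene required. This only simulates the position bookkeeping (the method name and structure are my own), using the gap of 100 and the word1..word9 tokens from his example:]

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation (not Lucene code) of how token positions
// accumulate when getPositionIncrementGap is applied between successive
// values of the same field.
public class PositionGapDemo {

    // Maps each token to its absolute position: tokens advance by the
    // default increment of 1, and an extra gap is inserted between
    // field instances.
    static Map<String, Integer> positions(List<String> fieldValues, int gap) {
        Map<String, Integer> pos = new LinkedHashMap<>();
        int current = -1;          // position "before" the first token
        boolean first = true;
        for (String value : fieldValues) {
            if (!first) current += gap;  // gap between field instances
            first = false;
            for (String token : value.split("\\s+")) {
                current += 1;            // default position increment of 1
                pos.put(token, current);
            }
        }
        return pos;
    }

    public static void main(String[] args) {
        Map<String, Integer> p = positions(List.of(
                "word1 word2 word3",
                "word4 word5 word6",
                "word7 word8 word9"), 100);
        // Matches Erick's table: word1=0 .. word3=2, word4=103, word9=208
        System.out.println(p);
    }
}
```

With a gap of 100 this yields exactly the offsets in the thread (word4 at 103, word7 at 206), which is why a query like "lucene java"~50 can never match across two field values.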
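[And a rough sketch of the query-side half of the "fancy dancing" Erick suggests: rewriting each user term into (term$^10 OR stem). The stem() below is only a stand-in that strips a trailing "s" - real code would call whatever stemmer ran at index time, and the $ marker and ^10 boost are just the values from his example:]

```java
// Sketch of mangling a query term into "(term$^10 OR stem)" so the exact
// surface form (indexed with a trailing $) outscores the stemmed form.
public class BoostExactDemo {

    // Stand-in stemmer: strips a trailing "s". Substitute the same
    // stemmer used by the indexing analyzer in real code.
    static String stem(String word) {
        return word.endsWith("s") ? word.substring(0, word.length() - 1) : word;
    }

    // Builds the boosted disjunction for one search term.
    static String mangle(String term) {
        return "(" + term + "$^10 OR " + stem(term) + ")";
    }

    public static void main(String[] args) {
        System.out.println(mangle("words")); // (words$^10 OR word)
    }
}
```

The index side of the trick is the mirror image: emit "words$" and "word" at the same position (increment 0), so either clause of the mangled query lands on the same slot.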