Your understanding is correct: there are two ways to affect the indexed position.
Either approach would work, but if you take the single-field approach, the
challenge is writing a TokenFilter that knows when one chunk has ended so
that it can set the position increment. I think it'd be easier to just add
multiple field instances.

Mike McCandless

http://blog.mikemccandless.com

On Sun, Sep 15, 2013 at 5:14 AM, Alan Burlison <alan.burli...@gmail.com> wrote:
> Firstly, some context. I'm indexing a large set of mbox files which contain
> multiple email messages, each mbox file being related to a single issue. I'm
> therefore indexing each mbox as a single document, treating each individual
> mail as a section of the same document.
>
> To control matching across mails I want to set the position increment. I'm
> trying to decide how best to do this - either by setting the increment
> between tokens within a single field or by using multiple instances of a
> field and setting the increment between each field instance.
>
> Much of the information I've found related to position increments seems to
> refer to Lucene 3 and things seem to be quite a bit different in 4. I think
> I've figured out what is going on, but would appreciate someone confirming
> if I'm right or not.
>
> It looks as if position increments can potentially occur in two places:
>
> 1. Between each token in a field. It looks like the
> PositionIncrementAttribute can be used to pass a value to the tokenizer
> that is processing a field.
>
> 2. Between multiple instances of a given field within a document. It looks
> like the getPositionIncrementGap method on Analyzer can be overridden to set
> the position increment between each field instance.
>
> However, from looking at the source, it appears that nearly all the
> tokenizers ignore any values passed in a PositionIncrementAttribute and only
> use PositionIncrementAttribute to notify other parts of the processing chain
> of the value they actually used (normally 1). There's a filter to manipulate
> inter-token positions (PositionFilter), but the documentation says this:
>
> Deprecated.
> (4.4) PositionFilter makes TokenStream graphs inconsistent which can cause
> highlighting bugs.
>
> All of which makes it seem that manipulating the inter-token position
> increment isn't particularly useful.
>
> The second mechanism - overriding Analyzer.getPositionIncrementGap - does
> seem to work, but that obviously means putting each segment of the mbox file
> into a new field instance. Is that the preferred approach?
>
> Thanks,
>
> --
> Alan Burlison
> --
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
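
For illustration, here is a minimal plain-Java sketch of how a position
increment gap keeps the field instances of one document apart. It does not use
Lucene at all: the class name and assignPositions are hypothetical, and the
arithmetic (each token advances the position by 1, and the gap is added once
before every field instance after the first) is a simplified model, so exact
positions in a real index may differ. In Lucene the gap value itself would come
from overriding Analyzer.getPositionIncrementGap, as discussed in the quoted
message.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of position assignment across field instances.
// Hypothetical helper for illustration only; not part of any Lucene API.
class PositionGapSketch {

    /**
     * Returns the position of every token, grouped by field instance.
     * Assumption: every token has a position increment of 1, and "gap"
     * is added once before each instance after the first.
     */
    public static List<List<Integer>> assignPositions(List<String[]> instances, int gap) {
        List<List<Integer>> result = new ArrayList<>();
        int position = -1;          // so that the very first token lands on 0
        boolean firstInstance = true;
        for (String[] tokens : instances) {
            if (!firstInstance) {
                position += gap;    // extra distance between field instances
            }
            firstInstance = false;
            List<Integer> positions = new ArrayList<>();
            for (int i = 0; i < tokens.length; i++) {
                position += 1;      // normal increment between adjacent tokens
                positions.add(position);
            }
            result.add(positions);
        }
        return result;
    }

    public static void main(String[] args) {
        // Two mails from the same mbox, each added as its own field instance.
        List<String[]> mails = new ArrayList<>();
        mails.add(new String[] {"disk", "is", "full"});
        mails.add(new String[] {"please", "clean", "up"});
        System.out.println(assignPositions(mails, 100));
    }
}
```

In this model a gap of 100 puts the two mails at positions 0-2 and 103-105, so
a proximity match with a window well under 100 should not span two mails.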