Your understanding is correct: there are two ways to affect the
indexed position.

Either approach would work, but with the single-field approach the
challenge is writing a TokenFilter that knows when one chunk has
ended, so that it can set the position increment on the next token.
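
To make that bookkeeping concrete, here is a minimal sketch in plain Java with no Lucene dependency. It assumes a sentinel token is injected between mails before analysis; the sentinel `<<MAIL_BREAK>>`, the class name and the method name are all hypothetical. A real TokenFilter would set the value through PositionIncrementAttribute rather than return a list, so treat this only as a model of the logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model (not Lucene code) of a filter that turns chunk boundaries into position gaps. */
final class BoundaryGapSketch {
    /** Hypothetical sentinel, assumed to be injected between mails before analysis. */
    static final String BOUNDARY = "<<MAIL_BREAK>>";

    /** A surviving token together with the position increment it would carry. */
    static final class Tok {
        final String term;
        final int increment;

        Tok(String term, int increment) {
            this.term = term;
            this.increment = increment;
        }
    }

    /** Drops each sentinel and adds {@code gap} to the increment of the token that follows it. */
    static List<Tok> applyGaps(List<String> terms, int gap) {
        List<Tok> out = new ArrayList<>();
        int pending = 0; // extra increment owed to the next real token
        for (String term : terms) {
            if (BOUNDARY.equals(term)) {
                pending += gap; // remember the boundary, emit nothing
            } else {
                out.add(new Tok(term, 1 + pending));
                pending = 0;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("disk", "full", BOUNDARY, "fixed", "now");
        for (Tok t : applyGaps(terms, 100)) {
            System.out.println(t.term + " +" + t.increment);
        }
    }
}
```

The awkward part in practice is the assumption itself: something upstream has to mark where each mail ends, and the marker has to survive tokenization long enough for the filter to see it.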

I think it'd be easier to just add multiple field instances.
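
As a rough illustration of what the gap buys you, the sketch below models position assignment across several instances of one field in plain Java. It is not Lucene code: the class and method names are hypothetical, every token is assumed to have an increment of 1, and the exact positions Lucene assigns may differ slightly. The `gap` parameter stands in for the value returned by an overridden Analyzer.getPositionIncrementGap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model (not Lucene code) of token positions across several instances of one field. */
final class FieldInstanceGapSketch {
    /**
     * Assigns a position to every token. Each token advances the position by 1, and
     * {@code gap} extra positions are skipped before any instance that follows
     * already-indexed tokens.
     */
    static List<Integer> positions(List<List<String>> instances, int gap) {
        List<Integer> out = new ArrayList<>();
        int position = -1; // so that the very first token lands on position 0
        for (List<String> instance : instances) {
            if (!out.isEmpty()) {
                position += gap; // the inter-instance gap
            }
            for (int i = 0; i < instance.size(); i++) {
                position += 1; // assumed per-token increment of 1
                out.add(position);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> mails = Arrays.asList(
            Arrays.asList("disk", "full"),   // first mail
            Arrays.asList("fixed", "now"));  // second mail
        System.out.println(positions(mails, 100));
    }
}
```

In this model the last token of one mail and the first token of the next end up more than `gap` positions apart, so a phrase or proximity query whose allowed distance is smaller than the gap cannot match across two mails.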

Mike McCandless

http://blog.mikemccandless.com


On Sun, Sep 15, 2013 at 5:14 AM, Alan Burlison <alan.burli...@gmail.com> wrote:
> Firstly, some context. I'm indexing a large set of mbox files which contain
> multiple email messages, each mbox file being related to a single issue. I'm
> therefore indexing each mbox as a single document, treating each individual
> mail as a section of the same document.
>
> To control matching across mails I want to set the position increment. I'm
> trying to decide how best to do this - either by setting the increment
> between tokens within a single field or by using multiple instances of a
> field and setting the increment between each field instance.
>
> Much of the information I've found related to position increments seems to
> refer to Lucene 3 and things seem to be quite a bit different in 4. I think
> I've figured out what is going on, but would appreciate someone confirming
> if I'm right or not.
>
> It looks as if position increments can potentially occur in two places:
>
> 1. Between each token in a field. It looks like the
> PositionIncrementAttribute can be used to pass a value to the tokenizer
> that is processing a field.
>
> 2. Between multiple instances of a given field within a document. It looks
> like the getPositionIncrementGap method on Analyzer can be overridden to set
> the position increment between each field instance.
>
> However, from looking at the source, it appears that nearly all the
> tokenizers ignore any values passed in a PositionIncrementAttribute and only
> use PositionIncrementAttribute to notify other parts of the processing chain
> of the value they actually used (normally 1). There's a filter to manipulate
> inter-token positions (PositionFilter), but the documentation says this:
>
> Deprecated.
> (4.4) PositionFilter makes TokenStream graphs inconsistent which can cause
> highlighting bugs.
>
> All of which makes it seem that manipulating the inter-token position
> increment isn't particularly useful.
>
> The second mechanism - overriding Analyzer.getPositionIncrementGap - does
> seem to work, but that obviously means putting each segment of the mbox file
> into a new field instance. Is that the preferred approach?
>
> Thanks,
>
> --
> Alan Burlison
> --
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
