Firstly, some context. I'm indexing a large set of mbox files which contain multiple email messages, each mbox file being related to a single issue. I'm therefore indexing each mbox as a single document, treating each individual mail as a section of the same document.

To control matching across mails I want to set the position increment. I'm trying to decide how best to do this - either by setting the increment between tokens within a single field or by using multiple instances of a field and setting the increment between each field instance.

Much of the information I've found related to position increments seems to refer to Lucene 3 and things seem to be quite a bit different in 4. I think I've figured out what is going on, but would appreciate someone confirming if I'm right or not.

It looks as if position increments can potentially occur in two places:

1. Between each token in a field. It looks like the PositionIncrementAttribute can be used to pass a value to the tokenizer that is processing a field.

2. Between multiple instances of a given field within a document. It looks like the getPositionIncrementGap method on Analyzer can be overridden to set the position increment between each field instance.
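To make the first mechanism concrete, here is a minimal plain-Java sketch of how token positions follow from per-token increments. It is a model of the arithmetic only and makes no Lucene calls. The class and method names are hypothetical, and it assumes the position counter starts at -1, so a first token with increment 1 lands at position 0.

```java
import java.util.Arrays;

/** Plain-Java model of per-token position increments; not Lucene code. */
final class IncrementModel {

    /**
     * Turns a sequence of position increments into absolute positions.
     * Assumption: the counter starts at -1, so the first token with
     * increment 1 is placed at position 0.
     */
    static int[] positionsFromIncrements(int[] increments) {
        int[] positions = new int[increments.length];
        int current = -1;
        for (int i = 0; i < increments.length; i++) {
            if (increments[i] < 0) {
                throw new IllegalArgumentException("increment must be >= 0");
            }
            current += increments[i];
            positions[i] = current;
        }
        return positions;
    }

    public static void main(String[] args) {
        // Three tokens from one mail, then two from the next; the first
        // token of the second mail carries a large increment (101).
        int[] increments = {1, 1, 1, 101, 1};
        System.out.println(Arrays.toString(positionsFromIncrements(increments)));
    }
}
```

In this model a single large increment on the first token of each mail is enough to push the mails apart within one field instance, provided something in the analysis chain actually emits that increment.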

However, from looking at the source, it appears that nearly all the tokenizers ignore any values passed in a PositionIncrementAttribute and only use PositionIncrementAttribute to notify other parts of the processing chain of the value they actually used (normally 1). There's a filter to manipulate inter-token positions (PositionFilter), but the documentation says this:

> Deprecated.
> (4.4) PositionFilter makes TokenStream graphs inconsistent which can cause highlighting bugs.

All of which makes it seem that manipulating the inter-token position increment isn't particularly useful.

The second mechanism - overriding Analyzer.getPositionIncrementGap - does seem to work, but that obviously means putting each segment of the mbox file into a new field instance. Is that the preferred approach?
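As a rough illustration of what the gap does to positions, here is a minimal plain-Java sketch with no Lucene dependency. It assumes every token advances the position by 1 and that the gap is added once at the start of each field instance after the first. The names `PositionGapModel`, `assignPositions` and `boundaryDistance` are hypothetical, and the `gap` parameter stands in for whatever getPositionIncrementGap returns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of positions across field instances; not Lucene code. */
final class PositionGapModel {

    /**
     * Assigns a position to every token of every field instance.
     * Assumptions: each token advances the position by 1, and `gap`
     * extra positions are skipped before every instance after the first.
     */
    static List<int[]> assignPositions(List<String[]> instances, int gap) {
        List<int[]> result = new ArrayList<>();
        int next = 0; // position the next token will receive
        for (int i = 0; i < instances.size(); i++) {
            if (i > 0) {
                next += gap; // stand-in for the analyzer's position increment gap
            }
            String[] tokens = instances.get(i);
            int[] positions = new int[tokens.length];
            for (int t = 0; t < tokens.length; t++) {
                positions[t] = next++;
            }
            result.add(positions);
        }
        return result;
    }

    /** Distance from the last token of one instance to the first token of the next (both non-empty). */
    static int boundaryDistance(List<int[]> positions, int instance) {
        int[] left = positions.get(instance);
        int[] right = positions.get(instance + 1);
        return right[0] - left[left.length - 1];
    }

    public static void main(String[] args) {
        // Each array models the tokens of one mail, indexed as one field instance.
        List<String[]> mails = Arrays.asList(
            new String[] {"disk", "is", "full"},
            new String[] {"please", "reboot"});
        for (int gap : new int[] {0, 100}) {
            List<int[]> p = assignPositions(mails, gap);
            System.out.println("gap=" + gap
                + " positions=" + Arrays.toString(p.get(0)) + " " + Arrays.toString(p.get(1))
                + " boundaryDistance=" + boundaryDistance(p, 0));
        }
    }
}
```

Under these assumptions a gap of 0 leaves "full" and "please" at adjacent positions, so a phrase could match across two mails, while a gap of 100 puts them 101 positions apart, which should keep any phrase or proximity match with a smaller slop inside a single mail.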

Thanks,

--
Alan Burlison
--
