Firstly, some context. I'm indexing a large set of mbox files which
contain multiple email messages, each mbox file being related to a
single issue. I'm therefore indexing each mbox as a single document,
treating each individual mail as a section of the same document.
To control matching across mails I want to set the position increment.
I'm trying to decide how best to do this - either by setting the
increment between tokens within a single field or by using multiple
instances of a field and setting the increment between each field instance.
Much of the information I've found related to position increments seems
to refer to Lucene 3 and things seem to be quite a bit different in 4. I
think I've figured out what is going on, but would appreciate someone
confirming if I'm right or not.
It looks as if position increments can potentially occur in two places:
1. Between each token in a field. It looks like the
PositionIncrementAttribute can be used to pass a value to the
tokenizer that is processing a field.
2. Between multiple instances of a given field within a document. It
looks like the getPositionIncrementGap method on Analyzer can be
overridden to set the position increment between each field instance.
However, from looking at the source, it appears that nearly all the
tokenizers ignore any values passed in a PositionIncrementAttribute and
only use PositionIncrementAttribute to notify other parts of the
processing chain of the value they actually used (normally 1). There's a
filter to manipulate inter-token positions (PositionFilter), but the
documentation says this:
    Deprecated. (4.4) PositionFilter makes TokenStream graphs inconsistent
    which can cause highlighting bugs.
All of which makes it seem that manipulating the inter-token position
increment isn't particularly useful.
The second mechanism - overriding Analyzer.getPositionIncrementGap -
does seem to work, but that obviously means putting each segment of the
mbox file into a new field instance. Is that the preferred approach?
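To make the second option concrete, here is a simplified model of how I
understand positions to be assigned when a field has several instances. It
is plain Java, not Lucene code: the class PositionGapSketch and the helper
assignPositions are hypothetical names of mine. It assumes every token has
an increment of 1 and that the gap is simply added to the running position
between instances. In Lucene the gap value would come from an overridden
Analyzer.getPositionIncrementGap(String fieldName).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model only: mimics the position arithmetic, does not call Lucene.
final class PositionGapSketch {

    /**
     * Hypothetical helper. Each String[] stands for the tokens of one field
     * instance (one mail). Returns the position of every token, per instance.
     */
    static List<List<Integer>> assignPositions(List<String[]> instances, int gap) {
        List<List<Integer>> result = new ArrayList<>();
        int position = -1;          // so the first token lands on position 0
        boolean firstInstance = true;
        for (String[] tokens : instances) {
            if (!firstInstance) {
                position += gap;    // the gap is applied between instances only
            }
            firstInstance = false;
            List<Integer> positions = new ArrayList<>();
            for (int i = 0; i < tokens.length; i++) {
                position += 1;      // assumed per-token increment of 1
                positions.add(position);
            }
            result.add(positions);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> mails = Arrays.asList(
                new String[] {"disk", "full"},
                new String[] {"reboot", "fixed", "it"});
        // prints [[0, 1], [102, 103, 104]]
        System.out.println(assignPositions(mails, 100));
    }
}
```

If that model is right, then with a gap of 100 the last token of the first
mail sits at position 1 and the first token of the second at 102, so a phrase
or span query with a slop well below 100 should not match across the boundary
between two mails.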
Thanks,
--
Alan Burlison