[
https://issues.apache.org/jira/browse/LUCENE-2529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Smiley updated LUCENE-2529:
---------------------------------
Attachment: LUCENE-2529_skip_posIncr_for_1st_token.patch
(patch updated)
bq. Maybe, instead of that +1 inside IW, we change the default posIncrGap to 1?
I had the +1 for the gap (i.e. between values) level because I was trying to
get a blank value (or a value consisting of stop words) to bump the position
counter as well. I've been tinkering with this a bit more and I realize now
that I can still achieve my aims without doing that, but it's still necessary
to ignore the very first position increment of the very first value -- only.
See the new patch. I think the result should now be even more amenable to
others (i.e. it is the least disruptive), since a position increment that
someone sets on the first token of subsequent values will still be honored.
bq. Can you spell out examples of how the indexed positions will change w/ this
patch - I'm having trouble visualizing this. EG for a single valued field,
multi-valued, etc.
A single valued field is unaffected. The first emitted token (if there are any
at all) will remain at position 0 no matter what the analyzer does. This is
also true for the first value of a multi-valued field if there is any.
For multi-valued fields, it is now always the case that the first token of
subsequent values (i.e. not the first value) will be the previous position (0
if none) + the gap + the first position increment of this value (typically 1).
This is consistent and sensible. Formerly,
if the first value was a blank value (or a value consisting of stop words),
then you'd get 1 less than what you get now. I hope the test I modified as
part of this patch makes this more clear; I had to increment the tested
positions by 1.
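
To make the rule above concrete, here is a minimal stand-alone sketch in Java.
It is not the patch and not Lucene code: the class PositionRuleSketch, the
method positions, and the representation of each value as an array of position
increments are all hypothetical, chosen only to model the behaviour described
in this comment.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the position rule described above; not Lucene code. */
final class PositionRuleSketch {

    /**
     * @param values one int[] per field value, holding the position increment
     *               of each token emitted for that value (empty = blank value)
     * @param gap    the position increment gap applied between values
     * @return the absolute position assigned to each token, in order
     */
    static List<Integer> positions(int[][] values, int gap) {
        List<Integer> out = new ArrayList<>();
        int position = 0;
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                // The gap applies between values, even after a blank value.
                position += gap;
            }
            for (int j = 0; j < values[i].length; j++) {
                // Only the very first token of the very first value
                // ignores its position increment and stays at 0.
                if (!(i == 0 && j == 0)) {
                    position += values[i][j];
                }
                out.add(position);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int gap = 100;
        // Two tokens, then one token: 0, 1, then 1 + 100 + 1 = 102.
        System.out.println(positions(new int[][] {{1, 1}, {1}}, gap));
        // Blank first value: the first real token lands at 0 + 100 + 1 = 101.
        System.out.println(positions(new int[][] {{}, {1, 1}}, gap));
    }
}
```

With a gap of 100, the first call yields positions 0, 1, 102 and the second
yields 101, 102, which matches the "previous position + gap + first increment"
description for values after the first.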
As I said before, I also think that the code is more clear since it no longer
has that conditional pre-decrement and post-increment of the position that was
probably only understood by you. And I did away with the weird "+1" at the gap
in my previous patch.
bq. Man I really want to get this logic out of indexer and into the analysis
chain (LUCENE-2450 enables this). How multi-valued streams should handle the
transition from one value to another shouldn't be inside the indexer... and
maybe (someday) tokens should store their position (not the gap) so we don't
have this cryptic logic inside the indexer..
That sounds great. There are other strategies of messing with position
increments that I simply can't do without hacking this code further. For
example, it would be neat if the first token of a value could be devised to
start at posIncGap*valueIndex (ex: 0, 1000, 2000, ...) so that Span queries
could determine which value index a term matched against by looking at its
position (ex: 3092: divide by 1000, drop remainder, add 1: the 4th value).
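
The arithmetic in that example can be sketched as follows. The class
ValueIndexSketch and the method valueNumber are hypothetical names, and the
sketch assumes the scheme floated above: value k (counting from 0) starts at
k * posIncGap, and no value spans posIncGap positions or more.

```java
/** Hypothetical helper for the "position to value index" idea; not Lucene code. */
final class ValueIndexSketch {

    /**
     * Returns the 1-based number of the value that a term position falls in,
     * assuming value k (0-based) starts at position k * posIncGap.
     */
    static int valueNumber(int position, int posIncGap) {
        if (position < 0 || posIncGap <= 0) {
            throw new IllegalArgumentException("position must be >= 0 and posIncGap > 0");
        }
        // Integer division drops the remainder; +1 converts to a 1-based count.
        return position / posIncGap + 1;
    }

    public static void main(String[] args) {
        // 3092 / 1000 = 3 (remainder dropped), + 1 = 4: the 4th value.
        System.out.println(valueNumber(3092, 1000));
    }
}
```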
> always apply position increment gap between values
> --------------------------------------------------
>
> Key: LUCENE-2529
> URL: https://issues.apache.org/jira/browse/LUCENE-2529
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Affects Versions: 2.9.3, 3.0.2, 3.1, 4.0
> Environment: (I don't know which version to say this affects since
> it's some quasi trunk release and the new versioning scheme confuses me.)
> Reporter: David Smiley
> Assignee: Koji Sekiguchi
> Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments:
> LUCENE-2529_always_apply_position_increment_gap_between_values.patch,
> LUCENE-2529_skip_posIncr_for_1st_token.patch,
> LUCENE-2529_skip_posIncr_for_1st_token.patch
>
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> I'm doing some fancy stuff with span queries that is very sensitive to term
> positions. I discovered that the position increment gap on indexing is only
> applied between values when there are existing terms indexed for the
> document. I suspect this logic wasn't deliberate, it's just how it's always
> been for no particular reason. I think it should always apply the gap
> between fields. Reference DocInverterPerField.java line 82:
>   if (fieldState.length > 0)
>     fieldState.position +=
>         docState.analyzer.getPositionIncrementGap(fieldInfo.name);
> This is checking fieldState.length. I think the condition should simply be:
> if (i > 0).
> I don't think this change will affect anyone at all but it will certainly
> help me. Presently, I can either change this line in Lucene, or I can put in
> a hack so that the first value for the document is some dummy value which is
> wasteful.
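
The difference between the two conditions in the description can be sketched
in isolation. This is a simplified model, not the DocInverterPerField code:
the class GapConditionSketch and the method totalGap are hypothetical, and
lengthSoFar stands in for fieldState.length on the assumption that it counts
the tokens indexed so far for the field.

```java
/** Hypothetical model of the gap condition only; not Lucene code. */
final class GapConditionSketch {

    /**
     * Sums the gap added across all values of one field.
     *
     * @param tokensPerValue number of tokens each value produces (0 = blank)
     * @param gap            the position increment gap
     * @param proposed       true for "i > 0", false for "length > 0"
     */
    static int totalGap(int[] tokensPerValue, int gap, boolean proposed) {
        int total = 0;
        int lengthSoFar = 0; // assumed stand-in for fieldState.length
        for (int i = 0; i < tokensPerValue.length; i++) {
            boolean applyGap = proposed ? (i > 0) : (lengthSoFar > 0);
            if (applyGap) {
                total += gap;
            }
            lengthSoFar += tokensPerValue[i];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] tokensPerValue = {0, 2, 3}; // the first value is blank
        // Existing condition: the gap after the blank value is skipped.
        System.out.println(totalGap(tokensPerValue, 100, false));
        // Proposed condition: a gap before every value except the first.
        System.out.println(totalGap(tokensPerValue, 100, true));
    }
}
```

For a blank first value followed by two non-blank values and a gap of 100,
the existing condition adds 100 in total while the proposed one adds 200.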