[
https://issues.apache.org/jira/browse/LUCENE-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865113#action_12865113
]
Shai Erera commented on LUCENE-1585:
------------------------------------
I hate it when this happens, but better sooner than later: I realized the API
must take the current Term into account. We cannot process all the payloads in
the index the same way. So how about the following:
* PayloadProcessorProvider will accept both a Directory and a Term, and will
return a suitable PayloadProcessor for that Directory, and if needed, for the
Directory+Term combination.
* PayloadProcessor will continue to work as is and will expose the same API - a
payload is still a payload. It's the responsibility of PPP to return the right
PP instance for the given Dir+Term.
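
To make the proposal concrete, here is a rough, self-contained sketch of how the
two pieces might fit together. It is not the patch: Dir and FieldTerm are
hypothetical stand-ins for Directory and Term so that it compiles without Lucene,
and the method names, the "null means copy unchanged" convention and the
SingleFieldProvider example are all assumptions for illustration.

```java
import java.util.Arrays;

/** Hypothetical stand-in for Lucene's Directory; only a name, for illustration. */
final class Dir {
    final String name;
    Dir(String name) { this.name = name; }
}

/** Hypothetical stand-in for Lucene's Term (field + text). */
final class FieldTerm {
    final String field;
    final String text;
    FieldTerm(String field, String text) { this.field = field; this.text = text; }
}

/** A payload is still a payload: bytes in, bytes out. The signature is assumed. */
interface PayloadProcessor {
    byte[] process(byte[] payload);
}

/** Picks a processor per Dir+Term; null means "copy unchanged" (an assumption). */
interface PayloadProcessorProvider {
    PayloadProcessor getProcessor(Dir dir, FieldTerm term);
}

/** Example provider: rewrites payloads of one field from one source directory only. */
final class SingleFieldProvider implements PayloadProcessorProvider {
    private final String dirName;
    private final String field;
    private final byte version;

    SingleFieldProvider(String dirName, String field, byte version) {
        this.dirName = dirName;
        this.field = field;
        this.version = version;
    }

    @Override
    public PayloadProcessor getProcessor(Dir dir, FieldTerm term) {
        if (!dirName.equals(dir.name) || !field.equals(term.field)) {
            return null; // postings of other dirs/fields keep their payloads as-is
        }
        // Illustrative rewrite: prepend a version byte to the old payload.
        return payload -> {
            byte[] out = new byte[payload.length + 1];
            out[0] = version;
            System.arraycopy(payload, 0, out, 1, payload.length);
            return out;
        };
    }
}

/** What a merger might do for each payload it copies. */
final class PayloadMergeSketch {
    static byte[] mergePayload(PayloadProcessorProvider ppp, Dir dir,
                               FieldTerm term, byte[] payload) {
        PayloadProcessor pp = ppp.getProcessor(dir, term);
        return pp == null ? payload : pp.process(payload);
    }

    public static void main(String[] args) {
        PayloadProcessorProvider ppp =
            new SingleFieldProvider("old", "category", (byte) 1);
        byte[] merged = mergePayload(ppp, new Dir("old"),
            new FieldTerm("category", "a"), new byte[] {5, 6});
        System.out.println(Arrays.toString(merged)); // prints [1, 5, 6]
    }
}
```

The point of the split is that the merger only ever asks the provider, so the
decision about which postings get rewritten stays entirely in user code.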
It does not make sense to require that the payloads of every term in the
incoming indexes be processed. Specifically, the scenario I have at hand needs
to rewrite the payloads of certain postings only, but the index contains
payloads in other postings as well.
For 3x that's easy - SMI holds the Term that is currently being processed. But I
don't see an equivalent in trunk, in PostingsConsumer. It receives a DocsEnum,
which does not tell you the term it works on, and a MergeState, which includes
just FieldInfo, which as far as I can tell gives you only the field name. Any
ideas how I can get the Term this posting belongs to? (I know there is no Term
in trunk, but field + BytesRef will do.)
Mike - I'll add PP as a required arg to SM, np. I was only suggesting to pass
IW so that we can avoid changing it in the future, but explicit args are fine
by me.
> Allow to control how payloads are merged
> ----------------------------------------
>
> Key: LUCENE-1585
> URL: https://issues.apache.org/jira/browse/LUCENE-1585
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Reporter: Michael Busch
> Assignee: Shai Erera
> Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-1585_3x.patch, LUCENE-1585_3x.patch,
> LUCENE-1585_trunk.patch
>
>
> Lucene handles backwards-compatibility of its data structures by
> converting them from the old into the new formats during segment
> merging.
> Payloads are simply byte arrays in which users can store arbitrary
> data. Applications that use payloads might want to convert the format
> of their payloads in a similar fashion. Otherwise it's not easily
> possible to ever change the encoding of a payload without reindexing.
> So I propose to introduce a PayloadMerger class that the SegmentMerger
> invokes to merge the payloads from multiple segments. Users can then
> implement their own PayloadMerger to convert payloads from an old into
> a new format.
> In the future we need this kind of flexibility also for column-stride
> fields (LUCENE-1231) and flexible indexing codecs.
> In addition to that it would be nice if users could store version
> information in the segments file. E.g. they could store "in segment _2
> the term a:b uses payloads of format x.y".