[jira] Commented: (LUCENE-1585) Allow to control how payloads are merged

Shai Erera (JIRA) Fri, 07 May 2010 04:32:23 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865113#action_12865113 ]


Shai Erera commented on LUCENE-1585:
------------------------------------

I hate it when it happens, but better sooner than later - I realized the API 
must take into account the current Term. We cannot process all the payloads in 
the index the same way. So how about the following:
* PayloadProcessorProvider will accept both a Directory and a Term, and will 
return a suitable PayloadProcessor for that Directory, and if needed, for the 
Directory+Term combination.
* PayloadProcessor will continue to work as is and will expose the same API - a 
payload is still a payload. It's the responsibility of PPP to return the right 
PP instance for the given Dir+Term.

It does not make sense to require that the payloads of all the terms in the 
incoming indexes be processed. Specifically, the scenario I have at hand needs 
to rewrite the payloads of certain postings only, but the index contains 
payloads in other postings as well.
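
To make the proposal concrete, here is a minimal sketch of how the two pieces could fit together. It is not the patch: Directory and Term below are hypothetical stand-ins so that the snippet compiles on its own, and the method names (getProcessor, process, mergePayload) as well as the "null means copy the payload as is" convention are assumptions made for illustration only.

```java
// Hypothetical stand-ins for Lucene's Directory and Term (assumptions, not the real classes).
final class Directory {
    final String name;
    Directory(String name) { this.name = name; }
}

final class Term {
    final String field;
    final String text;
    Term(String field, String text) { this.field = field; this.text = text; }
}

// Unchanged by the proposal: a payload goes in, a payload comes out.
interface PayloadProcessor {
    byte[] process(byte[] payload);
}

// The proposed change: the provider is asked per Directory+Term.
interface PayloadProcessorProvider {
    // Assumed convention: null means "leave payloads of this Dir+Term untouched".
    PayloadProcessor getProcessor(Directory dir, Term term);
}

// Example provider that rewrites payloads of one field in one incoming index only.
final class SingleFieldProvider implements PayloadProcessorProvider {
    private final Directory target;
    private final String field;
    private final PayloadProcessor processor;

    SingleFieldProvider(Directory target, String field, PayloadProcessor processor) {
        this.target = target;
        this.field = field;
        this.processor = processor;
    }

    @Override
    public PayloadProcessor getProcessor(Directory dir, Term term) {
        // Reference equality on the directory is enough for this sketch.
        if (dir == target && field.equals(term.field)) {
            return processor;
        }
        return null;
    }
}

final class MergeSketch {
    // What a merger could do for each payload it copies from an incoming index.
    static byte[] mergePayload(PayloadProcessorProvider ppp, Directory dir,
                               Term term, byte[] payload) {
        PayloadProcessor pp = ppp.getProcessor(dir, term);
        return pp == null ? payload : pp.process(payload);
    }
}
```

With something of this shape, payloads in postings that the application does not care about pass through the merge unchanged, and only the selected Dir+Term combinations pay the cost of rewriting.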

For 3x that's easy - SMI holds the Term that is currently being processed. But 
I don't see an equivalent in trunk, in PostingsConsumer. It receives a DocsEnum, 
which does not tell you the term it works on, and a MergeState, which includes 
just the FieldInfo, which (I think) can tell you the field name. Any ideas how 
I can get the Term this posting belongs to? (I know there is no Term in trunk, 
but field + BytesRef will do.)

Mike - I'll add PP as a required arg to SM, np. I was only suggesting to pass 
IW so that we can avoid changing it in the future, but explicit args are fine 
by me.

> Allow to control how payloads are merged
> ----------------------------------------
>
>                 Key: LUCENE-1585
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1585
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Shai Erera
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-1585_3x.patch, LUCENE-1585_3x.patch, 
> LUCENE-1585_trunk.patch
>
>
> Lucene handles backwards-compatibility of its data structures by
> converting them from the old into the new formats during segment
> merging. 
> Payloads are simply byte arrays in which users can store arbitrary
> data. Applications that use payloads might want to convert the format
> of their payloads in a similar fashion. Otherwise it's not easily
> possible to ever change the encoding of a payload without reindexing.
> So I propose to introduce a PayloadMerger class that the SegmentMerger
> invokes to merge the payloads from multiple segments. Users can then
> implement their own PayloadMerger to convert payloads from an old into
> a new format.
> In the future we need this kind of flexibility also for column-stride
> fields (LUCENE-1231) and flexible indexing codecs.
> In addition to that it would be nice if users could store version
> information in the segments file. E.g. they could store "in segment _2
> the term a:b uses payloads of format x.y".

