Re: incremental document field update

Babak Farhang Sun, 17 Jan 2010 01:34:29 -0800

Thanks Mike!  This is pretty cool..

So LUCENE-1879 takes care of aligning (syncing) doc-ids across
parallel index / segment merges. Missing is the machinery for
updating a field (or fields) in a parallel slave index: to do this the
appropriate segment in the slave index must somehow be rewritten.
(right?)


In order to support this feature, I'm thinking one could implement
an UpdateableSlaveIndexWriter / UpdateableSlaveIndexReader
pair. UpdateableSlaveIndexWriter is an IndexWriter with a special
updateField[s] method that uses a (I think, per-segment, not sure)
write-ahead log to map a to-be-deleted doc-id in to a newly created
doc-id.  It also maintains other statistics such as a virtual maxDocId.
The UpdateableSlaveIndexReader would play back this log file (if
any) and maintain the mappings in-memory.

I imagine there would also have to be a UpdateableSlaveIndexReader
variant that works on individual slave index segments rather than the
entire collection of segments that comprise the slave index. This, say,
UpdateableSlaveIndexSegReader would be used in lieu of the
SegmentReader instances currently used for merging segments.

The SegmentMerger class appears to be sufficiently agnostic about the
exact type of the IndexReader instances that represent the inputs to the
merge (though it does favor SegmentReader subclasses). So far so
good.  However, there appear to be a number of other junctures in the
merge code that depend on the static SegmentReader.get methods.
These would have to be indirected through some factory mechanism
so that we could return our UpdateableSlaveIndexSegReader.

Is the approach I'm describing sensible? Or did I get it all wrong? (This
was my first look at this code, so it is entirely possible that I'm making
a fool of myself ;)

-Babak




On Thu, Jan 14, 2010 at 3:39 AM, Michael McCandless
<luc...@mikemccandless.com> wrote:
> Parallel incremental indexing
> (http://issues.apache.org/jira/browse/LUCENE-1879) is one way to solve
> this.
>
> Mike
>
> On Thu, Jan 14, 2010 at 4:27 AM, Babak Farhang <farh...@gmail.com> wrote:
>>> Reading that trail, I wish the original poster gave up on his idea (
>>
>> Err, that should have read..
>>
>> "Reading that trail, I wish the original poster hadn't given up on his idea"
>>
>>
>> On Thu, Jan 14, 2010 at 2:23 AM, Babak Farhang <farh...@gmail.com> wrote:
>>> Hi,
>>>
>>> I've been thinking about how to update a single field of a document
>>> without touching its other fields. This is an old problem and I was
>>> considering a solution along the lines of Andrzej Bialecki's post to
>>> the dev list back in '07:
>>>
>>>
>>> <quote  http://markmail.org/message/tbkgmnilhvrt6bii >
>>>
>>> I have the following scenario: I want to use ParallelReader to
>>> maintain parts of the index that are changing quickly, and where
>>> changes are limited to specific fields only.
>>>
>>> Let's say I have a "main" index (many fields, slowly changing, large
>>> updates), and an "aux" index (fast changing, usually single doc and
>>> single field updates). I'd like to "replace" documents in the "aux"
>>> index - that is, delete one doc and add another - but in a way that
>>> doesn't change the internal document numbers, so that I can keep the
>>> mapping required by ParallelReader intact.
>>>
>>> I think this is possible to achieve by using a FilterIndexReader,
>>> which keeps a map of updated documents, and re-maps old doc ids to the
>>> new ones on the fly.
>>>
>>> From time to time I'd like to optimize the "aux" index to get rid of
>>> deleted docs. At this time I need to figure out how to preserve the
>>> old->new mapping during the optimization.
>>>
>>> So, here's the question: is this scenario feasible? If so, then in the
>>> trunk/ version of Lucene, is there any way to figure out (predictably)
>>> how internal document numbers are reassigned after calling optimize()
>>> ?
>>>
>>> </quote>
>>>
>>>
>>> Reading that trail, I wish the original poster gave up on his idea (
>>> http://markmail.org/message/tbkgmnilhvrt6bii#query:+page:1+mid:kn77zpiu43kd2ufn+state:results
>>> )
>>>
>>>
>>> <quote>
>>> Thanks for the input - for now I gave up on this, after discovering
>>> that I would have no way to ensure in TermDocs.skipTo() that document
>>> id-s are monotonically increasing (which seems to be a part of the
>>> contract).
>>> </quote>
>>>
>>> I imagine if Andrzej's proposed FilterIndexReader maintains 2 sorted
>>> (ordered) maps, one from internal document-ids to "view" document-ids,
>>> and another mapping from  "view" document-ids to internal
>>> document-ids, then things like skipTo() can be implemented reasonably
>>> efficiently. Only the mapped ids are maintained in these structures.
>>> (Also note that a mapped "view" document-id represents an internally
>>> deleted document with that id.)
>>>
>>> And if we can find a way to merge the segments of this "aux" index
>>> along whenever the segments of its associated "main" index are merged
>>> or optimized (such that the [internal] doc-ids in the merged aux index
>>> end up getting sync'ed with those of the trunk), then there shouldn't
>>> be all that many doc-ids to map anyway (if we merge frequently
>>> enough).
>>>
>>> So to go back Andrzej's question: is there any way to figure out
>>> (predictably) how internal document numbers [in the main index] are
>>> reassigned after calling optimize() ? How does LUCENE-847, as Doug
>>> Cutting suggests in that trail, help?
>>>
>>> Sorry if that was long winded, had to start somewhere ;)
>>>
>>> -Babak
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: incremental document field update

Reply via email to