Yeah, sounds like we have the same things in mind here. In fact, this is pretty similar to what we discussed a while ago on LUCENE-2026 I think.

SegmentWriter could be a higher-level interface with more than one implementation. E.g. there could be one SegmentWriter that supports appending documents (i.e. the DocumentsWriter today) and another that allows adding one term at a time, similar to what IW.addIndexes*() does today. Often when you rewrite entire parallel slices you don't want to use addDocument(). E.g. when you read from a source slice, modify some data and write a new version of that slice, it can be dramatically faster to write posting list after posting list, because you avoid parallel I/O and a lot of seeks. (By "dramatically faster" I mean e.g. 24 hrs vs. 8 mins, actual numbers from an implementation I had at IBM...)
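
To make the two write modes concrete, here is a minimal sketch in Java. It is not Lucene code: SegmentWriter, DocumentSegmentWriter, TermSegmentWriter and SliceRewriter are hypothetical names, and in-memory maps stand in for real segment files. It only illustrates why a slice rewrite can copy one complete posting list at a time instead of re-inverting documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical interface; nothing with this exact shape exists in Lucene.
interface SegmentWriter {
    /** Postings written so far: term -> ascending doc IDs, in term order. */
    SortedMap<String, List<Integer>> postings();
}

// Document-at-a-time mode, roughly the role DocumentsWriter plays today.
class DocumentSegmentWriter implements SegmentWriter {
    private final SortedMap<String, List<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Inverts one document; each term touches a different posting list. */
    int addDocument(List<String> terms) {
        int docId = nextDocId++;
        for (String term : terms) {
            List<Integer> docs = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Record a doc ID only once per term, even if the term repeats.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
        return docId;
    }

    public SortedMap<String, List<Integer>> postings() { return postings; }
}

// Term-at-a-time mode: each call writes one complete posting list.
class TermSegmentWriter implements SegmentWriter {
    private final SortedMap<String, List<Integer>> postings = new TreeMap<>();
    private String lastTerm = null;

    void addTerm(String term, List<Integer> docIds) {
        // Terms must arrive in sorted order so output stays sequential.
        if (lastTerm != null && term.compareTo(lastTerm) <= 0) {
            throw new IllegalArgumentException("terms out of order: " + term);
        }
        postings.put(term, new ArrayList<>(docIds));
        lastTerm = term;
    }

    public SortedMap<String, List<Integer>> postings() { return postings; }
}

// Rewrites a slice term by term, dropping the given doc IDs on the way.
class SliceRewriter {
    static TermSegmentWriter rewriteWithoutDocs(SegmentWriter source, Set<Integer> deleted) {
        TermSegmentWriter target = new TermSegmentWriter();
        for (Map.Entry<String, List<Integer>> entry : source.postings().entrySet()) {
            List<Integer> kept = new ArrayList<>();
            for (int docId : entry.getValue()) {
                if (!deleted.contains(docId)) {
                    kept.add(docId);
                }
            }
            if (!kept.isEmpty()) {
                target.addTerm(entry.getKey(), kept);
            }
        }
        return target;
    }
}
```

The rewrite loop reads and writes terms in the same sorted order, which is the access pattern that would avoid the seeks mentioned above; the sketch itself makes no claim about actual timings.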

Further, I imagine using the slice concept within Lucene itself. The store could be a separate slice, and so could the norms and the new flexible scoring data structures. It's then super easy to turn those off or rewrite them individually (see LUCENE-2025). Often parallel indexes don't need a store or norms, so this slice concept makes total sense in my opinion. Norms actually work like this already: you can rewrite them, which bumps up their generation number. We just have to make this concept more abstract, so that it can be used for any kind of slice. Many people have also asked about allowing Lucene to manage external data structures. I think these changes would allow exactly that: just implement your external data structure as a slice, and Lucene will call your code when merges, deletions and adds happen. Cool! :)
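
As a rough illustration of "implement your external data structure as a slice", here is a sketch in Java. The Slice interface, its callback names and ExternalLengthSlice are assumptions made up for this example, not an existing or proposed Lucene API. The slice keeps one value per document, follows adds, deletes and merges, and bumps a generation counter when it is rewritten, in the spirit of how norms behave today.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical callback interface that the index would drive.
interface Slice {
    void onAdd(int docId, String value);
    void onDelete(int docId);
    /** Called after a merge with a map from old doc IDs to new doc IDs. */
    void onMerge(Map<Integer, Integer> oldToNewDocId);
}

// Example external structure: stores the length of each document's value.
class ExternalLengthSlice implements Slice {
    private Map<Integer, Integer> lengths = new HashMap<>();
    private long generation = 0;

    public void onAdd(int docId, String value) {
        lengths.put(docId, value.length());
    }

    public void onDelete(int docId) {
        lengths.remove(docId);
    }

    public void onMerge(Map<Integer, Integer> oldToNewDocId) {
        // Renumber surviving documents; entries without a mapping are dropped.
        Map<Integer, Integer> remapped = new HashMap<>();
        for (Map.Entry<Integer, Integer> entry : lengths.entrySet()) {
            Integer newId = oldToNewDocId.get(entry.getKey());
            if (newId != null) {
                remapped.put(newId, entry.getValue());
            }
        }
        lengths = remapped;
    }

    /** Replaces the whole slice and bumps its generation. */
    void rewrite(Map<Integer, Integer> newLengths) {
        lengths = new HashMap<>(newLengths);
        generation++;
    }

    Map<Integer, Integer> lengths() { return new HashMap<>(lengths); }

    long generation() { return generation; }
}
```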

@Shai: If we implement parallel indexing outside of Lucene's core, then we have some of the same drawbacks as with the current master-slave approach. I'm especially worried about how that would then work with realtime indexing (both the searchable RAM buffer and NRT). I think PI must be completely segment-aware. Then it should fit very nicely into realtime indexing, which is also very cool!

 Michael


On 4/21/10 8:06 AM, Michael McCandless wrote:
I do think the idea of an abstract class (or interface) SegmentWriter
is compelling.

Each DWPT would be a [single-threaded] SegmentWriter.

And then we'd make a MultiThreadedSegmentWriterWrapper (manages a
collection of SegmentWriters, delegating to them, aggregating RAM used
across all, manages picking which ones to flush, etc.).

Then, a SlicedSegmentWriter (say) would write to separate slices,
single threaded, and then you could make it multi-threaded by wrapping
w/ the above class.

Though SegmentWriter isn't a great name since it would in general
write to multiple segments.  Indexer is a little too broad though :)

Something like that maybe?

Also, allowing an app to directly control the underlying
SegmentWriters inside IndexWriter (instead of letting the
multi-threaded wrapper decide for you) is compelling for way advanced
apps, I think.  EG your app may know it's done indexing from source A
for a while, so, you should right now go and flush it (whereas the
default "flush the one using the most RAM" could leave that source
unflushed for quite a while, tying up RAM, unless we do some kind of
LRU flushing policy or something).
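
A minimal Java sketch of the wrapper and the two flush paths described above. BufferingWriter and MultiWriterCoordinator are hypothetical names, RAM usage is approximated by text length, and a single coarse lock replaces the per-thread design a real implementation would want. It shows the default "flush the one using the most RAM" policy next to an explicit, app-driven flush of one source.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical single-threaded writer; only RAM accounting is modeled.
class BufferingWriter {
    private long ramBytesUsed = 0;
    private int flushCount = 0;

    void addDocument(String text) { ramBytesUsed += text.length(); }

    void flush() {
        ramBytesUsed = 0;
        flushCount++;
    }

    long ramBytesUsed() { return ramBytesUsed; }

    int flushCount() { return flushCount; }
}

// Hypothetical wrapper: delegates to one writer per source and aggregates RAM.
class MultiWriterCoordinator {
    private final Map<String, BufferingWriter> writers = new LinkedHashMap<>();
    private final long ramBudgetBytes;

    MultiWriterCoordinator(long ramBudgetBytes) {
        if (ramBudgetBytes < 0) {
            throw new IllegalArgumentException("budget must be >= 0");
        }
        this.ramBudgetBytes = ramBudgetBytes;
    }

    synchronized void addDocument(String source, String text) {
        writers.computeIfAbsent(source, s -> new BufferingWriter()).addDocument(text);
        // Default policy: while over budget, flush the biggest RAM consumer.
        while (totalRamBytesUsed() > ramBudgetBytes) {
            flushLargest();
        }
    }

    synchronized long totalRamBytesUsed() {
        long total = 0;
        for (BufferingWriter writer : writers.values()) {
            total += writer.ramBytesUsed();
        }
        return total;
    }

    synchronized void flushLargest() {
        writers.values().stream()
                .max(Comparator.comparingLong(BufferingWriter::ramBytesUsed))
                .ifPresent(BufferingWriter::flush);
    }

    /** App-driven flush of one source; returns false if the source is unknown. */
    synchronized boolean flush(String source) {
        BufferingWriter writer = writers.get(source);
        if (writer == null) {
            return false;
        }
        writer.flush();
        return true;
    }

    synchronized long ramBytesUsed(String source) {
        BufferingWriter writer = writers.get(source);
        return writer == null ? 0 : writer.ramBytesUsed();
    }
}
```

With only the default policy, a small idle source stays buffered until the total exceeds the budget and it happens to be the largest; the explicit flush(source) call is the hook for an app that knows it is done with that source.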

Mike

On Wed, Apr 21, 2010 at 2:27 AM, Shai Erera<[email protected]>  wrote:
I'm not sure that a Parallel DW would work for PI because DW is too internal
to IW. Currently, the approach I've been thinking about for PI is to tackle
it from a high level, e.g. allow the application to pass a Directory, or
even an IW instance, and PI will play the coordinator role, ensuring that
merge of segments happens across all the slices in accordance, implementing
two-phase operations etc. A Parallel DW then does not fit nicely w/ that
approach (unless we want to refactor how IW works completely) because DW is
not aware of the Directory, and if PI indeed works over IW instances, then
each will have its own DW.

So there are two basic approaches we can take for PI (following current
architecture) - either let PI manage IW, or have PI be a sort of IW itself,
which handles events at a much lower level. While the latter is more robust
(and based on current limitations I'm running into, might be even easier to
do), it lacks the flexibility of allowing the app to plug any IW it wants.
That requirement is also important, if the application wants to use PI in
scenarios where it keeps some slices in RAM and some on disk, or it wants to
control more closely which fields go to which slice, so that it can at some
point in time "rebuild" a certain slice outside PI and replace the existing
slice in PI w/ the new one ...

We should probably continue the discussion on PI, so I suggest we either
move it to another thread or on the issue directly.

Mike - I agree w/ you that we should keep the life of the application
developers easy and that having IW itself support concurrency is beneficial.
Like I said ... it was just a thought which was aimed at keeping our life
(Lucene developers) easier, but that probably comes second compared to
app-devs' life :). I'm also not at all sure that it would have made our
life easier ...

So I'm good if you want to drop the discussion.

Shai

On Tue, Apr 20, 2010 at 8:16 PM, Michael Busch<[email protected]>  wrote:
On 4/19/10 10:25 PM, Shai Erera wrote:
It will definitely simplify multi-threaded handling for IW extensions
like Parallel Index …

I'm keeping Parallel indexing in mind.  After we have separate DWPT I'd
like to introduce parallel DWPTs, that write different slices.
  Synchronization should not be a big worry then, because writing is
single-threaded.

We could introduce a new abstract class SegmentWriter, which DWPT would
implement.  An extension would be ParallelSegmentWriter, which would manage
multiple SegmentWriters.   Or maybe SegmentSliceWriter would be a better
name.
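
A small Java sketch of what such a ParallelSegmentWriter could look like. Everything here is hypothetical: SliceWriter is a stand-in for a per-slice DWPT, and routing fields to slices by name is just one possible policy. The point it illustrates is that every slice receives an entry for every document, so doc IDs stay aligned across slices.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-threaded writer for one slice.
class SliceWriter {
    private final List<Map<String, String>> docs = new ArrayList<>();

    int addDocument(Map<String, String> fields) {
        docs.add(new HashMap<>(fields));
        return docs.size() - 1;
    }

    int docCount() { return docs.size(); }

    Map<String, String> document(int docId) { return docs.get(docId); }
}

// Hypothetical coordinator that splits each document across slices.
class ParallelSegmentWriter {
    private final Map<String, SliceWriter> slices = new HashMap<>();
    private final Map<String, String> fieldToSlice;
    private final String defaultSlice;

    ParallelSegmentWriter(Map<String, String> fieldToSlice, String defaultSlice) {
        this.fieldToSlice = new HashMap<>(fieldToSlice);
        this.defaultSlice = defaultSlice;
        slices.put(defaultSlice, new SliceWriter());
        for (String sliceName : fieldToSlice.values()) {
            slices.putIfAbsent(sliceName, new SliceWriter());
        }
    }

    int addDocument(Map<String, String> fields) {
        // Start with an empty partial document for every slice, so that
        // each slice gets an entry even if none of its fields are present.
        Map<String, Map<String, String>> perSlice = new HashMap<>();
        for (String sliceName : slices.keySet()) {
            perSlice.put(sliceName, new HashMap<>());
        }
        for (Map.Entry<String, String> field : fields.entrySet()) {
            String sliceName = fieldToSlice.getOrDefault(field.getKey(), defaultSlice);
            perSlice.get(sliceName).put(field.getKey(), field.getValue());
        }
        int docId = -1;
        for (Map.Entry<String, Map<String, String>> entry : perSlice.entrySet()) {
            int id = slices.get(entry.getKey()).addDocument(entry.getValue());
            if (docId != -1 && id != docId) {
                throw new IllegalStateException("slices out of sync");
            }
            docId = id;
        }
        return docId;
    }

    SliceWriter slice(String name) { return slices.get(name); }
}
```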

  Michael

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

