[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979129#action_12979129 ]

Michael McCandless commented on LUCENE-2324:
--------------------------------------------

bq. I guess we don't really need the global lock. A thread performing the 
"global flush" could still acquire each thread state before it starts flushing, 
but return each threadState to the pool once that particular threadState is 
done flushing?

Good question... we could (in theory) also flush them concurrently?  But, since 
we don't "own" the threads in IW, we can't easily do that, so I think no global 
lock; go through all DWPTs w/ the current thread and flush them sequentially?  
So all that's guaranteed after the global flush() returns is that all state 
present prior to when flush() was invoked has been moved to disk.  Ie if 
addDocs are still happening concurrently then the DWPTs will start filling up 
again even while the "global flush" runs.  That's fine.

{quote}

A related question is: do we want to piggyback on multiple threads when a 
global flush happens?  Eg. Thread 1 calls commit(), and Thread 2 calls 
addDocument() shortly afterwards.  When should the addDocument() happen? 
a) After all DWPTs finished flushing? 
b) After at least one DWPT finished flushing and is available again?
c) Or should Thread 2 be used to help flushing DWPTs in parallel with Thread 1?

a) is what's currently implemented, but I think it's not really what we want.
b) is probably best for RT, because it means the lowest indexing latency for 
the new document to be added.
c) probably means the best overall throughput (though that also depends on 
hardware, like disk speed, etc.)
{quote}

I think we should start simple -- the addDocument always happens?  Ie it's 
never coordinated w/ the ongoing flush.  It picks a free DWPT like normal, and 
since the flush is single-threaded, there should always be a free DWPT?
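
Roughly, that uncoordinated addDocument path could look like the sketch below 
(reusing the hypothetical DWPT stand-in from the earlier sketch; the spin loop 
relies on the flush locking only one DWPT at a time):

{code:java}
import java.util.List;

class DocRouter {
  // Try-lock each DWPT and take the first free one.  Since the global
  // flush holds at most one DWPT at a time, some DWPT should always be
  // free, so the spin terminates quickly in practice.
  static DWPT pickFreeDWPT(List<DWPT> pool) {
    while (true) {
      for (DWPT dwpt : pool) {
        if (dwpt.lock.tryLock()) {
          return dwpt; // caller indexes the doc, then unlocks it
        }
      }
    }
  }
}
{code}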

Longer term, c) would be great; or, if IW has an ES (ExecutorService), it'd 
send multiple flush jobs to the ES.
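
Eg, the ES variant might look roughly like this (ExecutorService is standard 
JDK API, but the job shape and the DWPT stand-in are assumptions):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

class ParallelFlush {
  // Submit one flush job per DWPT and wait for all of them; any flush
  // failure is propagated through Future.get().
  static void flushAll(List<DWPT> pool, ExecutorService es)
      throws InterruptedException, ExecutionException {
    List<Future<?>> jobs = new ArrayList<>();
    for (DWPT dwpt : pool) {
      jobs.add(es.submit(() -> {
        dwpt.lock.lock();
        try {
          if (dwpt.numDocsInRAM > 0) dwpt.flush();
        } finally {
          dwpt.lock.unlock();
        }
        return null;
      }));
    }
    for (Future<?> job : jobs) {
      job.get();
    }
  }
}
{code}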

{quote}
For whatever option we pick, we'll have to think carefully about error 
handling.  It's quite straightforward for a) (just commit all flushed segments 
to SegmentInfos when the global flush completes successfully).  But for b) and 
c) it's unclear what should happen if a DWPT flush fails after some others have 
already completed successfully.
{quote}

I think we should continue what we do today?  Ie, if it's an 'aborting' 
exception, then the entire segment held by that DWPT is discarded?  And we then 
throw this exception back to the caller (and don't try to flush any other 
segments)?
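
Sketched out, that error handling might look like this (hypothetical DWPT 
stand-in again; abort() is approximated by just dropping the buffered docs):

{code:java}
import java.io.IOException;
import java.util.List;

class AbortingFlush {
  // On an "aborting" exception: discard the failing DWPT's entire in-RAM
  // segment, rethrow to the caller, and do not flush the remaining DWPTs.
  static void flushAll(List<DWPT> pool) throws IOException {
    for (DWPT dwpt : pool) {
      dwpt.lock.lock();
      try {
        dwpt.flush();
      } catch (IOException abortingExc) {
        dwpt.numDocsInRAM = 0; // stand-in for dwpt.abort(): drop the segment
        throw abortingExc;     // caller sees the failure immediately
      } finally {
        dwpt.lock.unlock();
      }
    }
  }
}
{code}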

> Per thread DocumentsWriters that write their own private segments
> -----------------------------------------------------------------
>
>                 Key: LUCENE-2324
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2324
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
> LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
> lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.
