[
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978466#action_12978466
]
Michael Busch commented on LUCENE-2324:
---------------------------------------
{quote}
I believe we can drop the delete in that case. We only need to buffer into
DWPTs that have at least 1 doc.
{quote}
Yeah, sounds right.
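Just to make the point concrete, a minimal sketch (the DWPT method names here
are made up for illustration, not the actual patch):
{code:java}
// Buffer a delete only into DWPTs that already hold at least one doc;
// an empty DWPT has nothing the delete could apply to.
// (getNumDocsInRAM() / bufferDeleteTerm() are illustrative names.)
void bufferDelete(Term term, Iterable<DocumentsWriterPerThread> activeDWPTs) {
  for (DocumentsWriterPerThread dwpt : activeDWPTs) {
    if (dwpt.getNumDocsInRAM() > 0) {
      dwpt.bufferDeleteTerm(term);
    }
  }
}
{code}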
{quote}
If a given DWPT is flushing then we pick another? Ie the binding logic would
naturally avoid DWPTs that are not available - either because another thread
has it, or it's flushing. But it would prefer to use the same DWPT it used last
time, if possible (affinity).
{quote}
This is actually what should already happen if the (default)
ThreadAffinityThreadPool is used. I have to check the code again, and maybe
write a test specifically for that.
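For reference, the binding logic should behave roughly like this sketch
(tryLock()/isFlushPending()/waitForFreeState() are illustrative names, not
the actual pool API):
{code:java}
// Prefer the thread state this indexing thread used last time (affinity);
// otherwise take any free, non-flushing one; otherwise wait.
ThreadState acquire(Thread requester) {
  ThreadState last = lastUsed.get(requester);     // per-thread affinity map
  if (last != null && last.tryLock()) {
    if (!last.isFlushPending()) {
      return last;                                // affinity hit
    }
    last.unlock();                                // it's flushing; pick another
  }
  for (ThreadState state : states) {
    if (state.tryLock()) {
      if (!state.isFlushPending()) {
        lastUsed.put(requester, state);           // remember for next time
        return state;
      }
      state.unlock();
    }
  }
  return waitForFreeState();                      // everything busy or flushing
}
{code}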
bq. Also: I thought we don't have sequence IDs anymore? (At least, for landing
DWPT; after that (for "true RT") we need something like sequence IDs?).
True, sequenceIDs are gone since the last merge. And yes, I still think we'll
need them for RT. Even in the non-RT case sequenceIDs would have nice
benefits: if methods like addDocument(), deleteDocuments(), etc. returned the
sequenceID, they'd define a strict ordering on those operations and make that
ordering transparent to the application, which would be beneficial for
document tracking and log replay.
But anyway, let's add seqIDs back once the DWPT changes are done and in trunk.
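A rough sketch of what that could look like (the long return type and the
seqIdGenerator field are assumptions; nothing like this is in trunk):
{code:java}
// java.util.concurrent.atomic.AtomicLong as the global sequence source.
private final AtomicLong seqIdGenerator = new AtomicLong();

public long addDocument(Document doc) throws IOException {
  // The ID has to be assigned atomically with buffering the operation,
  // otherwise it wouldn't define a strict ordering. A real implementation
  // would keep this critical section much smaller.
  synchronized (this) {
    long seqId = seqIdGenerator.incrementAndGet();
    threadState.dwpt.addDocument(doc, seqId);
    return seqId;   // the app can log this for tracking / replay
  }
}
{code}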
{quote}
bq. We shouldn't do global waiting anymore - this is what's great about DWPT.
However, we'll have global waiting for the flush-all-threads case. I think
that can move down to DW, though. Or should it simply be a sync in/on IW?
{quote}
True, the only global lock that locks all thread states is taken when
flushAllThreads is called, i.e. when IW explicitly triggers a flush,
e.g. on close/commit.
However, maybe this is not the right approach? I guess we don't really need
the global lock: a thread performing the "global flush" could still acquire
each thread state before flushing it, but return each threadState to the
pool as soon as that particular threadState is done flushing.
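Something like this sketch (same illustrative ThreadState API as above):
{code:java}
// Global flush without a global lock: each thread state is locked only
// while its own DWPT flushes and is returned to the pool right after,
// so indexing threads can reuse it before the full flush has finished.
void flushAllThreads() throws IOException {
  for (ThreadState state : pool.allStates()) {
    state.lock();                 // blocks new indexing on this state only
    try {
      state.dwpt.flush();         // write this DWPT's private segment
    } finally {
      state.unlock();             // available again immediately
    }
  }
}
{code}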
A related question is: do we want to piggyback on multiple threads when a
global flush happens? E.g. Thread 1 calls commit, and Thread 2 calls
addDocument() shortly afterwards. When should that addDocument() happen?
a) After all DWPTs have finished flushing?
b) After at least one DWPT has finished flushing and is available again?
c) Or should Thread 2 help flush DWPTs in parallel with Thread 1 (see the
sketch below)?
a) is what's currently implemented, but I don't think it's really what we want.
b) is probably best for RT, because it gives the lowest indexing latency for
the newly added document.
c) probably gives the best overall throughput (depending on hardware factors
like disk speed, etc.).
Whichever option we pick, we'll have to think carefully about error
handling. It's quite straightforward for a) (just commit all flushed segments
to SegmentInfos once the global flush has completed successfully). But for b)
and c) it's unclear what should happen if a DWPT flush fails after others
have already completed successfully.
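For c), I'm thinking of something along these lines (all names hypothetical,
just to sketch the idea):
{code:java}
// An indexing thread that arrives while a global flush is running helps
// flush pending DWPTs instead of waiting, then indexes its own document.
void addDocument(Document doc) throws IOException {
  while (flushControl.isFullFlushRunning()) {
    DocumentsWriterPerThread pending = flushControl.pollPendingFlush();
    if (pending == null) {
      break;                      // nothing left to steal; go index
    }
    pending.flush();              // piggyback on the global flush
  }
  ThreadState state = pool.acquire(Thread.currentThread());
  try {
    state.dwpt.addDocument(doc);
  } finally {
    pool.release(state);
  }
}
{code}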
> Per thread DocumentsWriters that write their own private segments
> -----------------------------------------------------------------
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael Busch
> Assignee: Michael Busch
> Priority: Minor
> Fix For: Realtime Branch
>
> Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch,
> LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, lucene-2324.patch,
> lucene-2324.patch, LUCENE-2324.patch, test.out, test.out
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.