[jira] Commented: (LUCENE-2755) Some improvements to CMS

Earwin Burrfoot (JIRA) Wed, 24 Nov 2010 09:09:36 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935400#action_12935400 ]


Earwin Burrfoot commented on LUCENE-2755:
-----------------------------------------

bq. Refactor IW, MS and MP so that MS pulls merges directly from MP, instead 
from IW.
Directly or through IW, it doesn't matter. The important point is pulling 
merges one by one, when you have the resources to execute them.

bq. Rewrite CMS to take advantage of ThreadPoolExecutor instead of managing the 
threads on its own, in addition to using a blocking queue instead of us coding 
the blocking directly.
bq. Using ThreadPoolExecutor looks like will only complicate CMS instead of 
simplifying it:
I ended up with the same conclusion while taking my first stabs at it, but for 
different reasons.
The philosophy of Executors is that you schedule (push) a number of tasks, and 
then some magic black box runs them for you, resolving threading issues itself.
My suggestion requires pulling tasks when computing resources become available, 
and that doesn't map onto the scheduling model at all.
All priority/pausing/breaking issues are largely irrelevant.
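
To make the push/pull distinction concrete, here is a minimal sketch of a pulling worker loop. It is only an illustration: `PullingWorkers`, `runAll` and the `nextTask` supplier are hypothetical names, not part of CMS or of java.util.concurrent. Each worker asks for its next task only when it is idle, so the source of tasks can decide based on the current state instead of a queue filled in advance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch: none of these names come from Lucene.
final class PullingWorkers {

    /**
     * Starts `workers` threads. Each thread asks `nextTask` for work only when
     * it is idle, and exits once the supplier returns null.
     */
    static void runAll(int workers, Supplier<Runnable> nextTask) {
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(() -> {
                while (true) {
                    Runnable task;
                    // The task is chosen at the last possible moment, one
                    // worker at a time, so the supplier sees current state.
                    synchronized (nextTask) {
                        task = nextTask.get();
                    }
                    if (task == null) {
                        return; // nothing left to do
                    }
                    task.run();
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return;
            }
        }
    }
}
```

With an Executor the equivalent would be a series of `submit` calls made up front, which is exactly the point at which the choice of task gets fixed too early.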

bq. MergeThreads' priority needs to be controllable, and we need the ability to 
pause large merges in favor of small ones
These, and the like, are not requirements.
They are just one possible solution to our real requirements, which look like:
* don't run out of file handles during fast indexing
* don't degrade search performance and NRT turnaround
* don't kill the disk with too many random IOs.

bq. If there are cascading merges (i.e., a result of several other merges), 
they should all be executed following the call to MS.merge() - that is, it 
could be that CMS itself, or its MergeThreads will encounter merges not 
returned by MP at first, but as a subsequent round due to changes done to the 
index.
This is trivially solved with my pulling model. We pull until nothing is left. 
Period. Instead of getting batches of merges from MP and then reconciling them 
with reality, we do the same operation over and over again until MP is 
satisfied. Very simple.
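
A rough sketch of that loop, under heavy simplification: segments are modelled as plain integer sizes, and the policy (a hypothetical `nextMergeSize`, not a Lucene method) merges any two segments of equal size. The cascading merge is never computed in advance; it simply shows up on a later pull because the index has changed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: segments are just sizes, and the policy is a toy.
final class PullUntilDone {

    /** Stateless toy policy: size of two equal segments to merge, or -1 if none. */
    static int nextMergeSize(List<Integer> segments) {
        List<Integer> sorted = new ArrayList<>(segments);
        Collections.sort(sorted);
        for (int i = 0; i + 1 < sorted.size(); i++) {
            if (sorted.get(i).equals(sorted.get(i + 1))) {
                return sorted.get(i);
            }
        }
        return -1;
    }

    /** Pulls one merge at a time until the policy has nothing left; returns the merge count. */
    static int mergeUntilSatisfied(List<Integer> segments) {
        int merges = 0;
        int size;
        while ((size = nextMergeSize(segments)) != -1) {
            // "Execute" the merge: two segments of `size` become one of 2 * size.
            segments.remove(Integer.valueOf(size));
            segments.remove(Integer.valueOf(size));
            segments.add(2 * size);
            merges++;
        }
        return merges;
    }
}
```

Starting from four segments of size 1, the first two pulls produce two segments of size 2, and only the third pull sees the merge that combines them.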

bq. The proposal will add a getNextMerge() to MP, instead of IW, which IMO will 
only complicate matters for MP implementers. E.g., what should MP do if 
findRegularMerges was called, then getNext() was called and then 
findOptimizeMerges is called? It's not a critical decision we leave in the MP 
developers, but IMO it's unnecessary. Today MP is a stateless object - it 
receives SegmentInfos and returns a MergeSpec. It doesn't need to 'remember' 
anything. But if we move the getNextMerge() to it, we make it stateful, for no 
good reasons
bq. We don't really take IW outside the loop really - it would still need to 
instruct MP which merges to 'prepare', so that MS can take.
There will most probably be getNext(Normal/Optimize/Expunge)Merge() methods. 
The findWhatever methods will be removed, since no one needs to call them, so 
there is no state and no 'preparations'.
MP will receive SegmentInfos and return a OneMerge.
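
One possible shape for such an interface, as a sketch only: `Seg`, `PullPolicy` and `TwoSmallest` are hypothetical stand-ins for SegmentInfos, MP and a concrete policy, and the method signatures are assumptions rather than a proposal from the patch. The point is that every call is a pure function of the segments passed in, so the order in which the methods are called cannot matter.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: these types only mimic the roles of SegmentInfos,
// OneMerge and MergePolicy; they are not Lucene classes.
final class StatelessPolicySketch {

    /** Stand-in for one segment: a name and a size in bytes. */
    static final class Seg {
        final String name;
        final long bytes;

        Seg(String name, long bytes) {
            this.name = name;
            this.bytes = bytes;
        }
    }

    /** Pull-style policy: each call looks only at the segments passed in. */
    interface PullPolicy {
        /** Returns the segments to merge next, or an empty list if none. */
        List<Seg> getNextNormalMerge(List<Seg> segments);

        /** Returns the next merge needed to get down to maxSegments, or an empty list. */
        List<Seg> getNextOptimizeMerge(List<Seg> segments, int maxSegments);
    }

    /** Toy policy: merge the two smallest segments while there are too many. */
    static final class TwoSmallest implements PullPolicy {
        private final int maxNormalSegments; // configuration only, never mutated

        TwoSmallest(int maxNormalSegments) {
            this.maxNormalSegments = maxNormalSegments;
        }

        @Override
        public List<Seg> getNextNormalMerge(List<Seg> segments) {
            return pick(segments, maxNormalSegments);
        }

        @Override
        public List<Seg> getNextOptimizeMerge(List<Seg> segments, int maxSegments) {
            return pick(segments, maxSegments);
        }

        private static List<Seg> pick(List<Seg> segments, int limit) {
            if (segments.size() <= Math.max(1, limit)) {
                return new ArrayList<>(); // nothing to do
            }
            List<Seg> sorted = new ArrayList<>(segments);
            sorted.sort(Comparator.comparingLong((Seg s) -> s.bytes));
            return new ArrayList<>(sorted.subList(0, 2));
        }
    }
}
```

Calling `getNextOptimizeMerge` twice on the same segment list gives the same answer, which is what "no state" means here.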

bq. To allow for MP dependent sort, I suggest we add to MP a 
getMergesComparator and use it in CMS.
MP should return merges sorted, that's all. Why do you need to expose its 
Comparator or whatever it uses for sorting?
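
As an illustration of keeping the ordering private, a sketch with hypothetical names (`SortedMergesSketch`, `PendingMerge`, `findMerges`) and an assumed smallest-first order by bytes: the comparator is an implementation detail, and the caller only ever sees a list that is already in execution order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: PendingMerge is a stand-in, not Lucene's OneMerge.
final class SortedMergesSketch {

    static final class PendingMerge {
        final String name;
        final long totalBytes;

        PendingMerge(String name, long totalBytes) {
            this.name = name;
            this.totalBytes = totalBytes;
        }
    }

    // The ordering stays private to the policy; smallest-first is an assumption.
    private static final Comparator<PendingMerge> SMALLEST_FIRST =
            Comparator.comparingLong(m -> m.totalBytes);

    /** Returns a sorted copy of the candidates; the input list is left untouched. */
    static List<PendingMerge> findMerges(List<PendingMerge> candidates) {
        List<PendingMerge> sorted = new ArrayList<>(candidates);
        sorted.sort(SMALLEST_FIRST);
        return sorted;
    }
}
```

A scheduler consuming this list just takes merges from the front and never needs to know whether the policy sorted by bytes, docs or anything else.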


Whatever I didn't mention from your post, I either missed or agree with :)
I think I'll stop trying to explain it in Jira comments. It took a great deal 
of time discussing everything with Mike over IRC, and here it'll take ages.
The proper route is to take a handful of dirt and sticks and slap together some 
working code to illustrate my point. And that's what I'm gonna do.

> Some improvements to CMS
> ------------------------
>
>                 Key: LUCENE-2755
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2755
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>
> While running optimize on a large index, I've noticed several things that got 
> me to read CMS code more carefully, and find these issues:
> * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
> MergeThreads taking merges from the IndexWriter until they are exhausted, and 
> only then that blocked merge will run. I think it's unnecessary that that 
> merge will be blocked.
> * CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
> default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
> size segments anymore, I think we should switch the default impl. There are 
> two ways to make it extensible, if we want:
> ** Have an overridable member/method in CMS that you can extend and override 
> - easy.
> ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
> bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
> several places in the code, so more risky and complicated.
> On the go, I'd like to add some documentation to CMS - it's not very easy to 
> read and follow.
> I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


