[
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935400#action_12935400
]
Earwin Burrfoot commented on LUCENE-2755:
-----------------------------------------
bq. Refactor IW, MS and MP so that MS pulls merges directly from MP, instead
of from IW.
Directly or through IW, this is not important. The important point is pulling
merges one by one, when you have the resources to execute them.
bq. Rewrite CMS to take advantage of ThreadPoolExecutor instead of managing the
threads on its own, in addition to using a blocking queue instead of us coding
the blocking directly.
bq. It looks like using ThreadPoolExecutor will only complicate CMS instead of
simplifying it:
I ended up with the same conclusion while taking my first stabs, but for
different reasons.
The philosophy of Executors is that you schedule (push) a number of tasks, and
then some magic black box runs them for you, resolving threading issues itself.
My suggestion requires pulling tasks when computing resources become available,
and that doesn't map onto the scheduling model at all.
All priority/pausing/breaking issues are largely irrelevant.
bq. MergeThreads' priority needs to be controllable, and we need the ability to
pause large merges in favor of small ones
These, and the like, are not requirements.
They are just one possible solution to our real requirements, which look like:
* don't run out of file handles during fast indexing
* don't degrade search performance and NRT turnaround
* don't kill the disk with too many random IOs.
bq. If there are cascading merges (i.e., a result of several other merges),
they should all be executed following the call to MS.merge() - that is, it
could be that CMS itself, or its MergeThreads will encounter merges not
returned by MP at first, but as a subsequent round due to changes done to the
index.
This is trivially solved with my pulling model. We pull until nothing is left.
Period. Instead of getting batches of merges from MP and then reconciling them
with reality we do the same operation over and over again, until MP is
satisfied - very simple.
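To make the pulling loop concrete, here is a rough sketch, not Lucene code.
ToyIndex, nextMerge() and drain() are all made-up names, a "segment" is just an
integer size, and the "policy" is simply "merge the two smallest while at least
two segments exist". Each worker asks for exactly one merge when it is free and
quits when it gets null, so a merged segment that shows up later (a cascading
merge) is picked up by whichever worker pulls next.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the pull model; none of these names are Lucene APIs.
final class PullMergeSketch {

    /** Toy index state: a list of segment sizes, guarded by this object's lock. */
    static final class ToyIndex {
        private final List<Integer> segments = new ArrayList<>();

        ToyIndex(List<Integer> initial) {
            segments.addAll(initial);
        }

        /** Plays the role of "ask the policy for one merge"; null means nothing to do. */
        synchronized Runnable nextMerge() {
            if (segments.size() < 2) {
                return null;
            }
            Collections.sort(segments);
            int a = segments.remove(0); // claim the two smallest segments
            int b = segments.remove(0);
            return () -> addSegment(a + b); // the result may feed a later merge
        }

        private synchronized void addSegment(int size) {
            segments.add(size);
        }

        synchronized List<Integer> snapshot() {
            return new ArrayList<>(segments);
        }
    }

    /** Each worker pulls one merge at a time and stops when the pull returns null. */
    static void drain(ToyIndex index, int workers) {
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(() -> {
                Runnable merge;
                while ((merge = index.nextMerge()) != null) {
                    merge.run();
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for merges", e);
            }
        }
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex(Arrays.asList(1, 1, 1, 1, 1, 1, 1, 1));
        drain(index, 3);
        System.out.println(index.snapshot()); // prints [8]
    }
}
```

One caveat with this toy version: a worker that pulls null while another merge
is still running exits early, so parallelism can drop near the end. Every merge
still runs, because the worker that finishes a merge always pulls again.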
bq. The proposal will add a getNextMerge() to MP, instead of IW, which IMO will
only complicate matters for MP implementers. E.g., what should MP do if
findRegularMerges was called, then getNext() was called and then
findOptimizeMerges is called? It's not a critical decision we leave to the MP
developers, but IMO it's unnecessary. Today MP is a stateless object - it
receives SegmentInfos and returns a MergeSpec. It doesn't need to 'remember'
anything. But if we move the getNextMerge() to it, we make it stateful, for no
good reasons
bq. We don't really take IW out of the loop - it would still need to
instruct MP which merges to 'prepare', so that MS can take them.
There will most probably be getNext(Normal/Optimize/Expunge)Merge() methods.
The findWhatever methods will be removed, since no one needs to call them, so -
no state, no 'preparations'.
MP will receive SegmentInfos and return a OneMerge.
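Roughly, the shape could look like the sketch below. Only the method name
pattern comes from the comment above; the signatures, the toy Segment and
OneMerge classes, the maxSegments parameter and SmallestFirstPolicy are
assumptions made up for illustration, not the actual Lucene classes. The point
is that each call is a function of the segments passed in, so the policy keeps
nothing between calls and the order of calls cannot matter.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; these are toy stand-ins, not the real Lucene types.
final class StatelessPolicySketch {

    /** Toy stand-in for per-segment metadata. */
    static final class Segment {
        final String name;
        final long sizeBytes;

        Segment(String name, long sizeBytes) {
            this.name = name;
            this.sizeBytes = sizeBytes;
        }
    }

    /** Toy stand-in for one merge: the segments to combine. */
    static final class OneMerge {
        final List<Segment> segments;

        OneMerge(List<Segment> segments) {
            this.segments = segments;
        }
    }

    /** Assumed shape: current segments in, at most one merge out, null when satisfied. */
    interface PullMergePolicy {
        OneMerge getNextNormalMerge(List<Segment> infos);

        OneMerge getNextOptimizeMerge(List<Segment> infos, int maxSegments);
    }

    /** Holds only immutable configuration, so nothing is remembered between calls. */
    static final class SmallestFirstPolicy implements PullMergePolicy {
        // The ordering stays private to the policy; callers only see the chosen merge.
        private static final Comparator<Segment> BY_SIZE =
            Comparator.comparingLong((Segment s) -> s.sizeBytes);

        private final int mergeFactor;

        SmallestFirstPolicy(int mergeFactor) {
            if (mergeFactor < 2) {
                throw new IllegalArgumentException("mergeFactor must be at least 2");
            }
            this.mergeFactor = mergeFactor;
        }

        @Override
        public OneMerge getNextNormalMerge(List<Segment> infos) {
            return infos.size() > mergeFactor ? smallest(infos, mergeFactor) : null;
        }

        @Override
        public OneMerge getNextOptimizeMerge(List<Segment> infos, int maxSegments) {
            int excess = infos.size() - maxSegments;
            if (excess <= 0) {
                return null;
            }
            // Merging k segments removes k - 1 of them, so excess + 1 is enough.
            return smallest(infos, Math.min(mergeFactor, excess + 1));
        }

        private static OneMerge smallest(List<Segment> infos, int count) {
            List<Segment> sorted = new ArrayList<>(infos);
            sorted.sort(BY_SIZE);
            return new OneMerge(new ArrayList<>(sorted.subList(0, count)));
        }
    }
}
```

A caller would invoke one of the getNext methods repeatedly with the current
segment list, run the returned merge, and stop on null.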
bq. To allow for MP dependent sort, I suggest we add to MP a
getMergesComparator and use it in CMS.
MP should return merges sorted, that's all. Why do you need to expose its
Comparator or whatever it uses for sorting?
Whatever I didn't mention from your post - I either missed, or agree with :)
I think I'll stop trying to explain it in Jira comments. It took a great deal
of time to discuss everything with Mike over IRC, and here it'll take ages.
The proper route is to take a handful of dirt and sticks and slap together some
working code to illustrate my point. And that's what I'm gonna do.
> Some improvements to CMS
> ------------------------
>
> Key: LUCENE-2755
> URL: https://issues.apache.org/jira/browse/LUCENE-2755
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Shai Erera
> Assignee: Shai Erera
> Priority: Minor
> Fix For: 3.1, 4.0
>
>
> While running optimize on a large index, I've noticed several things that got
> me to read CMS code more carefully, and find these issues:
> * CMS may hold onto a merge if maxMergeCount is hit. That results in the
> MergeThreads taking merges from the IndexWriter until they are exhausted, and
> only then will that blocked merge run. I think it's unnecessary for that
> merge to be blocked.
> * CMS sorts merges by segments size, doc-based and not bytes-based. Since the
> default MP is LogByteSizeMP, and I hardly believe people care about doc-based
> size segments anymore, I think we should switch the default impl. There are
> two ways to make it extensible, if we want:
> ** Have an overridable member/method in CMS that you can extend and override
> - easy.
> ** Have OneMerge be comparable and let the MP determine the order (e.g. by
> bytes, docs, calibrate deletes etc.). Better, but will need to tap into
> several places in the code, so more risky and complicated.
> On the go, I'd like to add some documentation to CMS - it's not very easy to
> read and follow.
> I'll work on a patch.