[
https://issues.apache.org/jira/browse/LUCENE-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327139#comment-14327139
]
Dawid Weiss commented on LUCENE-6119:
-------------------------------------
I realize this issue is closed, but it'd be sweet to have this kind of adaptive
heuristic to throttle the number of merging *threads* as well. Let me explain.
When you think of it, the measure of quality that really matters, at least at
first glance, is the I/O throughput of merges combined with the I/O throughput
of IW additions (indexing). Essentially we want to maximize a function:
{code}
f = merge_throughput + indexing_throughput
{code}
perhaps with a bias towards indexing_throughput, which could be modeled (by
multiplying it by a constant?). The underlying variables to adaptively tune are:
- how many merge threads there are (for example, having too many doesn't make
sense on a spindle, whereas with an SSD this is not a problem),
- when to pause/resume existing merge threads,
- when to pause/resume indexing threads.
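To make the biased objective concrete, here is a minimal sketch. The class
name, the bias parameter, and the bytes-per-second units are assumptions made
for illustration; none of this is an existing Lucene API.

```java
/**
 * Hypothetical sketch of the objective f with a constant bias towards
 * indexing throughput. Not part of Lucene; all names are illustrative.
 */
final class ThroughputObjective {
    private final double indexingBias;

    ThroughputObjective(double indexingBias) {
        if (indexingBias <= 0) {
            throw new IllegalArgumentException("indexingBias must be > 0");
        }
        this.indexingBias = indexingBias;
    }

    /** f = merge_throughput + indexingBias * indexing_throughput. */
    double value(double mergeBytesPerSec, double indexingBytesPerSec) {
        return mergeBytesPerSec + indexingBias * indexingBytesPerSec;
    }

    /** Finite-difference estimate of the gradient of f between two samples. */
    static double gradient(double previousF, double currentF, double elapsedSeconds) {
        return (currentF - previousF) / elapsedSeconds;
    }
}
```

A bias of 1.0 gives the unweighted sum from above; a larger value makes a drop
in indexing throughput cost more than an equal drop in merge throughput.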
What's interesting is that we can tweak these variables in response to the
current value (and gradient) of function f. This means an adaptive algorithm
could, for example:
- react to temporary external system load (for example pausing some merge
threads if it observes a drop in throughput),
- find out the sweet spot of how many merge threads there can be without
saturating I/O (no need to detect SSD vs. spindle; we just want to maximize f
-- the optimal number of merge threads would emerge by itself from looking at
the data).
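As one illustration of the simple end of the spectrum, the sweet-spot search
could be a plain hill climb over the number of merge threads. This is only a
sketch under stated assumptions: the class and method names are hypothetical,
f is assumed to be sampled once per fixed interval, and noise handling is
ignored.

```java
/**
 * Hypothetical hill-climbing tuner for the merge thread count.
 * It keeps stepping in one direction while f does not drop and turns
 * around when f drops. Not part of Lucene; all names are illustrative.
 */
final class MergeThreadTuner {
    private final int minThreads;
    private final int maxThreads;
    private int threads;
    private int direction = 1;          // +1 = add a thread, -1 = remove one
    private double lastF = Double.NaN;  // no previous sample yet

    MergeThreadTuner(int minThreads, int maxThreads, int initialThreads) {
        if (minThreads < 1 || maxThreads < minThreads
                || initialThreads < minThreads || initialThreads > maxThreads) {
            throw new IllegalArgumentException("invalid thread bounds");
        }
        this.minThreads = minThreads;
        this.maxThreads = maxThreads;
        this.threads = initialThreads;
    }

    /** Feeds the latest sample of f; returns the thread count to try next. */
    int onSample(double f) {
        if (!Double.isNaN(lastF) && f < lastF) {
            direction = -direction;     // the last step hurt throughput
        }
        lastF = f;
        threads = Math.max(minThreads, Math.min(maxThreads, threads + direction));
        return threads;
    }
}
```

With a single-peaked f this settles into a small oscillation around the peak,
and a temporary external load that lowers f shows up as a reversal. In Lucene
the returned value could presumably be applied through
ConcurrentMergeScheduler.setMaxMergesAndThreads, keeping the thread count at
or below the merge count as that method requires.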
Now the big question is what this algorithm should look like, of course. The
options vary from relatively simple hand-written rule-based heuristics to an
advanced black-box with either pre-trained or adaptive machine learning
algorithms.
I have an application that cares about just one of the terms of function f,
merge throughput (we need to quickly merge a large set of segments, optimally
without knowing or caring what the underlying disk hardware/disk buffers are).
I'll report my impressions once I have it done.
> Add auto-io-throttle to ConcurrentMergeScheduler
> ------------------------------------------------
>
> Key: LUCENE-6119
> URL: https://issues.apache.org/jira/browse/LUCENE-6119
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 5.0, Trunk
>
> Attachments: LUCENE-6119.patch, LUCENE-6119.patch, LUCENE-6119.patch,
> LUCENE-6119.patch, LUCENE-6119.patch, LUCENE-6119.patch
>
>
> This method returns number of "incoming" bytes IW has written since it
> was opened, excluding merging.
> It tracks flushed segments, new commits (segments_N), incoming
> files/segments by addIndexes, newly written live docs / doc values
> updates files.
> It's an easy statistic for IW to track and should be useful to help
> applications more intelligently set defaults for IO throttling
> (RateLimiter).
> For example, an application that does hardly any indexing but finally
> triggered a large merge can afford to heavily throttle that large
> merge so it won't interfere with ongoing searches.
> But an application that's causing IW to write new bytes at 50 MB/sec
> must set a correspondingly higher IO throttling otherwise merges will
> clearly fall behind.