[ 
https://issues.apache.org/jira/browse/LUCENE-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327139#comment-14327139
 ] 

Dawid Weiss commented on LUCENE-6119:
-------------------------------------

I realize this issue is closed but it'd be sweet to have this kind of adaptive 
heuristic to throttle the number of merging *threads* as well. Let me explain.

When you think of it, the measurement of quality that really matters is the 
I/O throughput of merges combined with the I/O throughput of IW additions 
(indexing). Essentially, we want to maximize a function:
{code}
f = merge_throughput + indexing_throughput
{code}
perhaps with a bias towards indexing_throughput, which can be modeled (by 
multiplying it by a constant?). The underlying variables to adaptively tune are:

- how many merge threads there are (for example, having too many doesn't make 
sense on a spindle, whereas with an SSD this is not a problem),
- when to pause/resume existing merge threads,
- when to pause/resume indexing threads.
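
As a rough illustration of the biased form of f, here is a minimal Java sketch. The class name, the method names and the idea of sampling two byte counters over a time window are all assumptions made for this example; nothing here is existing Lucene API.

```java
/** Hypothetical helper for illustration only; not part of Lucene. */
final class ThroughputObjective {
  /** Constant bias applied to indexing throughput; 1.0 means no bias. */
  private final double indexingBias;

  ThroughputObjective(double indexingBias) {
    if (!(indexingBias > 0)) {
      throw new IllegalArgumentException("indexingBias must be > 0");
    }
    this.indexingBias = indexingBias;
  }

  /** f = merge_throughput + indexingBias * indexing_throughput. */
  double value(double mergeThroughput, double indexingThroughput) {
    return mergeThroughput + indexingBias * indexingThroughput;
  }

  /**
   * Throughput in bytes/sec over one sampling window, computed from two
   * readings of a monotonically increasing byte counter.
   */
  static double throughput(long bytesBefore, long bytesAfter, long elapsedNanos) {
    if (elapsedNanos <= 0) {
      throw new IllegalArgumentException("elapsedNanos must be > 0");
    }
    return (bytesAfter - bytesBefore) * 1e9 / elapsedNanos;
  }
}
```

With a bias of 2.0, one byte of indexing counts as much as two bytes of merging, so the tuner would give up merge bandwidth before it gives up indexing bandwidth.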

What's interesting is that we can tweak these variables in response to the 
current value (and gradient) of function f. This means an adaptive algorithm 
could, for example:

- react to temporary external system load (for example, pausing some merge 
threads if it observes a drop in throughput),

- find out the sweet spot of how many merge threads there can be without 
saturating I/O (no need to detect SSD vs. spindle; we just want to maximize f 
-- the optimal number of merge threads would emerge by itself from looking at 
the data).
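
One of the simplest rule-based options is plain hill climbing on the merge thread count: keep stepping in the same direction while f improves, and reverse when it drops. The sketch below is only an illustration under that assumption; the class and its methods are hypothetical, and it ignores measurement noise, which a real implementation would need to smooth over.

```java
/** Hypothetical hill-climbing tuner for illustration only; not part of Lucene. */
final class MergeThreadHillClimber {
  private final int minThreads;
  private final int maxThreads;
  private int threads;
  private int direction = +1;            // last step taken: +1 or -1
  private double previousF = Double.NaN; // f observed in the previous window

  MergeThreadHillClimber(int minThreads, int maxThreads, int initialThreads) {
    if (minThreads < 1 || maxThreads < minThreads
        || initialThreads < minThreads || initialThreads > maxThreads) {
      throw new IllegalArgumentException("invalid thread bounds");
    }
    this.minThreads = minThreads;
    this.maxThreads = maxThreads;
    this.threads = initialThreads;
  }

  /**
   * Takes f as measured with the current thread count and returns the
   * thread count to use for the next measurement window.
   */
  int update(double measuredF) {
    if (!Double.isNaN(previousF) && measuredF < previousF) {
      direction = -direction; // the last step made things worse: reverse
    }
    previousF = measuredF;
    int next = Math.max(minThreads, Math.min(maxThreads, threads + direction));
    if (next == threads) {
      direction = -direction; // pinned at a bound: probe the other way next time
    }
    threads = next;
    return threads;
  }

  int threads() {
    return threads;
  }
}
```

Such a tuner never needs to know whether the disk is an SSD or a spindle; it settles into oscillating by one thread around whatever count currently maximizes f, and drifts away again if external load shifts the optimum. The returned value could then be applied through ConcurrentMergeScheduler.setMaxMergesAndThreads.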

Now the big question is what this algorithm should look like, of course. The 
options vary from relatively simple hand-written rule-based heuristics to an 
advanced black-box with either pre-trained or adaptive machine learning 
algorithms. 

I have an application that cares about just one term of function f: we 
need to quickly merge a large set of segments, ideally without knowing or 
caring what the underlying disk hardware or disk buffers are. I'll report my 
impressions once I have it done.

> Add auto-io-throttle to ConcurrentMergeScheduler
> ------------------------------------------------
>
>                 Key: LUCENE-6119
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6119
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.0, Trunk
>
>         Attachments: LUCENE-6119.patch, LUCENE-6119.patch, LUCENE-6119.patch, 
> LUCENE-6119.patch, LUCENE-6119.patch, LUCENE-6119.patch
>
>
> This method returns number of "incoming" bytes IW has written since it
> was opened, excluding merging.
> It tracks flushed segments, new commits (segments_N), incoming
> files/segments by addIndexes, newly written live docs / doc values
> updates files.
> It's an easy statistic for IW to track and should be useful to help
> applications more intelligently set defaults for IO throttling
> (RateLimiter).
> For example, an application that does hardly any indexing but finally
> triggered a large merge can afford to heavily throttle that large
> merge so it won't interfere with ongoing searches.
> But an application that's causing IW to write new bytes at 50 MB/sec
> must set a correspondingly higher IO throttling otherwise merges will
> clearly fall behind.


