[ 
https://issues.apache.org/jira/browse/KUDU-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265069#comment-17265069
 ] 

Alexey Serbin edited comment on KUDU-1954 at 1/14/21, 6:26 PM:
---------------------------------------------------------------

One related patch to fix lock contention in MM: 
https://github.com/apache/kudu/commit/9e4664d44ca994484d79d970e7c7e929d0dba055

In essence, threads from the maintenance thread pool were blocked by the 
scheduler thread once they were done performing their assigned operations, and 
the scheduler thread could keep them blocked for a long time while choosing the 
best candidate to perform among all registered maintenance operations.
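
To make the pattern concrete, here is a minimal sketch of the blocking 
described above (not the actual Kudu code; the class and method names are 
illustrative): a single lock is shared by the scheduler and the worker threads, 
so a worker that has finished its operation stalls behind an in-progress 
{{FindBestOp()}} call.

{noformat}
// Hypothetical sketch of the contention pattern, not the real Kudu classes.
#include <mutex>
#include <vector>

struct MaintenanceOp {};  // stand-in for a registered maintenance operation

class SchedulerSketch {
 public:
  // Scheduler thread: holds lock_ for the entire candidate selection.
  MaintenanceOp* ScheduleNextOp() {
    std::lock_guard<std::mutex> guard(lock_);
    return FindBestOp();  // expensive scan over all registered operations
  }

  // Worker thread: needs the same lock_ just to report that its operation
  // has finished, so it waits until FindBestOp() completes.
  void MarkOpFinished(MaintenanceOp* op) {
    std::lock_guard<std::mutex> guard(lock_);
    // ... update bookkeeping for 'op' ...
  }

 private:
  MaintenanceOp* FindBestOp() {
    // placeholder for the quadratic candidate-selection work
    return ops_.empty() ? nullptr : ops_.front();
  }

  std::mutex lock_;
  std::vector<MaintenanceOp*> ops_;
};
{noformat}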

I saw a system where the histogram for "flush_mrs_duration" looked like the 
one below (the duration values are in milliseconds as per 
https://github.com/apache/kudu/blob/2880d3a0dc0fc07d20fab79f1d0df3a6a622e398/src/kudu/tablet/tablet_metrics.cc#L294-L299):
 
{noformat}
{                                                                               
    "name": "flush_mrs_duration",                                               
    "total_count": 32,                                                          
    "min": 1548,                                                                
    "mean": 35952.2,                                                            
    "percentile_75": 24576,                                                     
    "percentile_95": 59392,                                                     
    "percentile_99": 59392,                                                     
    "percentile_99_9": 59392,                                                   
    "percentile_99_99": 59392,                                                  
    "max": 402120,                                                              
    "total_sum": 1150469                                                        
}
{noformat}

In other words, it took over 30 seconds on average and more than 6 minutes at 
the maximum to flush an MRS.
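
Spelling out the arithmetic behind that summary:

{noformat}
mean = total_sum / total_count = 1150469 ms / 32 ~ 35952 ms (~ 36 seconds)
max  = 402120 ms ~ 6.7 minutes
{noformat}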

The tail percentile numbers of the "flush_mrs_duration" metric for other 
tablet replicas looked very similar.  Due to the way the duration of MM 
operations was calculated prior to the patch, the duration included the time a 
maintenance thread spent waiting on the lock held by the scheduler while 
running {{FindBestOp()}}.  That explains the almost identical tail 
distributions of the metric values across all tablet replicas.
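
A hypothetical illustration of that measurement (again, illustrative names 
only, not the actual Kudu code): the stopwatch keeps running while the worker 
waits for the scheduler's lock, so a long {{FindBestOp()}} run inflates the 
reported duration in the same way on every replica.

{noformat}
#include <chrono>
#include <mutex>

using Clock = std::chrono::steady_clock;

std::mutex scheduler_lock;  // held by the scheduler while it runs FindBestOp()

void Perform() {}                        // stand-in for the work, e.g. an MRS flush
void MarkFinishedLocked() {}             // bookkeeping done under scheduler_lock
void RecordDuration(Clock::duration) {}  // feeds the flush_mrs_duration histogram

// Pre-patch behavior as described above: the recorded duration also covers
// the wait for scheduler_lock, not just the work itself.
void RunOp() {
  const auto start = Clock::now();
  Perform();
  std::lock_guard<std::mutex> guard(scheduler_lock);  // may block for seconds
  MarkFinishedLocked();
  RecordDuration(Clock::now() - start);               // includes the lock wait
}
{noformat}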

The scheduler thread runs the computation-heavy {{FindBestOp()}} method, whose 
computational complexity is O(n^2) in the number of replicas hosted by a tablet 
server (each tablet replica registers about 8 maintenance operations).  For 
certain types of maintenance operations {{FindBestOp()}} calls 
{{BudgetedCompactionPolicy::RunApproximation()}}, whose computational 
complexity is O(n^2) in the number of rowsets (it operates on the rowsets' min 
and max keys).  And as already mentioned, the scheduler starts running 
{{FindBestOp()}} only when it finds that there is at least one free maintenance 
thread.  Because of this and due to the O(n^2) complexity, scheduling can 
spiral downward when a tablet server hosts too many replicas.  With that, a 
tablet server can fall behind the rate of data ingestion simply because it 
takes too long to schedule the next maintenance operation, not because it 
takes a long time to perform the operations themselves.
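
As a purely illustrative back-of-envelope (the replica counts below are 
assumptions, not measurements from the system above), the quadratic cost is 
what makes this degrade so sharply with scale:

{noformat}
 100 replicas hosted:  cost of one scheduling decision ~  100^2 =   10000 units
1000 replicas hosted:  cost of one scheduling decision ~ 1000^2 = 1000000 units
{noformat}

That is, 10x more replicas makes each pass of {{FindBestOp()}} roughly 100x 
more expensive, while the number of maintenance operations that need to be 
scheduled also grows with the amount of ingested data.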


was (Author: aserbin):
One related patch to fix lock contention in MM: 
https://github.com/apache/kudu/commit/9e4664d44ca994484d79d970e7c7e929d0dba055

> Improve maintenance manager behavior in heavy write workload
> ------------------------------------------------------------
>
>                 Key: KUDU-1954
>                 URL: https://issues.apache.org/jira/browse/KUDU-1954
>             Project: Kudu
>          Issue Type: Improvement
>          Components: compaction, perf, tserver
>    Affects Versions: 1.3.0
>            Reporter: Todd Lipcon
>            Priority: Major
>              Labels: performance, roadmap-candidate, scalability
>         Attachments: mm-trace.png
>
>
> During the investigation in [this 
> doc|https://docs.google.com/document/d/1U1IXS1XD2erZyq8_qG81A1gZaCeHcq2i0unea_eEf5c/edit]
>  I found a few maintenance-manager-related issues during heavy writes:
> - we don't schedule flushes until we are already in the "backpressure" realm, 
> so we spend most of our time doing backpressure
> - even if we configure N maintenance threads, we typically are only using 
> ~50% of those threads due to the scheduling granularity
> - when we do hit the "memory-pressure flush" threshold, all threads quickly 
> switch to flushing, which then brings us far beneath the threshold
> - long running compactions can temporarily starve flushes
> - high volume of writes can starve compactions



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
