[ https://issues.apache.org/jira/browse/KUDU-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265069#comment-17265069 ]
Alexey Serbin edited comment on KUDU-1954 at 1/14/21, 6:26 PM:
---------------------------------------------------------------
One related patch to fix lock contention in the MM: https://github.com/apache/kudu/commit/9e4664d44ca994484d79d970e7c7e929d0dba055

In essence, threads from the maintenance thread pool were blocked by the scheduler thread once they had finished their assigned operations, and the scheduler thread could keep them blocked for a long time while choosing the best candidate among all registered maintenance operations. I saw a system where the histogram for "flush_mrs_duration" looked like the one below (the duration values are in milliseconds as per https://github.com/apache/kudu/blob/2880d3a0dc0fc07d20fab79f1d0df3a6a622e398/src/kudu/tablet/tablet_metrics.cc#L294-L299):
{noformat}
{
    "name": "flush_mrs_duration",
    "total_count": 32,
    "min": 1548,
    "mean": 35952.2,
    "percentile_75": 24576,
    "percentile_95": 59392,
    "percentile_99": 59392,
    "percentile_99_9": 59392,
    "percentile_99_99": 59392,
    "max": 402120,
    "total_sum": 1150469
}
{noformat}
In other words, over 30 seconds on average and more than 6 minutes at maximum to flush an MRS. The tail percentiles of the "flush_mrs_duration" metric for other tablet replicas looked very similar. Due to the way the duration of MM operations was calculated prior to the patch, it included the time a maintenance thread spent waiting on the lock held by the scheduler while running {{FindBestOp()}}. That explains the almost identical tail distribution of the metric values across all tablet replicas.

The scheduler thread runs the computation-heavy {{FindBestOp()}} method, whose complexity is O(n^2) in the number of replicas hosted by the tablet server (each tablet replica registers about 8 maintenance operations). For certain types of maintenance operations, {{FindBestOp()}} calls {{BudgetedCompactionPolicy::RunApproximation()}}, which is itself O(n^2) in the number of rowsets ordered by min and max keys. And as already mentioned, the scheduler starts running {{FindBestOp()}} only once it finds at least one free maintenance thread. Because of this and the O(n^2) complexity, scheduling can spiral downward when there are too many replicas per tablet server. As a result, a tablet server can fall behind the rate of data ingestion simply because it takes too long to schedule the next maintenance operation, not because the operations themselves take long to perform.
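To make the contention pattern more concrete, below is a minimal, self-contained C++ sketch. It is not the actual Kudu code: {{FakeOp}}, {{RunOpPrePatch}}, {{RunOpPostPatch}} and the sleep durations are purely illustrative. It only shows how a worker thread's recorded duration can absorb the time the scheduler spends inside {{FindBestOp()}} while holding the manager lock, and how stopping the timer before re-acquiring that lock avoids inflating the metric.
{code:cpp}
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;

struct FakeOp {
  const char* name;
  std::chrono::milliseconds work;  // time the operation itself takes
};

std::mutex manager_lock;  // guards the registry of maintenance operations

// Stand-in for the expensive O(n^2) scan over all registered operations;
// the real FindBestOp() is much more involved.
FakeOp* FindBestOp(std::vector<FakeOp>& ops) {
  std::this_thread::sleep_for(std::chrono::milliseconds(200));  // simulated cost
  return ops.empty() ? nullptr : &ops.front();
}

// Pre-patch behavior: the histogram sample covers both the real work and the
// wait for manager_lock, which the scheduler may hold while running FindBestOp().
void RunOpPrePatch(FakeOp* op) {
  auto start = Clock::now();
  std::this_thread::sleep_for(op->work);          // the actual flush/compaction
  {
    std::lock_guard<std::mutex> l(manager_lock);  // blocks while the scheduler scans
    // ... mark the op complete, update manager bookkeeping ...
  }
  auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
      Clock::now() - start);
  std::printf("%s recorded: %lld ms (work + lock wait)\n", op->name,
              static_cast<long long>(elapsed.count()));
}

// Post-patch idea: stop the timer before touching the contended lock, so the
// histogram reflects only the operation itself.
void RunOpPostPatch(FakeOp* op) {
  auto start = Clock::now();
  std::this_thread::sleep_for(op->work);
  auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
      Clock::now() - start);
  {
    std::lock_guard<std::mutex> l(manager_lock);
    // ... mark the op complete, update manager bookkeeping ...
  }
  std::printf("%s recorded: %lld ms (work only)\n", op->name,
              static_cast<long long>(elapsed.count()));
}

int main() {
  std::vector<FakeOp> ops = {{"flush_mrs", std::chrono::milliseconds(50)}};

  // One round: the scheduler grabs manager_lock for the whole FindBestOp()
  // scan while a worker finishes its op and then tries to report completion.
  auto run_round = [&](void (*run_op)(FakeOp*)) {
    std::thread scheduler([&] {
      std::lock_guard<std::mutex> l(manager_lock);
      FindBestOp(ops);
    });
    std::thread worker([&] { run_op(&ops[0]); });
    scheduler.join();
    worker.join();
  };

  run_round(&RunOpPrePatch);   // ~200 ms: 50 ms of work plus ~150 ms of lock wait
  run_round(&RunOpPostPatch);  // ~50 ms: the work itself
  return 0;
}
{code}
Running the sketch, the pre-patch round reports roughly the op's work time plus the remainder of the scheduler's scan, while the post-patch round reports only the work, which mirrors the difference the patch makes in the reported "flush_mrs_duration" values.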
was (Author: aserbin): One related patch to fix lock contention in MM: https://github.com/apache/kudu/commit/9e4664d44ca994484d79d970e7c7e929d0dba055


> Improve maintenance manager behavior in heavy write workload
> -------------------------------------------------------------
>
>                 Key: KUDU-1954
>                 URL: https://issues.apache.org/jira/browse/KUDU-1954
>             Project: Kudu
>          Issue Type: Improvement
>          Components: compaction, perf, tserver
>    Affects Versions: 1.3.0
>            Reporter: Todd Lipcon
>            Priority: Major
>              Labels: performance, roadmap-candidate, scalability
>         Attachments: mm-trace.png
>
>
> During the investigation in [this doc|https://docs.google.com/document/d/1U1IXS1XD2erZyq8_qG81A1gZaCeHcq2i0unea_eEf5c/edit] I found a few maintenance-manager-related issues during heavy writes:
> - we don't schedule flushes until we are already in the "backpressure" realm, so we spend most of our time doing backpressure
> - even if we configure N maintenance threads, we typically use only ~50% of those threads due to the scheduling granularity
> - when we do hit the "memory-pressure flush" threshold, all threads quickly switch to flushing, which then brings us far beneath the threshold
> - long-running compactions can temporarily starve flushes
> - a high volume of writes can starve compactions



--
This message was sent by Atlassian Jira
(v8.3.4#803005)