[ https://issues.apache.org/jira/browse/KUDU-3195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexey Serbin updated KUDU-3195:
--------------------------------
    Code Review: https://gerrit.cloudera.org/#/c/16581/

> Make DMS flush policy more robust when maintenance threads are idle
> -------------------------------------------------------------------
>
>                 Key: KUDU-3195
>                 URL: https://issues.apache.org/jira/browse/KUDU-3195
>             Project: Kudu
>          Issue Type: Improvement
>          Components: tserver
>    Affects Versions: 1.13.0
>            Reporter: Alexey Serbin
>            Priority: Major
>
> In one scenario I observed very long bootstrap times for tablet servers
> (between 45 and 60 minutes) even though the tablet servers had a relatively
> small amount of data under management (~80 GB). It turned out the time was
> spent replaying WAL segments, with {{kudu cluster ksck}} reporting something
> like the following throughout the bootstrap:
> {noformat}
> b0a20b117a1242ae9fc15620a6f7a524 (tserver-6.local.site:7050): not running
>   State:       BOOTSTRAPPING
>   Data state:  TABLET_DATA_READY
>   Last status: Bootstrap replaying log segment 21/37 (2.28M/7.85M this
> segment, stats: ops{read=27374 overwritten=0 applied=25016 ignored=657}
> inserts{seen=5949247 ignored=0} mutations{seen=0 ignored=0}
> orphaned_commits=7)
> {noformat}
> The workload I ran before shutting down the tablet servers consisted of many
> small UPSERT operations, but the cluster had been idle for a long time (a few
> hours or so) after the workload was terminated. The workload was generated by:
> {noformat}
> kudu perf loadgen \
>   --table_name=$TABLE_NAME \
>   --num_rows_per_thread=800000000 \
>   --num_threads=4 \
>   --use_upsert \
>   --use_random_pk \
>   $MASTER_ADDR
> {noformat}
> The table that the UPSERT workload ran against had been pre-populated by the
> following:
> {noformat}
> kudu perf loadgen \
>   --table_num_replicas=3 \
>   --keep_auto_table \
>   --table_num_hash_partitions=5 \
>   --table_num_range_partitions=5 \
>   --num_rows_per_thread=800000000 \
>   --num_threads=4 \
>   $MASTER_ADDR
> {noformat}
> As it turned out, the tablet servers had accumulated a huge number of DMS that
> required flushing/compaction, but after the memory pressure subsided, the
> flush policy was scheduling just one such operation per tablet every 120
> seconds (the interval is controlled by {{\-\-flush_threshold_secs}}). In fact,
> the tablet servers could have flushed those DMS non-stop, since the
> maintenance threads were otherwise completely idle and there was no active
> workload running against the cluster. Those DMS had been around for a long
> time (much longer than 120 seconds) and were anchoring a lot of WAL segments,
> so the corresponding operations had to be replayed from the WAL once I
> restarted the tablet servers.
> It would be great to update the flush policy to allow tablet servers to run
> {{FlushDeltaMemStoresOp}} as soon as a DMS becomes older than the threshold
> specified by {{\-\-flush_threshold_secs}}, provided the maintenance threads
> are not otherwise busy.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
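To illustrate the direction proposed in the last paragraph of the report, below is a minimal sketch of a time-based scoring function for a DMS flush maintenance op. It is only an illustration of the idea, not Kudu's actual maintenance-manager code; the type and function names are assumptions. The intent is that once a DMS grows older than {{\-\-flush_threshold_secs}}, its flush op gets a non-zero (and growing) score, so idle maintenance threads keep flushing old DMS back to back instead of one per 120-second interval.

{noformat}
// A minimal, hypothetical sketch: a time-based perf-improvement score for a
// DMS flush maintenance op. Names and the scoring shape are illustrative
// assumptions, not Kudu's actual implementation.

#include <algorithm>
#include <cstdint>

// Assumed inputs describing one DMS that is a candidate for flushing.
struct DmsFlushContext {
  int64_t dms_age_secs;          // time since the first update landed in the DMS
  int64_t flush_threshold_secs;  // value of --flush_threshold_secs (120 by default)
};

// Returns a scheduling score: 0.0 means "no time-based urgency yet", and the
// score grows with how far past --flush_threshold_secs the DMS is, so an
// otherwise idle maintenance manager keeps picking these ops back to back
// instead of flushing one DMS per tablet per 120-second interval.
double TimeBasedFlushScore(const DmsFlushContext& ctx) {
  if (ctx.flush_threshold_secs <= 0 ||
      ctx.dms_age_secs < ctx.flush_threshold_secs) {
    return 0.0;  // memory-pressure-driven flushing is handled elsewhere
  }
  const double over_secs =
      static_cast<double>(ctx.dms_age_secs - ctx.flush_threshold_secs);
  // Linear ramp past the threshold, capped so that very old DMS do not starve
  // other maintenance ops that have real work to do.
  return std::min(1.0 + over_secs / ctx.flush_threshold_secs, 100.0);
}
{noformat}

With a ramp along these lines, the cold DMS from the scenario above could have been drained while the cluster sat idle, releasing the anchored WAL segments well before the tablet servers were restarted.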