[jira] [Commented] (KUDU-3500) Don't start write operations timed out in the tablet's prepare queue
[ https://issues.apache.org/jira/browse/KUDU-3500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758449#comment-17758449 ]

ASF subversion and git services commented on KUDU-3500:
---

Commit 96535fcafe87f5b0466061061c779e8bff97179a in kudu's branch refs/heads/branch-1.17.x from Alexey Serbin
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=96535fcaf ]

KUDU-3500 don't start operations timed out in prepare queue

While troubleshooting a performance issue where the prepare queue for a tablet was very long, I noticed that tablet servers start write operations that correspond to RPCs that have already timed out. Most likely, the client that sent the RPC had already detected the timeout and expected that the write would have failed already.

As a simple optimization, this patch updates the logic of the OpDriver class to respond with TimedOut error status right away when a write operation that has already timed out while waiting in the prepare queue is dispatched to the prepare thread. That helps with clearing the queue and processing not-yet-timed-out requests from the queue faster, increasing the overall robustness of a tablet server when the load is high and the node's CPU and disk IO bandwidth are saturated.

A new tablet metric 'ops_timed_out_in_prepare_queue' is introduced to track the number of WriteRequestPB RPCs timed out in the tablet's prepare queue and responded with TimedOut error status even before starting the PREPARE phase for the corresponding operation.

This patch also adds a new test to cover the new functionality.
Change-Id: I202ce6b5e425439b50c0751d7f7406e69b8e751a
Reviewed-on: http://gerrit.cloudera.org:8080/20300
Tested-by: Kudu Jenkins
Reviewed-by: Abhishek Chennaka
(cherry picked from commit 6c049687f60e90cbdac6f6ec039a528d13664a6b)
Reviewed-on: http://gerrit.cloudera.org:8080/20409
Reviewed-by: Yingchun Lai
Tested-by: Yingchun Lai

> Don't start write operations timed out in the tablet's prepare queue
>
>                 Key: KUDU-3500
>                 URL: https://issues.apache.org/jira/browse/KUDU-3500
>             Project: Kudu
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Alexey Serbin
>            Assignee: Alexey Serbin
>            Priority: Major
>
> While troubleshooting one performance issue where the prepare queue of a
> tablet was very long, I noticed that tablet servers start write operations
> that correspond to RPCs that have already timed out. Most likely, the client
> that sent the RPC had already detected the timeout and expected that the
> write would have failed already, so there isn't much sense to start such
> operations anyway.
> As a simple optimization, tablet servers shouldn't even start the PREPARE
> phase for such operations, but respond with TimedOut error status right away
> when dispatched them to the prepare thread. Doing so would help with
> clearing the prepare queue and processing not-yet-timed-out requests from the
> queue faster, increasing the overall robustness of a tablet server when the
> load is high and the node's CPU and disk IO bandwidth are saturated.
> A new metric should be introduced to track the number of WriteRequestPB RPCs
> timed out in the prepare queue and responded with TimedOut error status
> before starting the PREPARE phase for the corresponding operations.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
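
The early-rejection behavior described in the KUDU-3500 message above can be illustrated with a small model. This is a minimal Python sketch, not Kudu's C++ OpDriver code: the WriteOp and PrepareQueue classes, the drain() method, and the injectable clock are hypothetical names introduced only for illustration. Only the TimedOut status and the 'ops_timed_out_in_prepare_queue' metric name come from the message itself.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List


@dataclass
class WriteOp:
    """Hypothetical stand-in for a write operation waiting in the queue."""
    op_id: int
    deadline: float          # absolute time (seconds) at which the RPC times out
    status: str = "PENDING"  # becomes "OK" or "TimedOut"


@dataclass
class PrepareQueue:
    """Toy single-threaded model of a tablet's prepare queue (an assumption)."""
    now: Callable[[], float]                 # injectable clock, for testability
    ops_timed_out_in_prepare_queue: int = 0  # counter named after the new metric
    _queue: Deque[WriteOp] = field(default_factory=deque)

    def submit(self, op: WriteOp) -> None:
        self._queue.append(op)

    def drain(self, prepare: Callable[[WriteOp], None]) -> List[WriteOp]:
        """Dispatch queued ops in FIFO order; return the ops that were started."""
        started: List[WriteOp] = []
        while self._queue:
            op = self._queue.popleft()
            if self.now() >= op.deadline:
                # Already timed out while waiting: respond right away and
                # skip the PREPARE phase entirely.
                op.status = "TimedOut"
                self.ops_timed_out_in_prepare_queue += 1
                continue
            prepare(op)  # the (potentially expensive) PREPARE phase
            op.status = "OK"
            started.append(op)
        return started
```

In this model, an operation whose deadline has passed costs only a queue pop and a counter increment, so operations that can still succeed reach the PREPARE phase sooner. A caller would construct PrepareQueue(now=time.monotonic), submit operations with absolute deadlines, and pass its own prepare callback to drain().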
[jira] [Created] (KUDU-3505) kudu ksck fails if healthy master in healthy cluster is started after the command
Bakai Ádám created KUDU-3505:

             Summary: kudu ksck fails if healthy master in healthy cluster is started after the command
                 Key: KUDU-3505
                 URL: https://issues.apache.org/jira/browse/KUDU-3505
             Project: Kudu
          Issue Type: Bug
          Components: master
            Reporter: Bakai Ádám
         Environment: single master configuration

If the master is not running and the user starts a kudu cluster ksck command then it will try to connect to the master over and over again. Once the master is started, the ksck command is executed and it shows a bunch of errors:
{code:java}
adambakai@abakai-MBP16 d % kudu cluster ksck localhost:8764 -ksck_format plain_full
Master Summary
 UUID                             | Address        | Status
----------------------------------+----------------+---------
 f41052a1ba8242d49ee5e16c0d60558a | localhost:8764 | HEALTHY

All reported replicas are:
  A = f41052a1ba8242d49ee5e16c0d60558a
The consensus matrix is:
 Config source | Replicas | Current term | Config index | Committed?
---------------+----------+--------------+--------------+------------
 A             | A*       | 16           | -1           | Yes

Flags of checked categories for Master:
 Flag                | Value                                                       | Master
---------------------+-------------------------------------------------------------+-------------------------
 builtin_ntp_servers | 0.pool.ntp.org,1.pool.ntp.org,2.pool.ntp.org,3.pool.ntp.org | all 1 server(s) checked
 time_source         | system_unsync                                               | all 1 server(s) checked

Tablet Server Summary
 UUID                             | Address        | Status  | Location | Tablet Leaders | Active Scanners
----------------------------------+----------------+---------+----------+----------------+-----------------
 1938796538bf483f9bcd133e29aa645b | 127.0.0.1:9878 | HEALTHY |          | 0              | 0
 8080a72aeb714c5087b8c515f21b1735 | 127.0.0.1:9870 | HEALTHY |          | 1              | 0
 9f86252d00814cb3ae0ef6858ee31a02 | 127.0.0.1:9874 | HEALTHY |          | 0              | 0
 c23de9c2b3e1448fa8dde2bb1a292388 | 127.0.0.1:9872 | HEALTHY |          | 0              | 0
 fb700997c9274a9d8287eb3c765606d2 | 127.0.0.1:9876 | HEALTHY |          | 0              | 0

Tablet Server Location Summary
 Location | Count
----------+-------
          | 5

Flags of checked categories for Tablet Server:
 Flag                | Value                                                       | Tablet Server
---------------------+-------------------------------------------------------------+-------------------------
 builtin_ntp_servers | 0.pool.ntp.org,1.pool.ntp.org,2.pool.ntp.org,3.pool.ntp.org | all 5 server(s) checked
 time_source         | system_unsync                                               | all 5 server(s) checked

Version Summary
 Version         | Servers
-----------------+-------------------------
 1.18.0-SNAPSHOT | all 6 server(s) checked

Tablet Summary
Tablet 5d87f015c3a2438c8cec6e84796f9ecb of table 'db.test_table' is healthy.
  8080a72aeb714c5087b8c515f21b1735 (127.0.0.1:9870): RUNNING [LEADER]
  c23de9c2b3e1448fa8dde2bb1a292388 (127.0.0.1:9872): RUNNING
  9f86252d00814cb3ae0ef6858ee31a02 (127.0.0.1:9874): RUNNING
All reported replicas are:
  A = 8080a72aeb714c5087b8c515f21b1735
  B = c23de9c2b3e1448fa8dde2bb1a292388
  C = 9f86252d00814cb3ae0ef6858ee31a02
The consensus matrix is:
 Config source | Replicas | Current term | Config index | Committed?
---------------+----------+--------------+--------------+------------
 master        | A* B C   |              |              | Yes
 A             | A* B C   | 3            | -1           | Yes
 B             | A* B C   | 3            | -1           | Yes
 C             | A* B C   | 3            | -1           | Yes

The cluster doesn't have any matching system tables
Summary by table
 Name          | RF | Status  | Total Tablets | Healthy | Recovering | Under-replicated | Unavailable
---------------+----+---------+---------------+---------+------------+------------------+-------------
 db.test_table | 3  | HEALTHY | 1             | 1       | 0          | 0                | 0

Tablet Replica Count Summary
 Statistic      | Replica Count
----------------+---------------
 Minimum        | 0
 First Quartile | 0
 Median         | 1
 Third Quartile | 1
 Maximum        | 1

Tablet Replica Count by Tablet Server
 UUID                             | Host           | Replica Count
----------------------------------+----------------+---------------
 1938796538bf483f9bcd133e29aa645b | 127.0.0.1:9878 | 0
 8080a72aeb714c5087b8c515f21b1735 | 127.0.0.1:9870 | 1
 9f86252d00814cb3ae0ef6858ee31a02 | 127.0.0.1:9874 | 1
 c23de9c2b3e1448fa8dde2bb1a292388 | 127.0.0.1:9872 | 1
 fb700997c9274a9d8287eb3c765606d2 | 127.0.0.1:9876 | 0