[ https://issues.apache.org/jira/browse/KUDU-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17917000#comment-17917000 ]
ASF subversion and git services commented on KUDU-3641:
-------------------------------------------------------

Commit 0c47a46e41235020337984a6053d3b7e3964092b in kudu's branch refs/heads/branch-1.18.x from Alexey Serbin
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=0c47a46e4 ]

KUDU-3641 fix flaky TestNewLeaderCantResolvePeers

I noticed that RaftConsensusElectionITest.TestNewLeaderCantResolvePeers
scenario was failing from time to time in pre-commit tests, and the same
issue was also exposed by the flaky tests dashboard [1].

The scenario would usually succeed because in most cases the system
catalog was able to establish a tablet replica at the newly added tablet
server even before LeaderStepDown() had been called. Since the UUIDs
of the new and the old leader were the same for the LeaderStepDown()
invocation, the implementation was using the short-circuited path
(i.e. doing nothing) instead of starting an actual election round.
The scenario would fail if the tablet replica hadn't yet been placed
at the newly added server by the time of checking for its presence
by ListRunningTabletIds().

The fix is trivial: use StartElection() instead of LeaderStepDown().

To verify that this patch fixes the issue, I ran the following command
against DEBUG bits built with and without the patch at the same machine.
Without the patch, the scenario would fail once in ~150 runs. With the
patch, there hasn't been a single failure.

  ./bin/raft_consensus_election-itest \
    --gtest_filter='*TestNewLeaderCantResolvePeers' \
    --stress_cpu_threads=24 \
    --gtest_repeat=1000

This is a follow-up to f9647149a49ddb87ea0ecf069eab3b5ec0217136.

[1] http://dist-test.cloudera.org:8080/test_drilldown?test_name=raft_consensus_election-itest

Change-Id: I9f724fee15eec74c068ce0aecfd4544f99a46866
Reviewed-on: http://gerrit.cloudera.org:8080/22389
Tested-by: Kudu Jenkins
Reviewed-by: Yifan Zhang <chinazhangyi...@163.com>
(cherry picked from commit 6c77ec8752dce6c8253c980c71a25859a3b63f67)
Reviewed-on: http://gerrit.cloudera.org:8080/22390
Tested-by: Alexey Serbin <ale...@apache.org>


> RaftConsensusElectionITest.TestNewLeaderCantResolvePeers scenario fails from
> time to time
> -----------------------------------------------------------------------------------------
>
>                 Key: KUDU-3641
>                 URL: https://issues.apache.org/jira/browse/KUDU-3641
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus, test
>   Affects Versions: 1.17.0, 1.17.1
>            Reporter: Alexey Serbin
>            Assignee: Alexey Serbin
>            Priority: Major
>         Attachments: raft_consensus_election-itest.log.xz
>
>
> The {{RaftConsensusElectionITest.TestNewLeaderCantResolvePeers}} scenario of
> {{raft_consensus_election-itest}} fails spuriously in DEBUG and ASAN builds
> at least with errors like below:
> {noformat}
> src/kudu/integration-tests/raft_consensus_election-itest.cc:291: Failure
> Value of: tablets.empty()
>   Actual: true
> Expected: false
> src/kudu/util/test_util.cc:401: Failure
> Failed
> Timed out waiting for assertion to pass.
> {noformat}
> The log is attached.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)