[jira] [Commented] (KUDU-3641) RaftConsensusElectionITest.TestNewLeaderCantResolvePeers scenario fails from time to time

ASF subversion and git services (Jira) Sat, 25 Jan 2025 18:40:53 -0800


    [ 
https://issues.apache.org/jira/browse/KUDU-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17917000#comment-17917000
 ]


ASF subversion and git services commented on KUDU-3641:
-------------------------------------------------------

Commit 0c47a46e41235020337984a6053d3b7e3964092b in kudu's branch 
refs/heads/branch-1.18.x from Alexey Serbin
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=0c47a46e4 ]

KUDU-3641 fix flaky TestNewLeaderCantResolvePeers

I noticed that the RaftConsensusElectionITest.TestNewLeaderCantResolvePeers
scenario was failing from time to time in pre-commit tests, and the same
issue was also exposed by the flaky tests dashboard [1].

The scenario would usually succeed because in most cases the system
catalog was able to establish a tablet replica at the newly added tablet
server even before LeaderStepDown() had been called.  Since the UUIDs
of the new and the old leader were the same for the LeaderStepDown()
invocation, the implementation was using the short-circuited path
(i.e. doing nothing) instead of starting an actual election round.
The scenario would fail if the tablet replica hadn't yet been placed
at the newly added server by the time ListRunningTabletIds() was called
to check for its presence.
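
The short-circuit described above can be illustrated with a toy model.
This is only a sketch: the class and method names are hypothetical
stand-ins and not Kudu's actual API, and a real election can fail or be
contested, which this model ignores.

```python
class ToyReplicaGroup:
    """Hypothetical stand-in for a Raft replica group; not Kudu code."""

    def __init__(self, leader_uuid):
        self.leader_uuid = leader_uuid
        self.term = 1

    def leader_step_down(self, new_leader_uuid):
        # Short-circuit: the requested leader already leads, so do nothing.
        if new_leader_uuid == self.leader_uuid:
            return False
        self.leader_uuid = new_leader_uuid
        self.term += 1
        return True

    def start_election(self, candidate_uuid):
        # Always runs a round, even if the candidate is the current leader.
        self.leader_uuid = candidate_uuid
        self.term += 1
        return True


group = ToyReplicaGroup(leader_uuid="ts-0")
stepped_down = group.leader_step_down("ts-0")  # same UUID: no election
term_after_step_down = group.term
group.start_election("ts-0")                   # forces a new term
term_after_election = group.term
```

In this model the step-down request leaves the term untouched, while the
explicit election always advances it, which is the behavior the test
scenario needs.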

The fix is trivial: use StartElection() instead of LeaderStepDown().

To verify that this patch fixes the issue, I ran the following command
against DEBUG bits built with and without the patch on the same machine.
Without the patch, the scenario would fail once in ~150 runs.
With the patch, there hasn't been a single failure.

  ./bin/raft_consensus_election-itest \
    --gtest_filter='*TestNewLeaderCantResolvePeers' \
    --stress_cpu_threads=24 \
    --gtest_repeat=1000
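
As a rough back-of-the-envelope check of how convincing 1000 clean runs
are, the sketch below computes the chance of seeing no failure at all if
the bug were still present. It assumes independent runs and a per-run
failure probability of exactly 1/150; both are simplifying assumptions,
since the observed rate is only approximate.

```python
# Assumed per-run failure probability without the fix (from "~150 runs").
p_fail = 1.0 / 150
# Number of repetitions requested via --gtest_repeat.
runs = 1000

# Probability that all runs pass if the failure rate were unchanged.
p_no_failure = (1.0 - p_fail) ** runs

print(f"P(no failure in {runs} runs) = {p_no_failure:.5f}")
```

Under these assumptions the probability comes out on the order of 0.1%,
so a clean 1000-run series is unlikely to be a coincidence.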

This is a follow-up to f9647149a49ddb87ea0ecf069eab3b5ec0217136.

[1] 
http://dist-test.cloudera.org:8080/test_drilldown?test_name=raft_consensus_election-itest

Change-Id: I9f724fee15eec74c068ce0aecfd4544f99a46866
Reviewed-on: http://gerrit.cloudera.org:8080/22389
Tested-by: Kudu Jenkins
Reviewed-by: Yifan Zhang <chinazhangyi...@163.com>
(cherry picked from commit 6c77ec8752dce6c8253c980c71a25859a3b63f67)
Reviewed-on: http://gerrit.cloudera.org:8080/22390
Tested-by: Alexey Serbin <ale...@apache.org>


> RaftConsensusElectionITest.TestNewLeaderCantResolvePeers scenario fails from 
> time to time
> -----------------------------------------------------------------------------------------
>
>                 Key: KUDU-3641
>                 URL: https://issues.apache.org/jira/browse/KUDU-3641
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus, test
>    Affects Versions: 1.17.0, 1.17.1
>            Reporter: Alexey Serbin
>            Assignee: Alexey Serbin
>            Priority: Major
>         Attachments: raft_consensus_election-itest.log.xz
>
>
> The {{RaftConsensusElectionITest.TestNewLeaderCantResolvePeers}} scenario of 
> {{raft_consensus_election-itest}} fails spuriously, at least in DEBUG and ASAN 
> builds, with errors like the one below:
> {noformat}
> src/kudu/integration-tests/raft_consensus_election-itest.cc:291: Failure
> Value of: tablets.empty()
>   Actual: true
> Expected: false
> src/kudu/util/test_util.cc:401: Failure
> Failed
> Timed out waiting for assertion to pass.
> {noformat}
> The log is attached.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
