[ https://issues.apache.org/jira/browse/KUDU-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304449#comment-17304449 ]
Bankim Bhavsar commented on KUDU-3266:
--------------------------------------

Examined another failure in ParameterizedRecoverMasterTest.TestRecoverDeadMasterSysCatalogCopy/1, where GetParam() = 3.

6fb is the leader that gets paused, and cba becomes the leader. cd7 is the follower.
{noformat}
I0317 09:18:20.449790 21057 raft_consensus.cc:479] T 00000000000000000000000000000000 P cba6c57c99f44acfabfb650e6cb94d06 [term 2 FOLLOWER]: Starting leader election (detected failure of leader 6fb6e93836bb45ae882edd7e0d26c852)
I0317 09:18:20.449846 21057 raft_consensus.cc:3032] T 00000000000000000000000000000000 P cba6c57c99f44acfabfb650e6cb94d06 [term 2 FOLLOWER]: Advancing to term 3
I0317 09:18:20.459255 21057 raft_consensus.cc:683] T 00000000000000000000000000000000 P cba6c57c99f44acfabfb650e6cb94d06 [term 3 LEADER]: Becoming Leader. State: Replica: cba6c57c99f44acfabfb650e6cb94d06, State: Running, Role: LEADER
I0317 09:18:20.841820 21058 sys_catalog.cc:434] T 00000000000000000000000000000000 P cd7cbe8654e7426ca818c1c667cef824 [sys.catalog]: SysCatalogTable state changed. Reason: New leader cba6c57c99f44acfabfb650e6cb94d06. Latest consensus state: current_term: 3 leader_uuid: "cba6c57c99f44acfabfb650e6cb94d06" committed_config { opid_index: 2860 OBSOLETE_local: false peers { permanent_uuid: "cba6c57c99f44acfabfb650e6cb94d06" member_type: VOTER last_known_addr { host: "127.0.92.125" port: 42749 } } peers { permanent_uuid: "cd7cbe8654e7426ca818c1c667cef824" member_type: VOTER last_known_addr { host: "127.0.92.124" port: 35709 } } peers { permanent_uuid: "6fb6e93836bb45ae882edd7e0d26c852" member_type: VOTER last_known_addr { host: "127.0.92.126" port: 41791 } attrs { promote: false } } }
I0317 09:18:20.842010 21058 sys_catalog.cc:437] T 00000000000000000000000000000000 P cd7cbe8654e7426ca818c1c667cef824 [sys.catalog]: This master's current role is: FOLLOWER
{noformat}

The table creation request below could have been replicated to leader cba and follower cd7, but not to 6fb.
{noformat}
I0317 09:18:22.128876 17932 catalog_manager.cc:1617] Servicing CreateTable request from {username='slave'} at 127.0.0.1:47764: name: "table-0"
{noformat}

It looks like the previous leader 6fb is up and leader cba is paused.
{noformat}
I0317 09:18:22.342823 18013 raft_consensus.cc:1223] T 00000000000000000000000000000000 P cd7cbe8654e7426ca818c1c667cef824 [term 3 FOLLOWER]: Rejecting Update request from peer 6fb6e93836bb45ae882edd7e0d26c852 for earlier term 2. Current term is 3. Ops: []
{noformat}

The open table request fails, likely because it went to 6fb, which, between itself and follower cd7, is still the leader from the previous term.
{noformat}
Bad status: Not found: Unable to open table: the table does not exist: table_name: "table-0"
/data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/master/dynamic_multi_master-test.cc:603: Failure
Expected: cv.CheckRowCount(table_name, ClusterVerifier::EXACTLY, 0) doesn't generate new fatal failures in the current thread.
  Actual: it does
{noformat}

Moments later, 6fb steps down as cd7 is paused, and cba (term 3) becomes the leader.
{noformat}
W0317 09:18:22.404738 17916 leader_election.cc:334] T 00000000000000000000000000000000 P cba6c57c99f44acfabfb650e6cb94d06 [CANDIDATE]: Term 3 pre-election: RPC error from VoteRequest() call to peer 6fb6e93836bb45ae882edd7e0d26c852 (127.0.92.126:41791): Timed out: connection negotiation to 127.0.92.126:41791 for RPC RequestConsensusVote timed out after 1.923s (ON_OUTBOUND_QUEUE)
I0317 09:18:22.407025 17942 raft_consensus.cc:1223] T 00000000000000000000000000000000 P cba6c57c99f44acfabfb650e6cb94d06 [term 3 LEADER]: Rejecting Update request from peer 6fb6e93836bb45ae882edd7e0d26c852 for earlier term 2. Current term is 3. Ops: []
I0317 09:18:22.411103 20992 consensus_queue.cc:1038] T 00000000000000000000000000000000 P 6fb6e93836bb45ae882edd7e0d26c852 [LEADER]: Peer responded invalid term: Peer: permanent_uuid: "cd7cbe8654e7426ca818c1c667cef824" member_type: VOTER last_known_addr { host: "127.0.92.124" port: 35709 }, Status: INVALID_TERM, Last received: 2.3568, Next index: 3569, Last known committed idx: 3572, Time since last communication: 0.000s
I0317 09:18:22.411545 21016 raft_consensus.cc:3027] T 00000000000000000000000000000000 P 6fb6e93836bb45ae882edd7e0d26c852 [term 2 LEADER]: Stepping down as leader of term 2
I0317 09:18:22.411592 21016 raft_consensus.cc:726] T 00000000000000000000000000000000 P 6fb6e93836bb45ae882edd7e0d26c852 [term 2 LEADER]: Becoming Follower/Learner. State: Replica: 6fb6e93836bb45ae882edd7e0d26c852, State: Running, Role: LEADER
{noformat}

> Flakiness in dynamic_multi_master_test in VerifyClusterAfterMasterAddition()
> function
> -------------------------------------------------------------------------------------
>
> Key: KUDU-3266
> URL: https://issues.apache.org/jira/browse/KUDU-3266
> Project: Kudu
> Issue Type: Test
> Components: master, test
> Affects Versions: 1.15.0
> Reporter: Bankim Bhavsar
> Assignee: Bankim Bhavsar
> Priority: Major
>
> {noformat}
> ParameterizedRecoverMasterTest.TestRecoverDeadMasterSysCatalogCopy/1:
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/integration-tests/cluster_verifier.cc:119:
> Failure
> Failed
> Bad status: Not found: Unable to open table: the table does not exist:
> table_name: "table-1"
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/master/dynamic_multi_master-test.cc:603:
> Failure
> Expected: cv.CheckRowCount(table_name, ClusterVerifier::EXACTLY, 0) doesn't
> generate new fatal failures in the current thread.
> Actual: it does.
> 2021-03-17T17:04:19Z chronyd exiting
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/master/dynamic_multi_master-test.cc:1099:
> Failure
> Expected: VerifyClusterAfterMasterAddition(master_hps, orig_num_masters_)
> doesn't generate new fatal failures in the current thread.
> Actual: it does.
> {noformat}
> Although the same verification function is used by other tests for add
> master, this flakiness started showing up after introduction of the
> RecoverDeadMaster test.
> https://github.com/apache/kudu/commit/4b4a8c0f2fdfd15524510821b27fc9c3b5d26b6b

--
This message was sent by Atlassian Jira
(v8.3.4#803005)