[ 
https://issues.apache.org/jira/browse/KUDU-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304449#comment-17304449
 ] 

Bankim Bhavsar commented on KUDU-3266:
--------------------------------------

Examined another failure in 
ParameterizedRecoverMasterTest.TestRecoverDeadMasterSysCatalogCopy/1, where 
GetParam() = 3.

6fb is the original leader (term 2); it gets paused and cba is elected leader 
in term 3.
cd7 is the follower.

{noformat}
I0317 09:18:20.449790 21057 raft_consensus.cc:479] T 
00000000000000000000000000000000 P cba6c57c99f44acfabfb650e6cb94d06 [term 2 
FOLLOWER]: Starting leader election (detected failure of leader 
6fb6e93836bb45ae882edd7e0d26c852)
I0317 09:18:20.449846 21057 raft_consensus.cc:3032] T 
00000000000000000000000000000000 P cba6c57c99f44acfabfb650e6cb94d06 [term 2 
FOLLOWER]: Advancing to term 3
I0317 09:18:20.459255 21057 raft_consensus.cc:683] T 
00000000000000000000000000000000 P cba6c57c99f44acfabfb650e6cb94d06 [term 3 
LEADER]: Becoming Leader. State: Replica: cba6c57c99f44acfabfb650e6cb94d06, 
State: Running, Role: LEADER


I0317 09:18:20.841820 21058 sys_catalog.cc:434] T 
00000000000000000000000000000000 P cd7cbe8654e7426ca818c1c667cef824 
[sys.catalog]: SysCatalogTable state changed. Reason: New leader 
cba6c57c99f44acfabfb650e6cb94d06. Latest consensus state: current_term: 3 
leader_uuid: "cba6c57c99f44acfabfb650e6cb94d06" committed_config { opid_index: 
2860 OBSOLETE_local: false peers { permanent_uuid: 
"cba6c57c99f44acfabfb650e6cb94d06" member_type: VOTER last_known_addr { host: 
"127.0.92.125" port: 42749 } } peers { permanent_uuid: 
"cd7cbe8654e7426ca818c1c667cef824" member_type: VOTER last_known_addr { host: 
"127.0.92.124" port: 35709 } } peers { permanent_uuid: 
"6fb6e93836bb45ae882edd7e0d26c852" member_type: VOTER last_known_addr { host: 
"127.0.92.126" port: 41791 } attrs { promote: false } } }

I0317 09:18:20.842010 21058 sys_catalog.cc:437] T 
00000000000000000000000000000000 P cd7cbe8654e7426ca818c1c667cef824 
[sys.catalog]: This master's current role is: FOLLOWER
{noformat}


The table creation request below could have been replicated to leader cba and 
follower cd7, but not to 6fb.
{noformat}
I0317 09:18:22.128876 17932 catalog_manager.cc:1617] Servicing CreateTable 
request from {username='slave'} at 127.0.0.1:47764:
name: "table-0"
{noformat}

It looks like the previous leader 6fb is up again and the current leader cba 
is paused:
{noformat}
I0317 09:18:22.342823 18013 raft_consensus.cc:1223] T 
00000000000000000000000000000000 P cd7cbe8654e7426ca818c1c667cef824 [term 3 
FOLLOWER]: Rejecting Update request from peer 6fb6e93836bb45ae882edd7e0d26c852 
for earlier term 2. Current term is 3. Ops: []
{noformat}
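
The rejection in the log above follows the usual Raft term rule. A minimal 
sketch of that rule, assuming a toy replica type: ToyReplica, UpdateResult and 
HandleUpdate are hypothetical names for illustration, not the code in 
raft_consensus.cc.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Toy model of a Raft replica (hypothetical; not Kudu's RaftConsensus class).
struct ToyReplica {
  std::string uuid;
  int64_t current_term;
};

enum class UpdateResult { kAccepted, kInvalidTerm };

// Reject an Update whose term is older than the replica's own term;
// otherwise adopt the caller's term and accept the request.
UpdateResult HandleUpdate(ToyReplica& replica, int64_t caller_term) {
  if (caller_term < replica.current_term) {
    return UpdateResult::kInvalidTerm;
  }
  replica.current_term = caller_term;
  return UpdateResult::kAccepted;
}

// Mirrors the log line: cd7 is at term 3 while 6fb still sends term 2.
void DemoStaleLeaderRejected() {
  ToyReplica cd7{"cd7", 3};
  assert(HandleUpdate(cd7, 2) == UpdateResult::kInvalidTerm);
  assert(cd7.current_term == 3);
}
```

In this model the stale leader only learns that it is stale once a peer 
answers, which leaves a short window in which it may still act as leader.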

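The open table request fails, likely because it went to 6fb. Of the two 
running masters (6fb and follower cd7), 6fb is the one still acting as leader 
from the previous term, and the table creation was likely never replicated to 
it.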
{noformat}
Bad status: Not found: Unable to open table: the table does not exist: 
table_name: "table-0"        
/data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/master/dynamic_multi_master-test.cc:603:
 Failure
Expected: cv.CheckRowCount(table_name, ClusterVerifier::EXACTLY, 0) doesn't 
generate new fatal failures in the current thread.
  Actual: it does
{noformat}
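
To make the suspected sequence concrete, here is a toy model of the hypothesis 
above. ToyMaster, OpenStatus and OpenTable are hypothetical names, and the 
assumption that a stale leader answers from its local catalog has not been 
verified against the master code.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>

// Toy master (hypothetical): its term, whether it believes it is leader,
// and the table names present in its local copy of the catalog.
struct ToyMaster {
  int64_t term;
  bool believes_leader;
  std::set<std::string> tables;
};

enum class OpenStatus { kOk, kNotFound, kNotLeader };

// Assumption: a master that believes it is leader answers from local state.
OpenStatus OpenTable(const ToyMaster& master, const std::string& name) {
  if (!master.believes_leader) {
    return OpenStatus::kNotLeader;
  }
  return master.tables.count(name) > 0 ? OpenStatus::kOk
                                       : OpenStatus::kNotFound;
}

// 6fb (term 2) never received "table-0", so it reports the table as missing.
void DemoStaleLeaderLookup() {
  ToyMaster stale_6fb{2, true, {}};
  ToyMaster current_cba{3, true, {"table-0"}};
  assert(OpenTable(stale_6fb, "table-0") == OpenStatus::kNotFound);
  assert(OpenTable(current_cba, "table-0") == OpenStatus::kOk);
}
```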

Moments later 6fb steps down as cd7 is paused and cba (term 3) becomes the 
leader.
{noformat}
W0317 09:18:22.404738 17916 leader_election.cc:334] T 
00000000000000000000000000000000 P cba6c57c99f44acfabfb650e6cb94d06 
[CANDIDATE]: Term 3 pre-election: RPC error from VoteRequest() call to peer 
6fb6e93836bb45ae882edd7e0d26c852 (127.0.92.126:41791): Timed out: connection 
negotiation to 127.0.92.126:41791 for RPC RequestConsensusVote timed out after 
1.923s (ON_OUTBOUND_QUEUE)
I0317 09:18:22.407025 17942 raft_consensus.cc:1223] T 
00000000000000000000000000000000 P cba6c57c99f44acfabfb650e6cb94d06 [term 3 
LEADER]: Rejecting Update request from peer 6fb6e93836bb45ae882edd7e0d26c852 
for earlier term 2. Current term is 3. Ops: []
I0317 09:18:22.411103 20992 consensus_queue.cc:1038] T 
00000000000000000000000000000000 P 6fb6e93836bb45ae882edd7e0d26c852 [LEADER]: 
Peer responded invalid term: Peer: permanent_uuid: 
"cd7cbe8654e7426ca818c1c667cef824" member_type: VOTER last_known_addr { host: 
"127.0.92.124" port: 35709 }, Status: INVALID_TERM, Last received: 2.3568, Next 
index: 3569, Last known committed idx: 3572, Time since last communication: 
0.000s
I0317 09:18:22.411545 21016 raft_consensus.cc:3027] T 
00000000000000000000000000000000 P 6fb6e93836bb45ae882edd7e0d26c852 [term 2 
LEADER]: Stepping down as leader of term 2
I0317 09:18:22.411592 21016 raft_consensus.cc:726] T 
00000000000000000000000000000000 P 6fb6e93836bb45ae882edd7e0d26c852 [term 2 
LEADER]: Becoming Follower/Learner. State: Replica: 
6fb6e93836bb45ae882edd7e0d26c852, State: Running, Role: LEADER
{noformat}
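
The step-down in the last three log lines can be sketched as follows. This is 
a toy model only: ToyLeaderState and OnInvalidTermResponse are hypothetical 
names, and the sketch leaves the term unchanged because the log shows 6fb 
still at term 2 when it becomes a follower.

```cpp
#include <cassert>
#include <cstdint>

enum class Role { kLeader, kFollower };

// Toy view of a leader's consensus state (hypothetical).
struct ToyLeaderState {
  int64_t term;
  Role role;
};

// Called when a peer answers an Update with INVALID_TERM. If the peer is at
// a higher term, this node's leadership is stale, so it steps down.
void OnInvalidTermResponse(ToyLeaderState& state, int64_t responder_term) {
  if (state.role == Role::kLeader && responder_term > state.term) {
    state.role = Role::kFollower;
  }
}

// 6fb (term 2 leader) hears from cd7, which is already at term 3.
void DemoStepDown() {
  ToyLeaderState leader_6fb{2, Role::kLeader};
  OnInvalidTermResponse(leader_6fb, 3);
  assert(leader_6fb.role == Role::kFollower);
  assert(leader_6fb.term == 2);
}
```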

> Flakiness in dynamic_multi_master_test in VerifyClusterAfterMasterAddition() 
> function
> -------------------------------------------------------------------------------------
>
>                 Key: KUDU-3266
>                 URL: https://issues.apache.org/jira/browse/KUDU-3266
>             Project: Kudu
>          Issue Type: Test
>          Components: master, test
>    Affects Versions: 1.15.0
>            Reporter: Bankim Bhavsar
>            Assignee: Bankim Bhavsar
>            Priority: Major
>
> {noformat}
> ParameterizedRecoverMasterTest.TestRecoverDeadMasterSysCatalogCopy/1: 
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/integration-tests/cluster_verifier.cc:119:
>  Failure
> Failed
> Bad status: Not found: Unable to open table: the table does not exist: 
> table_name: "table-1"
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/master/dynamic_multi_master-test.cc:603:
>  Failure
> Expected: cv.CheckRowCount(table_name, ClusterVerifier::EXACTLY, 0) doesn't 
> generate new fatal failures in the current thread.
>   Actual: it does.
> 2021-03-17T17:04:19Z chronyd exiting
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/master/dynamic_multi_master-test.cc:1099:
>  Failure
> Expected: VerifyClusterAfterMasterAddition(master_hps, orig_num_masters_) 
> doesn't generate new fatal failures in the current thread.
>   Actual: it does.
> {noformat}
> Although the same verification function is used by other tests for add 
> master, this flakiness started showing up after introduction of the 
> RecoverDeadMaster test.
> https://github.com/apache/kudu/commit/4b4a8c0f2fdfd15524510821b27fc9c3b5d26b6b



