[ https://issues.apache.org/jira/browse/KUDU-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304422#comment-17304422 ]
Bankim Bhavsar edited comment on KUDU-3266 at 3/18/21, 9:10 PM:
----------------------------------------------------------------

Here is an analysis of what happens among the 3 masters when one of them is paused, a CreateTable request is issued, and an OpenTable request follows in the next iteration.

[https://github.com/apache/kudu/blob/master/src/kudu/master/dynamic_multi_master-test.cc#L598-L610]
{code:java}
  LOG(INFO) << "Pausing and resuming individual masters";
  string table_name = kTableName;
  for (int i = 0; i < expected_num_masters; i++) {
    ASSERT_OK(migrated_cluster.master(i)->Pause());
    cluster::ScopedResumeExternalDaemon resume_daemon(migrated_cluster.master(i));
    NO_FATALS(cv.CheckRowCount(table_name, ClusterVerifier::EXACTLY, 0));

    // See MasterFailoverTest.TestCreateTableSync to understand why we must
    // check for IsAlreadyPresent as well.
    table_name = Substitute("table-$0", i);
    Status s = CreateTable(&migrated_cluster, table_name);
    ASSERT_TRUE(s.ok() || s.IsAlreadyPresent());
  }
{code}
Consider 3 masters A, B, C.
 - A is the leader and B & C are followers.
 - A gets paused.
 - B becomes the leader.
 - A create table request is propagated to B and C, which form a quorum.
 - Now A is resumed.
 - While A is coming back up, B is paused.
 - C becomes a candidate and tries to become leader by asking A for its vote. But A itself was the leader before it was paused and, for some reason, doesn't vote.
 - The open table request now goes to master A (the leader) and gets a table-not-found error, because A didn't receive the create table request while it was paused.
 - Moments later B resumes (it was the leader before it was paused) and wins the election, and A steps down. But by this time the test has failed.
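The sequence above can be replayed in a toy model to make the stale read concrete. This is only a sketch: ToyMaster, ToyCreateTable, ToyOpenTable and ReplayScenario are hypothetical names and not Kudu classes or APIs. It assumes A has not caught up from B before B is paused, and it treats C's failed election simply as "no new leader". The table name "table-1" is taken from the failure log.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy model only: these types are hypothetical and are not Kudu classes.
struct ToyMaster {
  explicit ToyMaster(std::string n) : name(std::move(n)) {}
  std::string name;
  bool paused = false;
  bool believes_leader = false;
  std::set<std::string> catalog;  // tables replicated to this master
};

// Replicates a create-table operation to every running master and reports
// whether a majority (the quorum) now holds it.
bool ToyCreateTable(std::vector<ToyMaster>& masters, const std::string& table) {
  std::size_t acks = 0;
  for (ToyMaster& m : masters) {
    if (!m.paused) {
      m.catalog.insert(table);
      ++acks;
    }
  }
  return acks * 2 > masters.size();
}

// Serves an open-table request from 'm', which must be running and must
// believe it is the leader. Returns whether the table is in its catalog.
bool ToyOpenTable(const ToyMaster& m, const std::string& table) {
  assert(!m.paused && m.believes_leader);
  return m.catalog.count(table) > 0;
}

struct ScenarioResult {
  bool create_committed;
  bool found_on_stale_a;
  bool found_on_b_after_resume;
};

ScenarioResult ReplayScenario() {
  std::vector<ToyMaster> ms;
  ms.emplace_back("A");
  ms.emplace_back("B");
  ms.emplace_back("C");
  ToyMaster& a = ms[0];
  ToyMaster& b = ms[1];

  a.believes_leader = true;   // A is the leader.
  a.paused = true;            // A gets paused and cannot learn it was replaced.
  b.believes_leader = true;   // B becomes the leader.
  const bool committed = ToyCreateTable(ms, "table-1");  // Reaches B and C only.
  a.paused = false;           // A resumes, still believing it is the leader.
  b.paused = true;            // B is paused; C's election does not succeed.
  const bool on_a = ToyOpenTable(a, "table-1");  // Served from A's stale catalog.
  b.paused = false;           // B resumes and wins the election.
  a.believes_leader = false;  // A steps down.
  const bool on_b = ToyOpenTable(b, "table-1");
  return {committed, on_a, on_b};
}
```

Calling ReplayScenario() in this model gives a committed create, a miss on A and a hit on B once it is back. That matches the observation that the table exists on a quorum and the failure comes only from which master answers the open.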

> Flakiness in dynamic_multi_master_test in VerifyClusterAfterMasterAddition() function
> -------------------------------------------------------------------------------------
>
>                 Key: KUDU-3266
>                 URL: https://issues.apache.org/jira/browse/KUDU-3266
>             Project: Kudu
>          Issue Type: Test
>          Components: master, test
>    Affects Versions: 1.15.0
>            Reporter: Bankim Bhavsar
>            Assignee: Bankim Bhavsar
>            Priority: Major
>
> {noformat}
> ParameterizedRecoverMasterTest.TestRecoverDeadMasterSysCatalogCopy/1:
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/integration-tests/cluster_verifier.cc:119: Failure
> Failed
> Bad status: Not found: Unable to open table: the table does not exist: table_name: "table-1"
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/master/dynamic_multi_master-test.cc:603: Failure
> Expected: cv.CheckRowCount(table_name, ClusterVerifier::EXACTLY, 0) doesn't generate new fatal failures in the current thread.
>   Actual: it does.
> 2021-03-17T17:04:19Z chronyd exiting
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/master/dynamic_multi_master-test.cc:1099: Failure
> Expected: VerifyClusterAfterMasterAddition(master_hps, orig_num_masters_) doesn't generate new fatal failures in the current thread.
>   Actual: it does.
> {noformat}
> Although the same verification function is used by other tests for add master, this flakiness started showing up after introduction of the RecoverDeadMaster test.
> https://github.com/apache/kudu/commit/4b4a8c0f2fdfd15524510821b27fc9c3b5d26b6b

--
This message was sent by Atlassian Jira
(v8.3.4#803005)