[ 
https://issues.apache.org/jira/browse/KUDU-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304422#comment-17304422
 ] 

Bankim Bhavsar edited comment on KUDU-3266 at 3/18/21, 9:10 PM:
----------------------------------------------------------------

Here is an analysis of what happens among the 3 masters when one of them is 
paused, a CreateTable request is issued, and an OpenTable request follows in 
the next iteration.
 
[https://github.com/apache/kudu/blob/master/src/kudu/master/dynamic_multi_master-test.cc#L598-L610]
{code:java}
      LOG(INFO) << "Pausing and resuming individual masters";
      string table_name = kTableName;
      for (int i = 0; i < expected_num_masters; i++) {
        ASSERT_OK(migrated_cluster.master(i)->Pause());
        cluster::ScopedResumeExternalDaemon resume_daemon(migrated_cluster.master(i));
        NO_FATALS(cv.CheckRowCount(table_name, ClusterVerifier::EXACTLY, 0));

        // See MasterFailoverTest.TestCreateTableSync to understand why we must
        // check for IsAlreadyPresent as well.
        table_name = Substitute("table-$0", i);
        Status s = CreateTable(&migrated_cluster, table_name);
        ASSERT_TRUE(s.ok() || s.IsAlreadyPresent());
      }
{code}
Consider 3 masters A, B, C.
 - A is the leader and B & C are followers
 - A gets paused
 - B becomes the leader
 - A create table request is propagated to B and C, forming a quorum.
 - Now A is resumed
 - While A is coming back up, B is paused.
 - C becomes a candidate and asks A for its vote in order to become leader. 
But A was itself the leader before it was paused and, for some reason, doesn't 
vote.
 - The open table request now goes to A (the leader) and gets a table not 
found error, because A didn't receive the create table request while it was 
paused.
 - Moments later B (which was the leader before it was paused) resumes and 
wins the election, and A steps down. But by this time the test has failed.
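
The sequence above can be replayed with a small toy model. This is only a 
sketch under stated assumptions: ToyMaster, ToyCreateTable, ToyOpenTable and 
ReplayScenario are hypothetical names and not Kudu code, there is no real Raft 
logic, and the model simply hard-codes the assumption that C's election 
attempt fails so that A keeps acting as leader.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy model only. These types and functions are hypothetical, not Kudu code.
struct ToyMaster {
  bool paused = false;
  bool believes_leader = false;
  std::set<std::string> catalog;  // tables this master has replicated
};

// Applies a create-table operation on every running master and reports
// whether a majority of all masters accepted it.
bool ToyCreateTable(std::vector<ToyMaster>& masters, const std::string& table) {
  std::size_t acks = 0;
  for (ToyMaster& m : masters) {
    if (!m.paused) {
      m.catalog.insert(table);
      ++acks;
    }
  }
  return acks * 2 > masters.size();
}

// Serves an open-table request from the first running master that believes
// it is the leader. Returns false when that master does not know the table,
// or when no such master exists.
bool ToyOpenTable(const std::vector<ToyMaster>& masters, const std::string& table) {
  for (const ToyMaster& m : masters) {
    if (!m.paused && m.believes_leader) {
      return m.catalog.count(table) > 0;
    }
  }
  return false;
}

struct ScenarioResult {
  bool found_while_b_paused;
  bool found_after_b_resumed;
};

// Replays the pause/resume sequence for masters A (index 0), B (1) and C (2).
ScenarioResult ReplayScenario() {
  std::vector<ToyMaster> m(3);
  ToyMaster& a = m[0];
  ToyMaster& b = m[1];
  a.believes_leader = true;   // A is the leader.
  a.paused = true;            // A gets paused; its own view does not change.
  b.believes_leader = true;   // B becomes the leader.
  const bool quorum = ToyCreateTable(m, "table-1");  // Reaches B and C only.
  assert(quorum);
  (void)quorum;
  a.paused = false;           // A is resumed.
  b.paused = true;            // B is paused.
  // Assumption: C's election attempt fails, so A keeps acting as leader.
  const bool first = ToyOpenTable(m, "table-1");
  b.paused = false;           // B resumes and wins the election.
  a.believes_leader = false;  // A steps down.
  const bool second = ToyOpenTable(m, "table-1");
  return {first, second};
}
```

In this model the open request is answered by A while B is paused and reports 
the table as missing, whereas the same request succeeds once B is back and A 
has stepped down.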


> Flakiness in dynamic_multi_master_test in VerifyClusterAfterMasterAddition() function
> -------------------------------------------------------------------------------------
>
>                 Key: KUDU-3266
>                 URL: https://issues.apache.org/jira/browse/KUDU-3266
>             Project: Kudu
>          Issue Type: Test
>          Components: master, test
>    Affects Versions: 1.15.0
>            Reporter: Bankim Bhavsar
>            Assignee: Bankim Bhavsar
>            Priority: Major
>
> {noformat}
> ParameterizedRecoverMasterTest.TestRecoverDeadMasterSysCatalogCopy/1: 
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/integration-tests/cluster_verifier.cc:119: Failure
> Failed
> Bad status: Not found: Unable to open table: the table does not exist: table_name: "table-1"
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/master/dynamic_multi_master-test.cc:603: Failure
> Expected: cv.CheckRowCount(table_name, ClusterVerifier::EXACTLY, 0) doesn't generate new fatal failures in the current thread.
>   Actual: it does.
> 2021-03-17T17:04:19Z chronyd exiting
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/master/dynamic_multi_master-test.cc:1099: Failure
> Expected: VerifyClusterAfterMasterAddition(master_hps, orig_num_masters_) doesn't generate new fatal failures in the current thread.
>   Actual: it does.
> {noformat}
> Although the same verification function is used by other tests for add 
> master, this flakiness started showing up after introduction of the 
> RecoverDeadMaster test.
> https://github.com/apache/kudu/commit/4b4a8c0f2fdfd15524510821b27fc9c3b5d26b6b



--
This message was sent by Atlassian Jira
(v8.3.4#803005)