[ 
https://issues.apache.org/jira/browse/GEODE-8659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221777#comment-17221777
 ] 

Dan Smith commented on GEODE-8659:
----------------------------------

This appears to be a race condition during the join process. In the logs, there 
are two locators, one with port 41002 (locator A) and one with port 41003 
(locator B). Locator A is configured with a locator-wait-time and the address 
of locator B. Locator B has no wait time and no knowledge of other locators.

What I see in the test is this:
1. Locator A starts up. It is spinning and checking for Locator B to start
{noformat}
info 2020/10/26 17:42:42.970 GMT <pool-3-thread-1> tid=93] Unable to contact 
locator HostAndPort[5199b02bbb6f:32857]: java.io.EOFException: Locator at 
HostAndPort[5199b02bbb6f:32857] did not respond. This is normal if the locator 
was shutdown. If it wasn't check its log for exceptions.


[info 2020/10/26 17:42:43.975 GMT <pool-3-thread-1> tid=93] Discovery state 
after looking for membership coordinator is locatorsContacted=1; 
findInViewResponses=0; alreadyTried=[]; 
registrants=[172.17.0.34(1:locator)<ec>:41002]; 
possibleCoordinator=172.17.0.34(1:locator)<ec>:41002; viewId=-1; 
hasContactedAJoinedLocator=false; view=null; responses=[]

{noformat}
2. Locator B starts. It creates a distributed system and becomes the coordinator
{noformat}
[info 2020/10/26 17:42:44.818 GMT <Test worker> tid=27] Starting membership 
services

[info 2020/10/26 17:42:45.003 GMT <Test worker> tid=27] This member is becoming 
the membership coordinator with address 172.17.0.34(1:locator)<ec>:41003

[info 2020/10/26 17:42:45.010 GMT <Test worker> tid=27] Finished joining (took 
60ms).

{noformat}

3. Locator A manages to contact Locator B after it has created the system but 
*before* it has installed a view. This means it gets an interesting response. 
It is not a failure, but it also does not have any joined members.
{noformat}
[info 2020/10/26 17:42:45.013 GMT <pool-3-thread-1> tid=93] received 
FindCoordinatorResponse(coordinator=172.17.0.34(1:locator)<ec><v0>:41003, 
fromView=false, viewId=null, registrants=[], 
senderId=172.17.0.34(1:locator)<ec><v0>:41003, network partition detection 
enabled=true, locators preferred as coordinators=true, view=null) from locator 
HostAndPort[5199b02bbb6f:32857]

[info 2020/10/26 17:42:45.013 GMT <pool-3-thread-1> tid=93] Locator's address 
indicates it is part of a distributed system so I will not become membership 
coordinator on this attempt to join

[info 2020/10/26 17:42:45.022 GMT <pool-3-thread-1> tid=93] received 
FindCoordinatorResponse(coordinator=172.17.0.34(1:locator)<ec>:41002, 
fromView=false, viewId=null, registrants=[172.17.0.34(1:locator)<ec>:41002], 
senderId=172.17.0.34(1:locator)<ec>:41002, network partition detection 
enabled=true, locators preferred as coordinators=true, view=null) from locator 
HostAndPort[5199b02bbb6f:46597]

[info 2020/10/26 17:42:45.023 GMT <pool-3-thread-1> tid=93] findCoordinator 
chose 172.17.0.34(1:locator)<ec>:41002 out of these possible coordinators: 
[172.17.0.34(1:locator)<ec><v0>:41003, 172.17.0.34(1:locator)<ec>:41002]

[info 2020/10/26 17:42:45.023 GMT <pool-3-thread-1> tid=93] Discovery state 
after looking for membership coordinator is locatorsContacted=2; 
findInViewResponses=0; alreadyTried=[]; 
registrants=[172.17.0.34(1:locator)<ec>:41002]; 
possibleCoordinator=172.17.0.34(1:locator)<ec>:41002; viewId=-1; 
hasContactedAJoinedLocator=true; view=null; responses=[]

{noformat}

4. It appears because Locator A sees that it has contacted all of the locators, 
and there is not actually a view in that response, it decides it should become 
the coordinator as well, creating a split brain.
{noformat}
[info 2020/10/26 17:42:45.023 GMT <pool-3-thread-1> tid=93] found possible 
coordinator 172.17.0.34(1:locator)<ec>:41002

[info 2020/10/26 17:42:45.023 GMT <pool-3-thread-1> tid=93] This member is 
becoming the membership coordinator with address 
172.17.0.34(1:locator)<ec>:41002

{noformat}

The problem seems to be the above response. Locator B is sending a response to 
Locator A, but that is not informing Locator A that it should not become the 
coordinator. It believe it's passing through this check because 
*joinedMembersContacted* is still zero:
{code}
            if (state.joinedMembersContacted <= 0 && ((now >= locatorGiveUpTime 
&&
                tries >= minimumRetriesBeforeBecomingCoordinator) ||
                state.locatorsContacted >= locators.size())) {
              synchronized (viewInstallationLock) {
                becomeCoordinator();
              }

{code}

There is *another* flag on SearchState that we are not checking in the above 
block, which is *hasContactedAJoinedLocator*. So maybe we could check that as 
well? However, it's not clear to me that it will definitely always be true in 
the above case. Ideally, a locator will either (A) not respond, or (B) respond 
and affirmatively indicate that it's becoming the coordinator in this case.

> MembershipIntegrationTest.secondMembershipPausesForLocatorWaitTime fails with 
> ConditionTimeoutException
> -------------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-8659
>                 URL: https://issues.apache.org/jira/browse/GEODE-8659
>             Project: Geode
>          Issue Type: Bug
>          Components: membership, tests
>            Reporter: Kirk Lund
>            Priority: Major
>
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=  Test Results URI 
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> http://files.apachegeode-ci.info/builds/apache-develop-pr/geode-pr-5667/test-results/integrationTest/1603735181/
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> Test report artifacts from this job are available at:
> http://files.apachegeode-ci.info/builds/apache-develop-pr/geode-pr-5667/test-artifacts/1603735181/integrationtestfiles-geode-pr-5667.tgz
> {noformat}
> org.apache.geode.distributed.internal.membership.gms.MembershipIntegrationTest
>  > secondMembershipPausesForLocatorWaitTime FAILED
>     org.awaitility.core.ConditionTimeoutException: Assertion condition 
> defined as a lambda expression in 
> org.apache.geode.distributed.internal.membership.gms.MembershipIntegrationTest
>  that uses org.apache.geode.distributed.internal.membership.api.Membership 
>     Expected size:<2> but was:<1> in:
>     <[172.17.0.34(1:locator)<ec><v0>:41003]> within 5 minutes.
>         at 
> org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:165)
>         at 
> org.awaitility.core.AssertionCondition.await(AssertionCondition.java:119)
>         at 
> org.awaitility.core.AssertionCondition.await(AssertionCondition.java:31)
>         at 
> org.awaitility.core.ConditionFactory.until(ConditionFactory.java:895)
>         at 
> org.awaitility.core.ConditionFactory.untilAsserted(ConditionFactory.java:679)
>         at 
> org.apache.geode.distributed.internal.membership.gms.MembershipIntegrationTest.secondMembershipPausesForLocatorWaitTime(MembershipIntegrationTest.java:270)
>         Caused by:
>         java.lang.AssertionError: 
>         Expected size:<2> but was:<1> in:
>         <[172.17.0.34(1:locator)<ec><v0>:41003]>
>             at 
> org.apache.geode.distributed.internal.membership.gms.MembershipIntegrationTest.lambda$secondMembershipPausesForLocatorWaitTime$8(MembershipIntegrationTest.java:271)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to