Alberto Bustamante Reyes created GEODE-9633:
-----------------------------------------------

             Summary: Region and gateway receiver init order may cause a hang
                 Key: GEODE-9633
                 URL: https://issues.apache.org/jira/browse/GEODE-9633
             Project: Geode
          Issue Type: Bug
            Reporter: Alberto Bustamante Reyes


This ticket has been created as suggested on [the dev 
list|https://markmail.org/thread/qq32z5hducjoqndz].

-----

I have been analyzing an issue that occurs in the following scenario:


1) I start two Geode clusters (cluster1 & cluster2) with one locator and two
servers each.
Both clusters host a partitioned region called "testregion", which is replicated
using a parallel gateway sender and a gateway receiver.
(These are [the gfsh
files|https://gist.github.com/alb3rtobr/e230623255632937fa68265f31e97f3a] that I
used to create the clusters.)

2) I run a client connected to cluster2 performing operations on testregion.


3) I stop cluster1 and delete all of its persistent data. Then I create
cluster1 again.


4) At this point, the command to create "testregion" gets stuck.


After checking the thread stack and the code, I found that the problem is the
following.


This thread is trapped in an infinite loop, waiting for a bucket primary election
at "PartitionedRegion.waitForNoStorageOrPrimary":

{code}
"Function Execution Processor4" tid=0x55
    java.lang.Thread.State: TIMED_WAITING
        at [email protected]/java.lang.Object.wait(Native Method)
        -  waiting on org.apache.geode.internal.cache.BucketAdvisor@28be7ae0
        at app//org.apache.geode.internal.cache.BucketAdvisor.waitForPrimaryMember(BucketAdvisor.java:1433)
        at app//org.apache.geode.internal.cache.BucketAdvisor.waitForNewPrimary(BucketAdvisor.java:825)
        at app//org.apache.geode.internal.cache.BucketAdvisor.getPrimary(BucketAdvisor.java:794)
        at app//org.apache.geode.internal.cache.partitioned.RegionAdvisor.getPrimaryMemberForBucket(RegionAdvisor.java:1032)
        at app//org.apache.geode.internal.cache.PartitionedRegion.getBucketPrimary(PartitionedRegion.java:9081)
        at app//org.apache.geode.internal.cache.PartitionedRegion.waitForNoStorageOrPrimary(PartitionedRegion.java:3249)
        at app//org.apache.geode.internal.cache.PartitionedRegion.getNodeForBucketWrite(PartitionedRegion.java:3234)
        at app//org.apache.geode.internal.cache.PartitionedRegion.shadowPRWaitForBucketRecovery(PartitionedRegion.java:10110)
        at app//org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderQueue.java:564)
        at app//org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderQueue.java:443)
        at app//org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderEventProcessor.addShadowPartitionedRegionForUserPR(ParallelGatewaySenderEventProcessor.java:195)
        at app//org.apache.geode.internal.cache.wan.parallel.ConcurrentParallelGatewaySenderQueue.addShadowPartitionedRegionForUserPR(ConcurrentParallelGatewaySenderQueue.java:183)
        at app//org.apache.geode.internal.cache.PartitionedRegion.postCreateRegion(PartitionedRegion.java:1177)
        at app//org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3050)
        at app//org.apache.geode.internal.cache.GemFireCacheImpl.basicCreateRegion(GemFireCacheImpl.java:2910)
        at app//org.apache.geode.internal.cache.GemFireCacheImpl.createRegion(GemFireCacheImpl.java:2894)
        at app//org.apache.geode.cache.RegionFactory.create(RegionFactory.java:773)
{code}

After testregion is created, the sender queue's partitioned region is created.
While that region's buckets are being recovered, the command is trapped in an
infinite loop waiting for a primary bucket election at
PartitionedRegion.waitForNoStorageOrPrimary.

This seems to be a known issue, because PartitionedRegion.getNodeForBucketWrite
contains the following comment just before the call to
waitForNoStorageOrPrimary (and the comment has been there since Geode's
first commit!):

{code}
    // Possible race with loss of redundancy at this point.
    // This loop can possibly create a soft hang if no primary is ever selected.
    // This is preferable to returning null since it will prevent obtaining the
    // bucket lock for bucket creation.
    return waitForNoStorageOrPrimary(bucketId, "write");
{code}
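
To make the "soft hang" mentioned in that comment concrete, here is a minimal, self-contained Java sketch of the pattern. It is not Geode's code: the class PrimaryWaitSketch, the Advisor stand-in and both wait methods are hypothetical names invented for illustration, and the 100 ms retry slice is an arbitrary assumption. It only contrasts a retry loop that never gives up with one that is bounded by a deadline.

```java
import java.util.concurrent.TimeUnit;

public class PrimaryWaitSketch {

    /** Hypothetical stand-in for a bucket advisor: holds the elected primary, if any. */
    public static final class Advisor {
        private String primary; // stays null until a primary is elected

        public synchronized void electPrimary(String member) {
            primary = member;
            notifyAll(); // wake up any thread waiting for a primary
        }

        /** Waits at most sliceMillis for a primary; returns null if there is still none. */
        public synchronized String waitForPrimary(long sliceMillis) throws InterruptedException {
            if (primary == null && sliceMillis > 0) {
                wait(sliceMillis); // the thread is TIMED_WAITING while parked here
            }
            return primary;
        }
    }

    /** Unbounded retry: if no primary is ever elected, this never returns (a soft hang). */
    public static String waitForever(Advisor advisor) throws InterruptedException {
        String primary;
        while ((primary = advisor.waitForPrimary(100)) == null) {
            // no primary yet, keep retrying
        }
        return primary;
    }

    /** Bounded retry: gives up and returns null once timeoutMillis has elapsed. */
    public static String waitWithDeadline(Advisor advisor, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        try {
            while (true) {
                long remaining = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                long slice = Math.min(Math.max(remaining, 0), 100);
                String primary = advisor.waitForPrimary(slice);
                if (primary != null || remaining <= 0) {
                    return primary;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        // No primary is ever elected here, so the bounded wait gives up.
        Advisor stuck = new Advisor();
        System.out.println("without election: " + waitWithDeadline(stuck, 300));

        // A primary is elected shortly after the wait starts, so the wait succeeds.
        Advisor healthy = new Advisor();
        Thread elector = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException ignored) {
                // fall through and elect anyway
            }
            healthy.electPrimary("server1");
        });
        elector.start();
        System.out.println("with election: " + waitWithDeadline(healthy, 1000));
        elector.join();
    }
}
```

Calling waitForever on an Advisor whose electPrimary is never invoked would leave the caller parked in a timed wait on each retry, which is the same thread state shown in the stack trace above. Whether a bounded wait would be acceptable in Geode is a separate question, since the original comment argues that returning null here is worse than waiting.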

Any idea about why the primary bucket is not elected?

The failure seems to be related to the fact that "testregion" is receiving
updates from the receiver before the "create region" command has finished. If
the test is repeated without traffic on cluster2, or if I create cluster1's
receiver after creating "testregion", the problem does not occur.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
