[
https://issues.apache.org/jira/browse/GEODE-9191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339313#comment-17339313
]
Xiaojian Zhou edited comment on GEODE-9191 at 5/27/21, 6:30 PM:
----------------------------------------------------------------
More investigation found that the primary buckets could switch at any time
especially when they are not balanced (usually happened in GII). We need to
lock the primary from moving.
The revised design will be:
(1) coordinator(a server) assignAllBuckets.
(2) coordinator lock local primary buckets and sends lock message to all peer
members.
(3) upon received the lock message, each datastore server will:
- lockBucketCreationForRegionClear (Maybe we don't need it)
- waits for all the primaries to show up
- iterate through local primary bucket list to lock primary from moving, then
lock RVV
- reply with number of buckets locked
(4) coordinator collected locked bucket numbers from each member, if matched
the expected total bucket number (i.e. default is 113), move on to next step.
Otherwise, retry.
- It's possible while iterating through the local primary bucket list, some of
the primary bucket is no longer primary, in this case, the sum of locked
primary bucket numbers that collected by the coordinator could be different
with the expected total bucket number.
Then coordinator will unlock all the members and retry.
- Retry until succeed or failed with PartialClearException. Retry will usually
succeed, unless there're too many servers shutdown. In that case,
waitForPrimary will fail with PartialClearException and break the endless
retry.
(5) If a member is down, the membership listener will detected and let
coordinator to retry. If too many members are down, wait for primary should
fail with PartitionedRegionPartialClearException. Then coordinator will unlock
and throw this exception to caller. If the coordinator is down, the pr clear
will fail.
Note: membership listener will trigger coordinator to retry from beginning, not
to let each member to retry locally, because the primary list might have
changed.
(6) After locked all the members' primary buckets (both locked primary and
locked RVV), the coordinator sends clear message to all the members.
(7) each member clear primary buckets one by one and return number of buckets
cleared.
(8) Coordinator collect all the numbers cleared, if less than expected bucket
number, retry. This could happen when a member is offline in the middle of
clear. The retry should succeed finally unless too many servers are down. Then
the waitForPrimary will throw PartialClearException.
(9) In unlock, the coordinator should send UNLOCK message to all the members to
unlock not only primary buckets, because the primary list at each member could
have been changed.
- should iterate through all the local buckets and unlock RVV, then unlock
primary moving.
- getLockRequester()==null means coordinator is down. In that case, we should
still do unlock.
- Since the PR clear will retry forever until succeeded or fail with
ParticlalClearException due to too many members are down, there's no need to
lockBucketCreationForRegionClear.
- if a member is down, the listener is triggered at all the members, these
members will check who is down. If the down member is the non-coordinator, then
coordinator will notify all other members to unlock and cleanup. If it’s the
coordinator shutdown, other members should unlock and cleanup.
- if coordinator is down: should rely on listener to let each member to unlock
their own local buckets (unlock on unlocked buckets is ok)
Test cases to be added in other geode ticket:
(1) secondary become new primary which is unlocked, how the on-going operation
sync with pr clear
(2) a new operation create a new bucket which is also a primary, how to sync
with pr clear|
(3) fixed partition, how it sync with assignAllBuckets and PR clear
(4) if shutdown 2 servers in redundancy=1, should call assignAllBuckets() to
recreate buckets in existing servers, unless PartitionOfflineException
was (Author: zhouxj):
More investigation found that the primary buckets could switch at any time
especially when they are not balanced (usually happened in GII). We need to
lock the primary from moving.
The revised design will be:
(1) coordinator(a server) assignAllBuckets.
(2) coordinator lock local primary buckets and sends lock message to all peer
members.
(3) upon received the lock message, each datastore server will:
- lockBucketCreationForRegionClear (Maybe we don't need it)
- waits for all the primaries to show up
- iterate through local primary bucket list to lock primary from moving, then
lock RVV
- reply with number of buckets locked
(4) coordinator collected locked bucket numbers from each member, if matched
the expected total bucket number (i.e. default is 113), move on to next step.
Otherwise, retry.
- It's possible while iterating through the local primary bucket list, some of
the primary bucket is no longer primary, in this case, the sum of locked
primary bucket numbers that collected by the coordinator could be different
with the expected total bucket number.
Then coordinator will unlock all the members and retry.
- Retry until succeed or failed with PartialClearException. Retry will usually
succeed, unless there're too many servers shutdown. In that case,
waitForPrimary will fail with PartialClearException and break the endless
retry.
(5) If a member is down, the membership listener will detected and let
coordinator to retry. If too many members are down, wait for primary should
fail with PartitionedRegionPartialClearException. Then coordinator will unlock
and throw this exception to caller. If the coordinator is down, the pr clear
will fail.
Note: membership listener will trigger coordinator to retry from beginning, not
to let each member to retry locally, because the primary list might have
changed.
(6) After locked all the members' primary buckets (both locked primary and
locked RVV), the coordinator sends clear message to all the members.
(7) each member clear primary buckets one by one and return number of buckets
cleared.
(8) Coordinator collect all the numbers cleared, if less than expected bucket
number, retry. This could happen when a member is offline in the middle of
clear. The retry should succeed finally unless too many servers are down. Then
the waitForPrimary will throw PartialClearException.
(9) In unlock, the coordinator should send UNLOCK message to all the members to
unlock not only primary buckets, because the primary list at each member could
have been changed.
- should iterate through all the local buckets and unlock RVV, then unlock
primary moving.
- getLockRequester()==null means coordinator is down. In that case, we should
still do unlock.
- Since the PR clear will retry forever until succeeded or fail with
ParticlalClearException due to too many members are down, there's no need to
lockBucketCreationForRegionClear.
> PR clear could miss clearing bucket which lost primary
> ------------------------------------------------------
>
> Key: GEODE-9191
> URL: https://issues.apache.org/jira/browse/GEODE-9191
> Project: Geode
> Issue Type: Sub-task
> Reporter: Xiaojian Zhou
> Assignee: Xiaojian Zhou
> Priority: Major
> Labels: GeodeOperationAPI, pull-request-available
>
> This scenario is found when introducing GII test case for PR clear. The
> sequence is:
> (1) there're 3 servers, server1 is accessor, server2 and server3 are
> datastores.
> (2) shutdown server2
> (3) send PR clear from server1 (accessor) and restart server2 at the same
> time. There's a race that server2 did not receive the
> PartitionedRegionClearMessage.
> (4) server2 finished GII
> (5) only server3 received PartitionedRegionClearMessage and it hosts all the
> primary buckets. When PR clear thread iterates through these primary buckets
> one by one, some of them might lose primary to server2.
> (6) BR.cmnClearRegion will return immediately since it's no longer primary,
> but clearedBuckets.add(localPrimaryBucketRegion.getId()); will still be
> called. So from the caller point of view, this bucket is cleared. It wouldn't
> even throw PartitionedRegionPartialClearException.
> The problem is:
> before calling cmnClearRegion, we should call BR.doLockForPrimary to make
> sure it's still primary. If not, throw exception. Then
> clearedBuckets.add(localPrimaryBucketRegion.getId()); will not be called for
> this bucket.
> The expected behavior in this scenario is to throw
> PartitionedRegionPartialClearException.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)