kevin-wu24 commented on PR #20859:
URL: https://github.com/apache/kafka/pull/20859#issuecomment-3583469356

   Hmmm, the more I think about this change, the more complicated this newly proposed semantic seems to be.
   
   Here is another case I thought about:
   
   The voter set is (A,B) with leader A. Node C is a caught-up observer and tries to add itself. The new voter set (A,B,C) is committed by A and B, but C is partitioned and never learns that its "auto-join" succeeded. The user removes C, and the voter set is back to (A,B). When the partition heals, if the `updateVoterSet` timer has expired, node C's `shouldSendAddOrRemoveVoter` evaluates to `true` and it will try to auto-join again.
   I think `AddVoterHandler` does check that the node being added is "caught up," but that check currently passes as long as the node has fetched within the last hour (we can obviously make this stricter; a rough sketch of one option is below).
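
   For illustration only, here is a minimal sketch of what a stricter "caught up" check could look like, combining fetch recency with an offset condition. The `ReplicaState` record, its accessors, and the 30-second bound are placeholders I made up for the example, not the actual `AddVoterHandler`/leader-state API:

   ```java
   // Sketch only: `ReplicaState` is a stand-in for whatever per-replica data the
   // leader tracks about the prospective voter; it is not the real API.
   record ReplicaState(long lastFetchTimestampMs, long logEndOffset) {}

   final class CaughtUpCheck {
       // Hypothetical bound, much tighter than "fetched within the last hour".
       private static final long MAX_FETCH_LAG_MS = 30_000L;

       // Require both a recent fetch and a log end offset at (or beyond) the
       // leader's end offset observed when the AddVoter request is evaluated.
       static boolean isCaughtUp(ReplicaState replica, long leaderEndOffset, long nowMs) {
           boolean fetchedRecently = nowMs - replica.lastFetchTimestampMs() <= MAX_FETCH_LAG_MS;
           boolean offsetCaughtUp = replica.logEndOffset() >= leaderEndOffset;
           return fetchedRecently && offsetCaughtUp;
       }
   }
   ```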
   
   Another thing we need to consider is that multiple voter set changes may occur in the same `FetchRequestResponse`. For example, we could have the following:

   The voter set is (A,B) with leader A. Node C is a caught-up observer and tries to add itself. The new voter set (A,B,C) is committed by A and B, but C is partitioned and never learns that its "auto-join" succeeded. The user removes C, and the voter set is back to (A,B). This is similar to the case above, but when the partition heals, suppose node C's `shouldSendAddOrRemoveVoter` evaluates to `false`, and node C fetches instead of trying to auto-join. If this fetch contains both voter set changes in the same response (so the latest voter set is (A,B)), the current code will still believe node C is in the `HAS_NOT_JOINED` state and will attempt to auto-join again, because we only check `partitionState.lastVoterSet()` (sketched below).
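
   To make that concrete, here is a paraphrased sketch of the over-narrow check (a hypothetical helper, not the PR's exact code; matching on voter ids via `VoterSet#voterIds` instead of the full `ReplicaKey` is just to keep it short):

   ```java
   // Paraphrased sketch: the auto-join decision only consults the *latest* voter
   // set, so the intermediate (A,B,C) VotersRecord that contained node C is never
   // noticed.
   final class LatestVoterSetOnlyCheck {
       static boolean missingFromLatestVoterSet(VoterSet latestVoterSet, ReplicaKey localReplicaKey) {
           // If this fetch applied (A,B) -> (A,B,C) -> (A,B), this still reports
           // node C as missing even though C's auto-join already succeeded once.
           return !latestVoterSet.voterIds().contains(localReplicaKey.id());
       }
   }
   ```
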
   I still think we can handle this properly in the code; it means we need to transition `HAS_NOT_JOINED -> HAS_JOINED` if any `VotersRecord` from the `FETCH_RESPONSE` contains the local `ReplicaKey`. We would go through the new voter sets in `partitionState.updateState()`, and remember that we do not need the HWM here, because to get to the `HAS_NOT_JOINED` state we had to be "caught up" at that point in time. A sketch of that transition follows.
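
   Roughly, the transition could look like the following. This is only a sketch: the `AutoJoinState` enum, the method name, and passing the decoded voter sets as a plain list are illustrative, and matching on voter ids instead of the full `ReplicaKey` just keeps it short:

   ```java
   import java.util.List;

   // Sketch of the proposed transition: while applying control records from a
   // FETCH response, flip to HAS_JOINED as soon as *any* VotersRecord contains
   // the local replica, instead of only consulting partitionState.lastVoterSet()
   // afterwards.
   enum AutoJoinState { HAS_NOT_JOINED, HAS_JOINED }

   final class AutoJoinTracker {
       static AutoJoinState onVotersRecordsApplied(
           AutoJoinState current,
           List<VoterSet> newVoterSets,   // voter sets decoded from this fetch, in log order
           ReplicaKey localReplicaKey
       ) {
           if (current == AutoJoinState.HAS_JOINED) {
               return current; // auto-join happens at most once per process lifetime
           }
           for (VoterSet voterSet : newVoterSets) {
               // No HWM check is needed here: as noted above, getting to this point
               // already required being "caught up" at that point in time.
               if (voterSet.voterIds().contains(localReplicaKey.id())) {
                   return AutoJoinState.HAS_JOINED;
               }
           }
           return current;
       }
   }
   ```

   The nice property is that the decision no longer depends on whether the two `VotersRecord`s land in one fetch response or several.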
   
   I want to highlight the above two cases to say that even though the desired semantic seems very simple (auto-join at most once per process lifetime), implementing it properly is not straightforward at all: there are a ton of edge cases, because our local node's view of the voter set can be out of date by an arbitrary number of `VotersRecord`s. Additionally, at least for me, verifying that the implementation does indeed match our desired semantic is just as difficult (I do think I am out of edge cases now... no promises though 😆 ).
   
   Let me know your thoughts @TaiJuWu @showuon @jsancio.

