kevin-wu24 commented on PR #20859: URL: https://github.com/apache/kafka/pull/20859#issuecomment-3583469356
Hmmm, the more I think about this change the more complicated I find this new proposed semantic to be. Here is another case I thought about: Voter set is (A,B) with leader A. Node C is a caught up observer and tries to add itself. The new voter set (A,B,C) is committed by A and B, but C is partitioned and never learns its "auto-join" succeeded. The user removes C, and the voter set is back to (A,B). When the partition goes away, if `updateVoterSet` timer is expired, node C's `shouldSendAddOrRemoveVoter` evaluates to `true` and it will try to auto-join again. I think in `AddVoterHandler` we check that the node being added is "caught up," but that check currently returns true if the node has fetched within the last hour (we can make this more strict obviously). Another thing we need to consider is that multiple voter set changes may occur in the same `FetchRequestResponse`. For example, we could have the following: Voter set is (A,B) with leader A. Node C is a caught up observer and tries to add itself. The new voter set (A,B,C) is committed by A and B, but C is partitioned and never learns its "auto-join" succeeded. The user removes C, and the voter set is back to (A,B). This is similar to the case above, but when the partition goes away, suppose node C's `shouldSendAddOrRemoveVoter` evaluates to `false`, and node C fetches instead of trying to auto-join. If this fetch contains both voter set changes in the same response (the latest voter set is (A,B)), the current code will still believe node C is in the `HAS_NOT_JOINED` state, and will attempt to auto-join again because we are only checking `partitionState.lastVoterSet()`. I still think we can handle this properly in the code, and it means we need transition `HAS_NOT_JOINED -> HAS_JOINED` if any `VotersRecord` from the `FETCH_RESPONSE` contains the local `ReplicaKey`. We go through the new voter sets in `partitionState.updateState()`, and remember we do not need the HWM here because to get to the `HAS_NOT_JOINED` state we had to be "caught up" at that point in time. I want to highlight the above two cases to say that despite the desired semantic seeming very simple (auto-join at most once per process lifetime), implementing this properly is not straightforward at all, since there are a ton of edge cases because our local node's voter set can be out-of-date by an arbitrary number of VotersRecords. Additionally, at least for me, trying to verify that this implementation does indeed match our desired semantic is just as difficult (I do think I am out of edge cases now... no promises though 😆 ). Let me know your thoughts @TaiJuWu @showuon @jsancio. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
