kevin-wu24 commented on PR #20859:
URL: https://github.com/apache/kafka/pull/20859#issuecomment-3584172984
> we can resolve this issue because the state will move to HAS_JOINED when finding the voter contains itself. But we can handle the issue separately since it's not directly related to this issue. Opened [KAFKA-19933](https://issues.apache.org/jira/browse/KAFKA-19933).

The follower can only move to HAS_JOINED by processing a FETCH_RESPONSE, which is too late (i.e. the local node has already sent an "incorrect" auto-join). In the first scenario, depending on whether the `UpdateVoterSet` timer has expired, the follower will either send an "incorrect" AddVoterRequest or send a FetchRequest (which leads to the correct state, assuming we get a response). IMO this is the harder case to handle, and handling it on the leader side via `KAFKA-19933` is kind of "wrong" too. I think the observer needs to look at its fetch timer: if that timer has expired, sending a FETCH should take priority over running the auto-join algorithm. That way, long-partitioned followers will fetch first and avoid the case above. The proposed fix here is to add `&& !hasFetchTimeoutExpired()` to `shouldSendAddOrRemoveVoter`.

I think my previous assertion about the conditions for transitioning from `HAS_NOT_JOINED -> HAS_JOINED` is incorrect. Suppose the local node C currently sees the voter set as (A,B), is in the `HAS_NOT_JOINED` state, and is so far behind that it needs two fetches to catch up: the first fetch contains no new VotersRecords, and only the second would contain the records (A,B,C) and (A,B). In that case C would still try to auto-join "incorrectly" between the two fetches... We might need to add another clause to `shouldSendAddOrRemoveVoter`: `&& local node LEO >= leader HWM` (i.e. only auto-join once the local log end offset has caught up to the leader's high watermark).
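The fetch-timer priority proposed above can be sketched as follows. This is a minimal, hypothetical model, not Kafka's `KafkaRaftClient` code: `AutoJoinGuard`, `ObserverState`, `Action`, and `nextAction` are illustrative names, only the "add" path of auto-join is modeled, and the base auto-join condition (UpdateVoterSet timer expired and the local node absent from its known voter set) is an assumption. The stand-in `shouldSendAddOrRemoveVoter` simply appends the proposed `!hasFetchTimeoutExpired()` clause, written here as `!s.fetchTimeoutExpired()`.

```java
import java.util.Set;

// Hypothetical, simplified model of an observer's auto-join decision.
// These are NOT Kafka's real classes; all names are illustrative.
final class AutoJoinGuard {
    private AutoJoinGuard() {}

    /** Snapshot of what the local (observer) node knows; a hypothetical type. */
    record ObserverState(
            String localId,
            Set<String> knownVoters,          // voter set from the local, possibly stale, log
            boolean updateVoterSetTimerExpired,
            boolean fetchTimeoutExpired) {}

    enum Action { SEND_FETCH, SEND_ADD_VOTER, NONE }

    /** Assumed auto-join condition plus the proposed fetch-timeout clause. */
    static boolean shouldSendAddOrRemoveVoter(ObserverState s) {
        return s.updateVoterSetTimerExpired()
                && !s.knownVoters().contains(s.localId())
                && !s.fetchTimeoutExpired(); // proposed: never auto-join while the fetch timer has expired
    }

    /** A stale observer fetches first; auto-join is only considered with a fresh fetch timer. */
    static Action nextAction(ObserverState s) {
        if (s.fetchTimeoutExpired()) {
            return Action.SEND_FETCH;
        }
        if (shouldSendAddOrRemoveVoter(s)) {
            return Action.SEND_ADD_VOTER;
        }
        return Action.NONE;
    }
}
```

For a long-partitioned node C that still sees (A,B) and whose fetch timer has expired, `nextAction` returns `SEND_FETCH` rather than `SEND_ADD_VOTER`; the AddVoter path is only reachable once a fetch has reset the timer. This keeps the decision on the observer side, which is where the comment argues it belongs.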

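The two-fetch scenario, and why a `LEO >= HWM` clause would block the incorrect auto-join, can be illustrated with a small simulation. Again this is a hypothetical sketch rather than Kafka code: `CatchUpGuardSketch`, `Entry`, `Node`, `applyFetch`, and the offsets used are invented for illustration. It assumes, as the comment implies, that a node moves to `HAS_JOINED` as soon as it applies a VotersRecord containing itself and never auto-joins after that, and that `leaderHighWatermark` is the high watermark the leader last reported to the observer.

```java
import java.util.List;
import java.util.Set;

// Hypothetical simulation of the two-fetch scenario; not Kafka code, all names are illustrative.
final class CatchUpGuardSketch {
    private CatchUpGuardSketch() {}

    enum JoinState { HAS_NOT_JOINED, HAS_JOINED }

    /** A log entry: an ordinary record (voters == null) or a VotersRecord carrying a voter set. */
    record Entry(Set<String> voters) {
        static Entry data() { return new Entry(null); }
        static Entry votersRecord(String... ids) { return new Entry(Set.of(ids)); }
    }

    static final class Node {
        final String id;
        Set<String> voters;
        JoinState joinState = JoinState.HAS_NOT_JOINED;
        long logEndOffset;

        Node(String id, Set<String> voters, long logEndOffset) {
            this.id = id;
            this.voters = voters;
            this.logEndOffset = logEndOffset;
        }

        /** Appends one FETCH response's entries, tracking the latest voter set and join state. */
        void applyFetch(List<Entry> entries) {
            for (Entry e : entries) {
                if (e.voters() != null) {
                    voters = e.voters();
                    // Assumed: seeing itself in any VotersRecord moves the node to HAS_JOINED for good.
                    if (voters.contains(id)) {
                        joinState = JoinState.HAS_JOINED;
                    }
                }
                logEndOffset++;
            }
        }

        /** Auto-join check without any catch-up requirement. */
        boolean shouldAutoJoinUnguarded() {
            return joinState == JoinState.HAS_NOT_JOINED && !voters.contains(id);
        }

        /** Same check plus the proposed clause: local LEO >= leader HWM. */
        boolean shouldAutoJoin(long leaderHighWatermark) {
            return shouldAutoJoinUnguarded() && logEndOffset >= leaderHighWatermark;
        }
    }
}
```

Starting C at LEO 10 with a leader HWM of 14, a first fetch of two ordinary records leaves C at LEO 12 with voter set (A,B): the unguarded check fires, while the guarded one does not. A second fetch carrying (A,B,C) and then (A,B) brings C to LEO 14 in `HAS_JOINED`, so neither check asks it to rejoin.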