Jason Gustafson resolved KAFKA-14154.
-------------------------------------
Resolution: Fixed

Persistent URP after controller soft failure
--------------------------------------------

Key: KAFKA-14154
URL: https://issues.apache.org/jira/browse/KAFKA-14154
Project: Kafka
Issue Type: Bug
Reporter: Jason Gustafson
Assignee: Jason Gustafson
Priority: Blocker
Fix For: 3.3.0

We ran into a scenario where a partition leader was unable to expand the ISR after a soft controller failover. Here is what happened:

Initial state: leader=1, isr=[1,2], leader epoch=10. Broker 1 is acting as the current controller.

1. Broker 1 loses its session in Zookeeper.
2. Broker 2 becomes the new controller.
3. During initialization, controller 2 removes broker 1 from the ISR, so the state is updated to: leader=2, isr=[2], leader epoch=11.
4. Broker 2 receives the `LeaderAndIsr` request from the new controller with leader epoch=11.
5. Broker 2 immediately tries to add replica 1 back to the ISR since it is still fetching and is caught up. However, the `BrokerToControllerChannelManager` is still pointed at controller 1, so that is where the `AlterPartition` request is sent.
6. Controller 1 does not yet realize that it is no longer the controller, so it processes the `AlterPartition` request. It sees the leader epoch of 11, which is higher than what it has in its own context. Following the changes to the `AlterPartition` validation in https://github.com/apache/kafka/pull/12032/files, the controller returns FENCED_LEADER_EPOCH.
7. After receiving FENCED_LEADER_EPOCH from the old controller, the leader is stuck because it assumes the error implies that another `LeaderAndIsr` request will be sent.

Prior to https://github.com/apache/kafka/pull/12032/files, we handled this case a little differently. We only verified that the leader epoch in the request was at least as large as the current epoch in the controller context; anything higher was accepted. The controller would then have attempted to write the updated state to Zookeeper. That write would have failed the controller epoch check, but in that case we would have returned NOT_CONTROLLER, which is handled by `AlterPartitionManager`. (The first sketch below contrasts the two validation strategies.)

It is tempting to revert the logic, but the risk is in the idempotency check: https://github.com/apache/kafka/pull/12032/files#diff-3e042c962e80577a4cc9bbcccf0950651c6b312097a86164af50003c00c50d37L2369. If the `AlterPartition` request happened to match the state inside the old controller, the controller would consider the update successful and return no error. But if its state was already stale at that point, the leader might incorrectly assume that the state had been updated.

One way to fix this problem without weakening the validation is to rely on the controller epoch in `AlterPartitionManager`. When we discover a new controller, we also discover its epoch, so we can pass that through. The `LeaderAndIsr` request already includes the controller epoch of the controller that sent it, and we already propagate this through to `AlterPartition.submit`. Hence all we need to do is verify that the epoch of the current controller target is at least as large as the one discovered through the `LeaderAndIsr` request; the second sketch below shows this check.
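To make the validation difference concrete, here is a minimal Scala sketch of the two strategies as described in this ticket. All names here (`validateBefore12032`, `validateAfter12032`, `zkWriteSucceeds`, the result type) are hypothetical; this is not the actual `KafkaController` code, only the decision logic outlined above.

```scala
// Hypothetical sketch of the AlterPartition leader-epoch validation,
// before and after https://github.com/apache/kafka/pull/12032.
// None of these names come from the real KafkaController code.

sealed trait AlterPartitionResult
case object Accepted extends AlterPartitionResult
case object FencedLeaderEpoch extends AlterPartitionResult
case object NotController extends AlterPartitionResult

object ValidationSketch {

  // After PR #12032: a request epoch that does not match the epoch in the
  // controller context is rejected as FENCED_LEADER_EPOCH, including a
  // *newer* epoch from a leader elected by a newer controller (step 6 above).
  // The leader then waits for a LeaderAndIsr request that the stale
  // controller never sends.
  def validateAfter12032(requestLeaderEpoch: Int,
                         contextLeaderEpoch: Int): AlterPartitionResult =
    if (requestLeaderEpoch != contextLeaderEpoch) FencedLeaderEpoch
    else Accepted

  // Before PR #12032: any epoch at least as large as the context epoch was
  // accepted, and the stale controller's subsequent Zookeeper write failed
  // the controller epoch check, surfacing as NOT_CONTROLLER, an error that
  // AlterPartitionManager handles by rediscovering the controller and
  // retrying.
  def validateBefore12032(requestLeaderEpoch: Int,
                          contextLeaderEpoch: Int,
                          zkWriteSucceeds: => Boolean): AlterPartitionResult =
    if (requestLeaderEpoch < contextLeaderEpoch) FencedLeaderEpoch
    else if (!zkWriteSucceeds) NotController
    else Accepted
}
```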
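And here is a minimal sketch of the proposed guard itself, assuming two hypothetical types: `ControllerTarget` (the controller node the channel manager is currently pointed at, plus the epoch learned when it was discovered) and `PendingIsrUpdate` (an ISR change queued via `AlterPartition.submit`, carrying the controller epoch from the `LeaderAndIsr` that installed the leader). The real `AlterPartitionManager` does not use these names; only the comparison is taken from the description above.

```scala
// Hypothetical types, not the actual AlterPartitionManager code.

// The controller the BrokerToControllerChannelManager currently points at,
// together with the epoch learned when it was discovered.
final case class ControllerTarget(brokerId: Int, controllerEpoch: Int)

// An ISR change queued via AlterPartition.submit. `controllerEpoch` is the
// epoch carried by the LeaderAndIsr request that installed this leader;
// per the description above, it is already propagated this far.
final case class PendingIsrUpdate(
  topic: String,
  partition: Int,
  leaderEpoch: Int,
  proposedIsr: List[Int],
  controllerEpoch: Int
)

object EpochGuardSketch {
  // Send the AlterPartition request only if the controller we are pointed
  // at is at least as new as the controller that elected this leader. If
  // the target is older (steps 5 and 6 above), hold the request until the
  // channel manager rediscovers the controller, rather than sending it to
  // a stale controller and getting back a spurious FENCED_LEADER_EPOCH.
  def shouldSend(target: ControllerTarget, update: PendingIsrUpdate): Boolean =
    target.controllerEpoch >= update.controllerEpoch
}
```

With a guard of this shape, the request from step 5 would be held back instead of going to controller 1; once the channel manager discovers controller 2 and its (larger) epoch, the ISR expansion is sent and validated against up-to-date state.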