[ https://issues.apache.org/jira/browse/KAFKA-16171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Arthur resolved KAFKA-16171.
----------------------------------
Resolution: Fixed
> Controller failover during ZK migration can prevent metadata updates to ZK brokers
> ----------------------------------------------------------------------------------
>
> Key: KAFKA-16171
> URL: https://issues.apache.org/jira/browse/KAFKA-16171
> Project: Kafka
> Issue Type: Bug
> Components: controller, kraft, migration
> Affects Versions: 3.6.0, 3.7.0, 3.6.1
> Reporter: David Arthur
> Assignee: David Arthur
> Priority: Blocker
> Fix For: 3.6.2, 3.7.0
>
>
> h2. Description
> During the ZK migration, after KRaft becomes the active controller we enter a
> state called hybrid mode, in which the cluster contains a mixture of ZK and
> KRaft brokers. The KRaft controller updates the ZK brokers using the
> deprecated controller RPCs (LeaderAndIsr, UpdateMetadata, etc.).
>
> A race condition exists where the KRaft controller can get stuck in a retry
> loop while initializing itself after a failover, which prevents it from
> sending these RPCs to the ZK brokers.
> h2. Impact
> Since the KRaft controller cannot send any RPCs to the ZK brokers, the ZK
> brokers will not receive any metadata updates. The ZK brokers will still be
> able to send requests to the controller (such as AlterPartitions), but the
> metadata updates that result from those requests will never reach them. From
> the ZK brokers' perspective, the controller essentially looks unavailable.
> h2. Detection and Mitigation
> This bug can be seen by observing failed ZK writes from a recently elected
> controller.
> The tell-tale error message is:
> {code:java}
> Check op on KRaft Migration ZNode failed. Expected zkVersion = 507823. This indicates that another KRaft controller is making writes to ZooKeeper. {code}
> with a stacktrace like:
> {noformat}
> java.lang.RuntimeException: Check op on KRaft Migration ZNode failed. Expected zkVersion = 507823. This indicates that another KRaft controller is making writes to ZooKeeper.
>     at kafka.zk.KafkaZkClient.handleUnwrappedMigrationResult$1(KafkaZkClient.scala:2613)
>     at kafka.zk.KafkaZkClient.unwrapMigrationResponse$1(KafkaZkClient.scala:2639)
>     at kafka.zk.KafkaZkClient.$anonfun$retryMigrationRequestsUntilConnected$2(KafkaZkClient.scala:2664)
>     at scala.collection.StrictOptimizedIterableOps.map(StrictOptimizedIterableOps.scala:100)
>     at scala.collection.StrictOptimizedIterableOps.map$(StrictOptimizedIterableOps.scala:87)
>     at scala.collection.mutable.ArrayBuffer.map(ArrayBuffer.scala:43)
>     at kafka.zk.KafkaZkClient.retryMigrationRequestsUntilConnected(KafkaZkClient.scala:2664)
>     at kafka.zk.migration.ZkTopicMigrationClient.$anonfun$createTopic$1(ZkTopicMigrationClient.scala:158)
>     at kafka.zk.migration.ZkTopicMigrationClient.createTopic(ZkTopicMigrationClient.scala:141)
>     at org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.lambda$handleTopicsSnapshot$27(KRaftMigrationZkWriter.java:441)
>     at org.apache.kafka.metadata.migration.KRaftMigrationDriver.applyMigrationOperation(KRaftMigrationDriver.java:262)
>     at org.apache.kafka.metadata.migration.KRaftMigrationDriver.access$300(KRaftMigrationDriver.java:64)
>     at org.apache.kafka.metadata.migration.KRaftMigrationDriver$SyncKRaftMetadataEvent.lambda$run$0(KRaftMigrationDriver.java:791)
>     at org.apache.kafka.metadata.migration.KRaftMigrationDriver.lambda$countingOperationConsumer$6(KRaftMigrationDriver.java:880)
>     at org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.lambda$handleTopicsSnapshot$28(KRaftMigrationZkWriter.java:438)
>     at java.base/java.lang.Iterable.forEach(Iterable.java:75)
>     at org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.handleTopicsSnapshot(KRaftMigrationZkWriter.java:436)
>     at org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.handleSnapshot(KRaftMigrationZkWriter.java:115)
>     at org.apache.kafka.metadata.migration.KRaftMigrationDriver$SyncKRaftMetadataEvent.run(KRaftMigrationDriver.java:790)
>     at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:127)
>     at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210)
>     at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181)
>     at java.base/java.lang.Thread.run(Thread.java:1583)
>     at org.apache.kafka.common.utils.KafkaThread.run(KafkaThread.java:66){noformat}
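> To double-check that the active controller is stuck on a stale version, the
> "Expected zkVersion" in the error can be compared against the current version
> of the /migration znode. A minimal sketch using the ZooKeeper Java client (the
> connection string is a placeholder for the cluster's zookeeper.connect value):
> {code:java}
> import java.util.concurrent.CountDownLatch;
> import org.apache.zookeeper.Watcher;
> import org.apache.zookeeper.ZooKeeper;
> import org.apache.zookeeper.data.Stat;
>
> public class CheckMigrationZnodeVersion {
>     public static void main(String[] args) throws Exception {
>         CountDownLatch connected = new CountDownLatch(1);
>         // "localhost:2181" is a placeholder; use the cluster's zookeeper.connect string.
>         ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {
>             if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
>                 connected.countDown();
>             }
>         });
>         connected.await();
>         try {
>             Stat stat = zk.exists("/migration", false);
>             if (stat == null) {
>                 System.out.println("/migration znode not found");
>             } else {
>                 // Compare with the "Expected zkVersion = ..." value in the error message.
>                 // A higher actual version means the controller is retrying with a stale version.
>                 System.out.println("Current /migration zkVersion = " + stat.getVersion());
>             }
>         } finally {
>             zk.close();
>         }
>     }
> } {code}
> Alternatively, the {{stat /migration}} command in zookeeper-shell prints the
> same dataVersion.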
> To mitigate this problem, a new KRaft controller should be elected. This can
> be done by restarting the problematic active controller. To verify that the
> new controller does not encounter the race condition, look for a log message
> like the following on the new active controller:
> {code:java}
> [KRaftMigrationDriver id=9991] 9991 transitioning from SYNC_KRAFT_TO_ZK to KRAFT_CONTROLLER_TO_BROKER_COMM state {code}
>
> h2. Details
> Controller A loses leadership via a Raft event (e.g., from a timeout in the
> Raft layer). A KRaftLeaderEvent is added to KRaftMigrationDriver's event queue
> behind any pending MetadataChangeEvents, so Controller A does not process its
> loss of leadership until those pending ZK writes have been attempted.
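> For intuition only, a sketch of the FIFO ordering at play (an ordinary
> single-threaded executor standing in for the migration driver's event queue;
> this is not Kafka code):
> {code:java}
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
>
> public class EventOrdering {
>     public static void main(String[] args) {
>         // Events are handled strictly in the order they were enqueued.
>         ExecutorService queue = Executors.newSingleThreadExecutor();
>         for (int i = 1; i <= 3; i++) {
>             int n = i;
>             queue.submit(() -> System.out.println("MetadataChangeEvent " + n + ": attempt ZK write"));
>         }
>         // Queued behind the pending metadata events, so the old controller keeps
>         // writing to ZK with its old epoch until this event finally runs.
>         queue.submit(() -> System.out.println("KRaftLeaderEvent: resign as active controller"));
>         queue.shutdown();
>     }
> } {code}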
>
> Controller B is elected and a KRaftLeaderEvent is added to
> KRaftMigrationDriver's queue. Since this controller was previously inactive,
> its queue is empty and it processes the event immediately. This event simply
> loads the migration state from ZK (/migration) to check whether the migration
> has been completed; this information determines the downstream transitions in
> the state machine. Because the migration is done, Controller B passes through
> WAIT_FOR_ACTIVE_CONTROLLER and transitions to BECOME_CONTROLLER. While
> handling the BecomeZkControllerEvent, the controller forcibly takes ZK
> controller leadership by writing its ID into /controller and its epoch into
> /controller_epoch.
>
> The change to /controller_epoch causes all of Controller A's pending writes
> to fail, since each of those writes includes a check op on /controller_epoch
> as part of its multi-op write to ZK.
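> As an illustration of that fencing, a simplified sketch (not the actual
> KafkaZkClient code; the znode paths are real, but the helper method and
> payloads are made up):
> {code:java}
> import java.nio.charset.StandardCharsets;
> import java.util.List;
> import org.apache.zookeeper.KeeperException;
> import org.apache.zookeeper.Op;
> import org.apache.zookeeper.ZooKeeper;
>
> public class FencedMigrationWrite {
>     // Hypothetical helper: every dual-write is wrapped in a single ZK multi-op.
>     static void writeMetadata(ZooKeeper zk, int controllerEpochZkVersion,
>                               int expectedMigrationZkVersion, String path,
>                               byte[] data) throws KeeperException, InterruptedException {
>         List<Op> ops = List.of(
>             // The whole transaction fails if another controller has bumped
>             // /controller_epoch since this controller claimed leadership...
>             Op.check("/controller_epoch", controllerEpochZkVersion),
>             // ...or if /migration has moved past the version this controller last read.
>             // (The payload here is a stand-in for the real migration state.)
>             Op.setData("/migration", "migration-state".getBytes(StandardCharsets.UTF_8),
>                        expectedMigrationZkVersion),
>             // The actual metadata write is applied only if both conditions hold.
>             Op.setData(path, data, -1));
>         zk.multi(ops); // throws KeeperException if any check or versioned op fails
>     }
> } {code}
> In this sketch, an epoch bump by a new controller fails the first op, while a
> stale /migration version fails the second; the latter is what surfaces as the
> "Check op on KRaft Migration ZNode failed" error above.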
>
> However, there is a race between Controller B loading the state from
> /migration and updating /controller_epoch. In that window, Controller A can
> still successfully write to ZK with its older epoch. This increments the
> znode version of /migration, so the version Controller B just read is already
> stale, and Controller B gets stuck in the retry loop described above.
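> A toy simulation of the interleaving (hypothetical code; the version numbers
> are only meant to line up with the example error message above):
> {code:java}
> import java.util.concurrent.atomic.AtomicInteger;
>
> public class MigrationZnodeRace {
>     // The /migration znode reduced to its version counter.
>     static final AtomicInteger migrationZnodeVersion = new AtomicInteger(507823);
>
>     // A conditional write in the style of ZooKeeper's setData(path, data, expectedVersion):
>     // it succeeds only if the caller's expected version matches the current one.
>     static boolean conditionalWrite(int expectedVersion) {
>         return migrationZnodeVersion.compareAndSet(expectedVersion, expectedVersion + 1);
>     }
>
>     public static void main(String[] args) {
>         // Controller B recovers and snapshots the current version of /migration.
>         int versionReadByB = migrationZnodeVersion.get(); // 507823
>
>         // Race: Controller A's last pending dual-write lands before B fences it via
>         // /controller_epoch, advancing /migration to 507824.
>         conditionalWrite(versionReadByB);
>
>         // Every later write from B retries with the stale expected version and fails.
>         boolean ok = conditionalWrite(versionReadByB);
>         System.out.println(ok
>             ? "write applied"
>             : "Check op on KRaft Migration ZNode failed. Expected zkVersion = " + versionReadByB);
>     }
> } {code}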
>
> It is safe for the old controller to be making these writes, since we only
> dual-write committed state from KRaft (i.e., "write-behind"), but this race
> leaves the new controller with a stale version of /migration.
--
This message was sent by Atlassian Jira (v8.20.10#820010)