[ https://issues.apache.org/jira/browse/KAFKA-17146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18045994#comment-18045994 ]
ASF GitHub Bot commented on KAFKA-17146:
----------------------------------------
mumrah closed pull request #620: KAFKA-17146 Include note to remove migration
znode
URL: https://github.com/apache/kafka-site/pull/620
> ZK to KRAFT migration stuck in pre-migration mode
> -------------------------------------------------
>
> Key: KAFKA-17146
> URL: https://issues.apache.org/jira/browse/KAFKA-17146
> Project: Kafka
> Issue Type: Bug
> Components: controller, kraft, migration
> Affects Versions: 3.7.0, 3.7.1
> Environment: Isolated virtual machines: 3 VMs with Kafka brokers + 3 with
> ZooKeeper/KRaft
> Reporter: Simone Brundu
> Priority: Blocker
> Labels: kraft, migration, zookeeper
>
> I'm performing a migration from ZooKeeper to KRaft on a Kafka 3.7.1 cluster.
> (EDIT: the same issue happens with version 3.7.0.)
> I'm using this configuration to enable SSL everywhere, with SCRAM
> authentication for brokers and PLAIN authentication for controllers:
> {code:java}
> listener.security.protocol.map=EXTERNAL_SASL:SASL_SSL,CONTROLLER:SASL_SSL
> inter.broker.listener.name=EXTERNAL_SASL
> sasl.enabled.mechanisms=SCRAM-SHA-512,PLAIN
> sasl.mechanism=SCRAM-SHA-512
> sasl.mechanism.controller.protocol=PLAIN
> sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512 {code}
> The cluster initially has 3 brokers and 3 ZooKeeper nodes; a quorum of 3 KRaft
> controllers is then configured and running in parallel, as per the
> documentation for the migration process (a sketch of the controller
> configuration follows).
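> This is roughly the shape of the controller configuration used for the
> migration. It is a minimal sketch based on the ZooKeeper-to-KRaft migration
> documentation; node ids, hostnames, ports and the ZooKeeper connect string are
> placeholders, not the exact values from this cluster:
> {code:java}
> # controller.properties on each of the 3 KRaft controllers (ids 1000/2000/3000)
> process.roles=controller
> node.id=1000
> controller.quorum.voters=1000@vmk-ctrl-01:9093,2000@vmk-ctrl-02:9093,3000@vmk-ctrl-03:9093
> controller.listener.names=CONTROLLER
> listeners=CONTROLLER://:9093
> listener.security.protocol.map=CONTROLLER:SASL_SSL
> sasl.mechanism.controller.protocol=PLAIN
> # Enable the migration on the controller quorum
> zookeeper.metadata.migration.enable=true
> # ZooKeeper connection is still required while the migration is in progress
> zookeeper.connect=vmk-zk-01:2181,vmk-zk-02:2181,vmk-zk-03:2181
> # (plus the settings the controller needs to reach the brokers' inter-broker listener) {code}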
> I started the migration with the 3 controllers enrolled via SASL_SSL with
> PLAIN authentication, and I immediately see a strange TRACE log:
> {code:java}
> TRACE [KRaftMigrationDriver id=3000] Received metadata delta, but the
> controller is not in dual-write mode. Ignoring the change to be replicated to
> Zookeeper (org.apache.kafka.metadata.migration.KRaftMigrationDriver) {code}
> Later, this message appears, where KRaft is waiting for the brokers to register:
> {code:java}
> INFO [KRaftMigrationDriver id=1000] No brokers are known to KRaft, waiting
> for brokers to register.
> (org.apache.kafka.metadata.migration.KRaftMigrationDriver) {code}
> As soon as I start reconfiguring the brokers so that they connect to the new
> controllers, everything looks good on the KRaft controllers, with
> notifications that the ZK brokers are connecting and registering correctly
> (the broker-side migration settings are sketched after the log excerpt below):
> {code:java}
> INFO [QuorumController id=1000] Replayed initial RegisterBrokerRecord for
> broker 1: RegisterBrokerRecord(brokerId=1, isMigratingZkBroker=true,
> incarnationId=xxxxxx, brokerEpoch=2638,
> endPoints=[BrokerEndpoint(name='EXTERNAL_SASL', host='vmk-tdtkafka-01',
> port=9095, securityProtocol=3)],
> features=[BrokerFeature(name='metadata.version', minSupportedVersion=19,
> maxSupportedVersion=19)], rack='zur1', fenced=true,
> inControlledShutdown=false, logDirs=[xxxxxx])
> (org.apache.kafka.controller.ClusterControlManager)
> [...]
> INFO [KRaftMigrationDriver id=1000] Still waiting for ZK brokers [2, 3] to
> register with KRaft.
> (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
> [...]
> INFO [KRaftMigrationDriver id=1000] Still waiting for ZK brokers [2] to
> register with KRaft.
> (org.apache.kafka.metadata.migration.KRaftMigrationDriver) {code}
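> The broker-side reconfiguration adds, roughly, the following migration
> settings on top of the SASL/SSL properties shown earlier (again a minimal
> sketch based on the migration documentation; hostnames, ports and the
> inter.broker.protocol.version value are placeholders):
> {code:java}
> # server.properties additions on each ZK broker
> # (the existing zookeeper.connect and listener settings are kept)
> inter.broker.protocol.version=3.7
> # Enable the migration on the broker
> zookeeper.metadata.migration.enable=true
> # Point the broker at the KRaft controller quorum
> controller.quorum.voters=1000@vmk-ctrl-01:9093,2000@vmk-ctrl-02:9093,3000@vmk-ctrl-03:9093
> controller.listener.names=CONTROLLER
> # Add the CONTROLLER mapping next to the existing listener mappings
> listener.security.protocol.map=EXTERNAL_SASL:SASL_SSL,CONTROLLER:SASL_SSL {code}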
> As soon as the first broker connects, we start getting these INFO logs related
> to the migration process on the controller:
> {code:java}
> INFO [QuorumController id=1000] Cannot run write operation maybeFenceReplicas
> in pre-migration mode. Returning NOT_CONTROLLER.
> (org.apache.kafka.controller.QuorumController)
> INFO [QuorumController id=1000] maybeFenceReplicas: event failed with
> NotControllerException in 355 microseconds. Exception message: The controller
> is in pre-migration mode. (org.apache.kafka.controller.QuorumController){code}
> but also requests to auto-create topics that already exist, looping every
> 30 seconds, on the last restarted broker:
> {code:java}
> INFO Sent auto-creation request for Set(_schemas) to the active controller.
> (kafka.server.DefaultAutoTopicCreationManager)
> INFO Sent auto-creation request for Set(_schemas) to the active controller.
> (kafka.server.DefaultAutoTopicCreationManager)
> INFO Sent auto-creation request for Set(_schemas) to the active controller.
> (kafka.server.DefaultAutoTopicCreationManager) {code}
> As long as a controller is still running in the old cluster (on the Kafka
> brokers), everything runs fine. As soon as the last node is restarted, things
> go off the rails: this last node never gets any partition assigned and the
> cluster stays forever with under-replicated partitions. This is the log from
> the registration of the last node, which should start the migration, but the
> cluster stays forever in the *SYNC_KRAFT_TO_ZK* state in *pre-migration* mode.
> {code:java}
> INFO [QuorumController id=1000] The request from broker 2 to unfence has been
> granted because it has caught up with the offset of its register broker
> record 4101
> [...]
> INFO [KRaftMigrationDriver id=1000] Ignoring image
> MetadataProvenance(lastContainedOffset=4127, lastContainedEpoch=5,
> lastContainedLogTimeMs=1721133091831) which does not contain a superset of
> the metadata in ZK. Staying in SYNC_KRAFT_TO_ZK until a newer image is loaded
> (org.apache.kafka.metadata.migration.KRaftMigrationDriver) {code}
> The only way to recover the cluster is to revert everything: stop the
> clusters, remove /controller from ZooKeeper, and restore the ZooKeeper-only
> configuration on the brokers. A cleanup of the controllers' state is necessary
> too (see the ZooKeeper cleanup sketch below). The migration never starts and
> the controllers never understand that they have to migrate the data from
> ZooKeeper. More than that, the new controller claims to be the CONTROLLER but
> refuses to act as one.
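> For reference, the ZooKeeper-side cleanup during the rollback looks roughly
> like this. This is an assumption based on the revert procedure in the
> migration documentation (the /migration znode removal is what the closed
> kafka-site PR adds a note about); the connect string is a placeholder:
> {code:java}
> # Run against one of the ZooKeeper nodes after stopping the KRaft controllers
> bin/zookeeper-shell.sh vmk-zk-01:2181 delete /controller
> bin/zookeeper-shell.sh vmk-zk-01:2181 deleteall /migration {code}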
--
This message was sent by Atlassian Jira
(v8.20.10#820010)