[ https://issues.apache.org/jira/browse/KAFKA-17146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18045994#comment-18045994 ]
ASF GitHub Bot commented on KAFKA-17146:
----------------------------------------
mumrah closed pull request #620: KAFKA-17146 Include note to remove migration
znode
URL: https://github.com/apache/kafka-site/pull/620
> ZK to KRAFT migration stuck in pre-migration mode
> -------------------------------------------------
>
> Key: KAFKA-17146
> URL: https://issues.apache.org/jira/browse/KAFKA-17146
> Project: Kafka
> Issue Type: Bug
> Components: controller, kraft, migration
> Affects Versions: 3.7.0, 3.7.1
> Environment: Isolated virtual machines: 3 VMs with Kafka brokers + 3 with
> ZooKeeper/KRaft
> Reporter: Simone Brundu
> Priority: Blocker
> Labels: kraft, migration, zookeeper
>
> I'm performing a migration from ZooKeeper to KRaft on a Kafka 3.7.1 cluster.
> (EDIT: the same issue happens with version 3.7.0.)
> I'm using this configuration to enable SSL everywhere, with SCRAM
> authentication for brokers and PLAIN authentication for controllers:
> {code:java}
> listener.security.protocol.map=EXTERNAL_SASL:SASL_SSL,CONTROLLER:SASL_SSL
> inter.broker.listener.name=EXTERNAL_SASL
> sasl.enabled.mechanisms=SCRAM-SHA-512,PLAIN
> sasl.mechanism=SCRAM-SHA-512
> sasl.mechanism.controller.protocol=PLAIN
> sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512 {code}
> The cluster initially has 3 brokers and 3 ZooKeeper nodes; a quorum of 3 KRaft
> controllers is then configured and running in parallel, as per the
> documentation for the migration process (a sketch of the controller
> configuration follows).
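> This is roughly the shape of the controller configuration used for the
> migration. It is a minimal sketch based on the ZooKeeper-to-KRaft migration
> documentation; node ids, hostnames, ports and the ZooKeeper connect string are
> placeholders, not the exact values from this cluster:
> {code:java}
> # controller.properties on each of the 3 KRaft controllers (ids 1000/2000/3000)
> process.roles=controller
> node.id=1000
> controller.quorum.voters=1000@vmk-ctrl-01:9093,2000@vmk-ctrl-02:9093,3000@vmk-ctrl-03:9093
> controller.listener.names=CONTROLLER
> listeners=CONTROLLER://:9093
> listener.security.protocol.map=CONTROLLER:SASL_SSL
> sasl.mechanism.controller.protocol=PLAIN
> # Enable the migration on the controller quorum
> zookeeper.metadata.migration.enable=true
> # ZooKeeper connection is still required while the migration is in progress
> zookeeper.connect=vmk-zk-01:2181,vmk-zk-02:2181,vmk-zk-03:2181
> # (plus the settings the controller needs to reach the brokers' inter-broker listener) {code}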
> I started the migration with the 3 controllers enrolled via SASL_SSL with
> PLAIN authentication, and I immediately see a strange TRACE log:
> {code:java}
> TRACE [KRaftMigrationDriver id=3000] Received metadata delta, but the
> controller is not in dual-write mode. Ignoring the change to be replicated to
> Zookeeper (org.apache.kafka.metadata.migration.KRaftMigrationDriver) {code}
> Later, this message appears, where KRaft is waiting for the brokers to register:
> {code:java}
> INFO [KRaftMigrationDriver id=1000] No brokers are known to KRaft, waiting
> for brokers to register.
> (org.apache.kafka.metadata.migration.KRaftMigrationDriver) {code}
> As soon as I start reconfiguring the brokers so that they connect to the new
> controllers, everything looks good on the KRaft controllers, with
> notifications that the ZK brokers are connecting and registering correctly
> (the broker-side migration settings are sketched after the log excerpt below):
> {code:java}
> INFO [QuorumController id=1000] Replayed initial RegisterBrokerRecord for
> broker 1: RegisterBrokerRecord(brokerId=1, isMigratingZkBroker=true,
> incarnationId=xxxxxx, brokerEpoch=2638,
> endPoints=[BrokerEndpoint(name='EXTERNAL_SASL', host='vmk-tdtkafka-01',
> port=9095, securityProtocol=3)],
> features=[BrokerFeature(name='metadata.version', minSupportedVersion=19,
> maxSupportedVersion=19)], rack='zur1', fenced=true,
> inControlledShutdown=false, logDirs=[xxxxxx])
> (org.apache.kafka.controller.ClusterControlManager)
> [...]
> INFO [KRaftMigrationDriver id=1000] Still waiting for ZK brokers [2, 3] to
> register with KRaft.
> (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
> [...]
> INFO [KRaftMigrationDriver id=1000] Still waiting for ZK brokers [2] to
> register with KRaft.
> (org.apache.kafka.metadata.migration.KRaftMigrationDriver) {code}
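> The broker-side reconfiguration adds, roughly, the following migration
> settings on top of the SASL/SSL properties shown earlier (again a minimal
> sketch based on the migration documentation; hostnames, ports and the
> inter.broker.protocol.version value are placeholders):
> {code:java}
> # server.properties additions on each ZK broker
> # (the existing zookeeper.connect and listener settings are kept)
> inter.broker.protocol.version=3.7
> # Enable the migration on the broker
> zookeeper.metadata.migration.enable=true
> # Point the broker at the KRaft controller quorum
> controller.quorum.voters=1000@vmk-ctrl-01:9093,2000@vmk-ctrl-02:9093,3000@vmk-ctrl-03:9093
> controller.listener.names=CONTROLLER
> # Add the CONTROLLER mapping next to the existing listener mappings
> listener.security.protocol.map=EXTERNAL_SASL:SASL_SSL,CONTROLLER:SASL_SSL {code}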
> As soon as the first broker connects, we start getting these INFO logs related
> to the migration process on the controller:
> {code:java}
> INFO [QuorumController id=1000] Cannot run write operation maybeFenceReplicas
> in pre-migration mode. Returning NOT_CONTROLLER.
> (org.apache.kafka.controller.QuorumController)
> INFO [QuorumController id=1000] maybeFenceReplicas: event failed with
> NotControllerException in 355 microseconds. Exception message: The controller
> is in pre-migration mode. (org.apache.kafka.controller.QuorumController){code}
> but also requests to auto-create topics that already exist, looping every
> 30 seconds, on the last restarted broker:
> {code:java}
> INFO Sent auto-creation request for Set(_schemas) to the active controller.
> (kafka.server.DefaultAutoTopicCreationManager)
> INFO Sent auto-creation request for Set(_schemas) to the active controller.
> (kafka.server.DefaultAutoTopicCreationManager)
> INFO Sent auto-creation request for Set(_schemas) to the active controller.
> (kafka.server.DefaultAutoTopicCreationManager) {code}
> As long as a controller is still running in the old cluster (on the Kafka
> brokers), everything runs fine. As soon as the last node is restarted, things
> go off the rails: this last node never gets any partition assigned and the
> cluster stays forever with under-replicated partitions. This is the log from
> the registration of the last node, which should start the migration, but the
> cluster stays forever in the *SYNC_KRAFT_TO_ZK* state in *pre-migration* mode.
> {code:java}
> INFO [QuorumController id=1000] The request from broker 2 to unfence has been
> granted because it has caught up with the offset of its register broker
> record 4101
> [...]
> INFO [KRaftMigrationDriver id=1000] Ignoring image
> MetadataProvenance(lastContainedOffset=4127, lastContainedEpoch=5,
> lastContainedLogTimeMs=1721133091831) which does not contain a superset of
> the metadata in ZK. Staying in SYNC_KRAFT_TO_ZK until a newer image is loaded
> (org.apache.kafka.metadata.migration.KRaftMigrationDriver) {code}
> The only way to recover the cluster is to revert everything: stop the
> clusters, remove /controller from ZooKeeper, and restore the ZooKeeper-only
> configuration on the brokers. A cleanup of the controllers' state is necessary
> too (see the ZooKeeper cleanup sketch below). The migration never starts and
> the controllers never understand that they have to migrate the data from
> ZooKeeper. More than that, the new controller claims to be the CONTROLLER but
> refuses to act as one.
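> For reference, the ZooKeeper-side cleanup during the rollback looks roughly
> like this. This is an assumption based on the revert procedure in the
> migration documentation (the /migration znode removal is what the closed
> kafka-site PR adds a note about); the connect string is a placeholder:
> {code:java}
> # Run against one of the ZooKeeper nodes after stopping the KRaft controllers
> bin/zookeeper-shell.sh vmk-zk-01:2181 delete /controller
> bin/zookeeper-shell.sh vmk-zk-01:2181 deleteall /migration {code}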
--
This message was sent by Atlassian Jira
(v8.20.10#820010)