[ https://issues.apache.org/jira/browse/KAFKA-16883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicolas Henneaux updated KAFKA-16883:
-------------------------------------
    Description: 
Despite several attempts to migrate from a ZooKeeper cluster to KRaft, the migration never completed properly.

We spawned a new, fully healthy cluster with 3 Kafka broker nodes connected to 3 ZooKeeper nodes; 3 additional Kafka nodes were provisioned for the new controllers.
This was tested with Kafka 3.6.1, 3.6.2 and 3.7.0.

The controllers start without issue. When the brokers are then configured for the migration, the migration does not start. Once the last broker is restarted, we get the following logs:
{code:java}
[2024-06-03 15:11:48,192] INFO [ReplicaFetcherThread-0-11]: Stopped (kafka.server.ReplicaFetcherThread)
[2024-06-03 15:11:48,193] INFO [ReplicaFetcherThread-0-11]: Shutdown completed (kafka.server.ReplicaFetcherThread)
{code}
Then we only get the following every 30 seconds:
{code:java}
[2024-06-03 15:12:04,163] INFO [BrokerLifecycleManager id=12 isZkBroker=true] Unable to register the broker because the RPC got timed out before it could be sent. (kafka.server.BrokerLifecycleManager)
[2024-06-03 15:12:34,297] INFO [BrokerLifecycleManager id=12 isZkBroker=true] Unable to register the broker because the RPC got timed out before it could be sent. (kafka.server.BrokerLifecycleManager)
[2024-06-03 15:13:04,536] INFO [BrokerLifecycleManager id=12 isZkBroker=true] Unable to register the broker because the RPC got timed out before it could be sent. (kafka.server.BrokerLifecycleManager)
{code}
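
To rule out a problem with the quorum itself, its status can be queried directly against a controller. A minimal sketch, assuming Kafka 3.7's {{--bootstrap-controller}} option (KIP-919) and a hypothetical {{client-ssl.properties}} carrying the SSL client settings matching the configs below:
{code}
# Ask the quorum for its status over the CONTROLLER listener (port 9093).
# client-ssl.properties is a placeholder; mutual TLS is needed because
# the controllers use ssl.client.auth=required.
bin/kafka-metadata-quorum.sh \
  --bootstrap-controller kafka0202e1.ahub.sb.eu.ginfra.net:9093 \
  --command-config client-ssl.properties \
  describe --status
{code}
If this reports a leader and current voters, the quorum is healthy and the timeouts point at the broker-to-controller path instead.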

The config on the controller node is the following:
{code:java}
kafka0202e1 ~]$ sudo grep -v '^\s*$\|^\s*\#' /etc/kafka/server.properties | grep -v password | sort
advertised.host.name=kafka0202e1.ahub.sb.eu.ginfra.net
broker.rack=e1
controller.listener.names=CONTROLLER
controller.quorum.voters=2...@kafka0202e1.ahub.sb.eu.ginfra.net:9093,2...@kafka0202e2.ahub.sb.eu.ginfra.net:9093,2...@kafka0202e3.ahub.sb.eu.ginfra.net:9093
default.replication.factor=3
delete.topic.enable=false
group.initial.rebalance.delay.ms=3000
inter.broker.protocol.version=3.7
listeners=CONTROLLER://kafka0202e1.ahub.sb.eu.ginfra.net:9093
listener.security.protocol.map=CONTROLLER:SSL,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
log.dirs=/data/kafka
log.message.format.version=3.6
log.retention.check.interval.ms=300000
log.retention.hours=240
log.segment.bytes=1073741824
min.insync.replicas=2
node.id=20
num.io.threads=8
num.network.threads=3
num.partitions=1
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=3
process.roles=controller
security.inter.broker.protocol=SSL
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
socket.send.buffer.bytes=102400
ssl.cipher.suites=TLS_AES_256_GCM_SHA384
ssl.client.auth=required
ssl.enabled.protocols=TLSv1.3
ssl.endpoint.identification.algorithm=HTTPS
ssl.keystore.location=/etc/kafka/ssl/keystore.ts
ssl.keystore.type=JKS
ssl.secure.random.implementation=SHA1PRNG
ssl.truststore.location=/etc/kafka/ssl/truststore.ts
transaction.state.log.min.isr=3
transaction.state.log.replication.factor=3
unclean.leader.election.enable=false
zookeeper.connect=10.135.65.199:2181,10.133.65.199:2181,10.137.64.56:2181,
zookeeper.metadata.migration.enable=true
{code}
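
For reference, the documented migration flow (KIP-866) requires the new KRaft controllers to be formatted with the cluster ID already stored in ZooKeeper, not a freshly generated one; a mismatched ID is one way to end up with brokers unable to register. A sketch of that provisioning step ({{<cluster-id>}} is a placeholder):
{code}
# Read the existing cluster ID from ZooKeeper.
bin/zookeeper-shell.sh 10.135.65.199:2181 get /cluster/id
# -> {"version":"1","id":"<cluster-id>"}

# Format the controller's log dir with that same ID before first start.
bin/kafka-storage.sh format \
  --config /etc/kafka/server.properties \
  --cluster-id <cluster-id>
{code}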

The config on the broker node is the following:
{code}
$ sudo grep -v '^\s*$\|^\s*\#' /etc/kafka/server.properties | grep -v password | sort
advertised.host.name=kafka0201e3.ahub.sb.eu.ginfra.net
advertised.listeners=SSL://kafka0201e3.ahub.sb.eu.ginfra.net:9092
broker.id=12
broker.rack=e3
controller.listener.names=CONTROLLER # added once all controllers were started
controller.quorum.voters=2...@kafka0202e1.ahub.sb.eu.ginfra.net:9093,2...@kafka0202e2.ahub.sb.eu.ginfra.net:9093,2...@kafka0202e3.ahub.sb.eu.ginfra.net:9093 # added once all controllers were started
default.replication.factor=3
delete.topic.enable=false
group.initial.rebalance.delay.ms=3000
inter.broker.protocol.version=3.7
listener.security.protocol.map=CONTROLLER:SSL,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
listeners=SSL://kafka0201e3.ahub.sb.eu.ginfra.net:9092
log.dirs=/data/kafka
log.retention.check.interval.ms=300000
log.retention.hours=240
log.segment.bytes=1073741824
min.insync.replicas=2
num.io.threads=8
num.network.threads=3
num.partitions=1
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=3
security.inter.broker.protocol=SSL
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
socket.send.buffer.bytes=102400
ssl.cipher.suites=TLS_AES_256_GCM_SHA384
ssl.client.auth=required
ssl.enabled.protocols=TLSv1.3
ssl.endpoint.identification.algorithm=HTTPS
ssl.keystore.location=/etc/kafka/ssl/keystore.ts
ssl.keystore.type=JKS
ssl.secure.random.implementation=SHA1PRNG
ssl.truststore.location=/etc/kafka/ssl/truststore.ts
transaction.state.log.min.isr=3
transaction.state.log.replication.factor=3
unclean.leader.election.enable=false
zookeeper.connect=10.133.65.199:2181,10.135.65.199:2181,10.137.64.56:2181,
zookeeper.connection.timeout.ms=6000
zookeeper.metadata.migration.enable=true # added once all controllers were started
{code}
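
Because the failing RPC is the broker's registration with the controllers, the mutual-TLS path to the CONTROLLER listener is worth ruling out: it is SSL with ssl.client.auth=required, so the controllers must accept the broker's certificate. A hedged hand-check from the broker host (the PEM file names are hypothetical; substitute material exported from the keystore/truststore above):
{code}
# Exercise the TLS 1.3 mutual handshake against a controller's
# CONTROLLER listener; a failure here could explain an RPC that
# "got timed out before it could be sent".
openssl s_client -connect kafka0202e1.ahub.sb.eu.ginfra.net:9093 \
  -cert broker-cert.pem -key broker-key.pem -CAfile ca.pem </dev/null
{code}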

When trying to move to the next migration step, the broker fails to reach the controller quorum and crashes:
{code}
[2024-06-03 15:33:21,553] INFO [BrokerLifecycleManager id=12] Unable to register the broker because the RPC got timed out before it could be sent. (kafka.server.BrokerLifecycleManager)
[2024-06-03 15:33:32,549] ERROR [BrokerLifecycleManager id=12] Shutting down because we were unable to register with the controller quorum. (kafka.server.BrokerLifecycleManager)
[2024-06-03 15:33:32,550] INFO [BrokerLifecycleManager id=12] Transitioning from STARTING to SHUTTING_DOWN. (kafka.server.BrokerLifecycleManager)
[2024-06-03 15:33:32,551] INFO [broker-12-to-controller-heartbeat-channel-manager]: Shutting down (kafka.server.NodeToControllerRequestThread)
[2024-06-03 15:33:32,551] INFO [broker-12-to-controller-heartbeat-channel-manager]: Shutdown completed (kafka.server.NodeToControllerRequestThread)
[2024-06-03 15:33:32,551] ERROR [BrokerServer id=12] Received a fatal error while waiting for the controller to acknowledge that we are caught up (kafka.server.BrokerServer)
java.util.concurrent.CancellationException
{code}
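
The ERROR above corresponds to the broker giving up on its initial registration. That window is governed by a broker config; a sketch, assuming the default of 60000 ms (raising it only buys more retries, it does not address why registration fails):
{code}
# How long the broker keeps retrying its initial registration with the
# controller quorum before exiting (default 60000 ms).
initial.broker.registration.timeout.ms=120000
{code}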

> Zookeeper-Kraft failing migration - RPC got timed out before it could be sent
> -----------------------------------------------------------------------------
>
>                 Key: KAFKA-16883
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16883
>             Project: Kafka
>          Issue Type: Bug
>          Components: kraft
>    Affects Versions: 3.7.0, 3.6.1, 3.6.2
>            Reporter: Nicolas Henneaux
>            Priority: Major
>


