[ https://issues.apache.org/jira/browse/KAFKA-18871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17932017#comment-17932017 ]

Daniel Urban commented on KAFKA-18871:
--------------------------------------

[~showuon] I managed to reproduce the issue with debug logs enabled and 
attached the new files. When checking the operator logs, look for the 
csm-op-test-kraft-rollback-89a41d95 namespace.

Among the log lines you were looking for, I found entries like this:

 
{code:java}
2025-03-03 15:28:55,343 INFO [Partition kraft-test-topic-1 broker=0] Failed to 
alter partition to PendingExpandIsr(newInSyncReplicaId=1, 
sentLeaderAndIsr=LeaderAndIsr(leader=0, leaderEpoch=4, 
isrWithBrokerEpoch=List(BrokerState(brokerId=2, brokerEpoch=-1), 
BrokerState(brokerId=0, brokerEpoch=-1), BrokerState(brokerId=1, 
brokerEpoch=-1)), leaderRecoveryState=RECOVERED, partitionEpoch=7), 
leaderRecoveryState=RECOVERED, 
lastCommittedState=CommittedPartitionState(isr=Set(2, 0), 
leaderRecoveryState=RECOVERED)) since the controller rejected the request with 
INELIGIBLE_REPLICA. Partition state has been reset to the latest committed 
state CommittedPartitionState(isr=Set(2, 0), leaderRecoveryState=RECOVERED). 
(kafka.cluster.Partition) 
[zk-broker-0-to-controller-alter-partition-channel-manager] {code}
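
My reading of this line: the leader (broker 0) proposed expanding the ISR with 
broker 1 (newInSyncReplicaId=1), the controller rejected the AlterPartition 
request with INELIGIBLE_REPLICA, and the partition fell back to the committed 
ISR Set(2, 0). Below is a minimal Java model of that transition, just to make 
the apparent loop explicit; the real logic lives in kafka.cluster.Partition 
(Scala), so this is only an illustration, not the actual code:

{code:java}
import java.util.Set;

// Simplified model of the broker-side ISR handling behind the log line above
// (illustration only; the real logic is in kafka.cluster.Partition). A pending
// ISR expansion rejected with INELIGIBLE_REPLICA is discarded, and the leader
// keeps the old ISR until the follower looks caught up again and the
// expansion is re-proposed.
public class IsrExpansionSketch {

    // Last ISR committed by the controller, e.g. Set(2, 0).
    private final Set<Integer> committedIsr = Set.of(2, 0);
    // In-flight expansion proposed via AlterPartition, or null if none.
    private Set<Integer> pendingIsr;

    void proposeExpansion(int caughtUpReplicaId) {
        // Follower 1 has caught up, so the leader proposes ISR = {2, 0, 1}
        // and sends an AlterPartition request to the controller.
        pendingIsr = Set.of(2, 0, caughtUpReplicaId);
        System.out.println("Proposed ISR expansion to " + pendingIsr);
    }

    void onAlterPartitionRejected(String error) {
        if ("INELIGIBLE_REPLICA".equals(error)) {
            // The controller refused to add the replica: discard the pending
            // state and fall back to the last committed ISR. The expansion is
            // only re-attempted later and, in this reproduction, appears to be
            // rejected again every time.
            pendingIsr = null;
            System.out.println("Partition state reset to committed ISR " + committedIsr);
        }
    }

    public static void main(String[] args) {
        IsrExpansionSketch partition = new IsrExpansionSketch();
        partition.proposeExpansion(1);                            // broker 1 caught up
        partition.onAlterPartitionRejected("INELIGIBLE_REPLICA"); // controller says no
    }
}
{code}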
 

 

> KRaft migration rollback causes downtime
> ----------------------------------------
>
>                 Key: KAFKA-18871
>                 URL: https://issues.apache.org/jira/browse/KAFKA-18871
>             Project: Kafka
>          Issue Type: Bug
>          Components: kraft, migration
>    Affects Versions: 3.9.0
>            Reporter: Daniel Urban
>            Priority: Critical
>         Attachments: cluster-operator.log.zip, controller_logs.zip, 
> kraft-rollback-bug.zip, kraft-rollback-kafka-default-pool-0.zip, 
> kraft-rollback-kafka-default-pool-1.zip, 
> kraft-rollback-kafka-default-pool-2.zip, 
> strimzi-cluster-operator-7bc47d488f-p4ltv.zip
>
>
> When testing the KRaft migration rollback feature, we found the following 
> scenario:
>  # Execute the KRaft migration on a 3-broker, 3-ZK-node cluster up to the 
> last step, but do not finalize the migration.
>  ## In the test, we put a slow but continuous produce+consume load on the 
> cluster, with a topic (partitions=3, RF=3, min ISR=2).
>  # Start the rollback procedure
>  # First we roll back the brokers from KRaft mode to migration mode (both 
> controller and ZK configs are set, {{process.roles}} is removed, and 
> {{zookeeper.metadata.migration.enable}} is true).
>  # Then we delete the KRaft controllers and delete the /controller znode.
>  # Then we immediately start rolling the brokers one by one to ZK mode by 
> removing the {{zookeeper.metadata.migration.enable}} flag and the 
> controller.* configurations (see the configuration sketch right after this 
> scenario).
>  # At this point, when we restart the 1st broker (let's call it broker 0) 
> into ZK mode, we hit an issue roughly 1 out of 20 times:
> If broker 0 is not in the ISR for one of the partitions when it is restarted, 
> it can simply never rejoin the ISR. Since we are aiming for zero downtime, we 
> check the ISR state of every partition between broker restarts (the check is 
> sketched further below, after the list of details), and our rollback process 
> gets blocked at this point. We have tried multiple workarounds, but none of 
> them seems to preserve zero downtime.
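> For reference, the broker configuration difference between the two rolls 
> (steps 3 and 5 above) is roughly the following sketch; the ZK connect string 
> and the controller host/port are placeholders, not the exact values used in 
> the test:
> {code:java}
> import java.util.Properties;
> 
> // Sketch of the broker configs used in the two rollback rolls (all values
> // are placeholders).
> public class RollbackConfigSketch {
>     public static void main(String[] args) {
>         // Roll 1 (step 3): back from KRaft mode to migration mode. ZK and
>         // controller configs are both set, process.roles is removed, and
>         // the migration flag stays enabled.
>         Properties migrationMode = new Properties();
>         migrationMode.setProperty("zookeeper.connect", "zookeeper:2181");
>         migrationMode.setProperty("zookeeper.metadata.migration.enable", "true");
>         migrationMode.setProperty("controller.quorum.voters", "5@controller-5:9090");
>         migrationMode.setProperty("controller.listener.names", "CONTROLLER");
> 
>         // Roll 2 (step 5): from migration mode to plain ZK mode. The
>         // migration flag and every controller.* setting are removed; only
>         // the ZK configuration remains.
>         Properties zkMode = new Properties();
>         zkMode.setProperty("zookeeper.connect", "zookeeper:2181");
> 
>         System.out.println("migration mode: " + migrationMode);
>         System.out.println("ZK mode: " + zkMode);
>     }
> }
> {code}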
> Some more details about the process:
>  * We are using Strimzi to drive this process, but we have verified that 
> Strimzi follows the documented steps precisely.
>  * When we reach the error state, it does not matter which broker becomes the 
> controller through ZK: the brokers still in migration mode get stuck and 
> flood the logs with the following error:
> {code:java}
> 2025-02-26 10:55:21,985 WARN [RaftManager id=0] Error connecting to node 
> kraft-rollback-kafka-controller-pool-5.kraft-rollback-kafka-kafka-brokers.csm-op-test-kraft-rollback-e7798bef.svc.cluster.local:9090
>  (id: 5 rack: null) (org.apache.kafka.clients.NetworkClient) 
> [kafka-raft-outbound-request-thread]
> java.net.UnknownHostException: 
> kraft-rollback-kafka-controller-pool-5.kraft-rollback-kafka-kafka-brokers.csm-op-test-kraft-rollback-e7798bef.svc.cluster.local
>         at 
> java.base/java.net.InetAddress$CachedAddresses.get(InetAddress.java:801)
>         at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1533)
>         at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1385)
>         at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1306)
>         at 
> org.apache.kafka.clients.DefaultHostResolver.resolve(DefaultHostResolver.java:27)
>         at org.apache.kafka.clients.ClientUtils.resolve(ClientUtils.java:125)
>         at 
> org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.resolveAddresses(ClusterConnectionStates.java:536)
>         at 
> org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.currentAddress(ClusterConnectionStates.java:511)
>         at 
> org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.access$200(ClusterConnectionStates.java:466)
>         at 
> org.apache.kafka.clients.ClusterConnectionStates.currentAddress(ClusterConnectionStates.java:173)
>         at 
> org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:1075)
>         at 
> org.apache.kafka.clients.NetworkClient.ready(NetworkClient.java:321)
>         at 
> org.apache.kafka.server.util.InterBrokerSendThread.sendRequests(InterBrokerSendThread.java:146)
>         at 
> org.apache.kafka.server.util.InterBrokerSendThread.pollOnce(InterBrokerSendThread.java:109)
>         at 
> org.apache.kafka.server.util.InterBrokerSendThread.doWork(InterBrokerSendThread.java:137)
>         at 
> org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:136)
>  {code}
>  * We manually verified the last offsets of the replicas, and broker 0 is 
> caught up in the affected partition.
>  * Even after stopping the produce load, the issue persists.
>  * Even after removing the /controller znode manually (to retrigger the 
> controller election), the issue persists regardless of which broker becomes 
> the controller.
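> The ISR check we run between broker restarts (mentioned in step 6 above) is 
> conceptually equivalent to the sketch below; the bootstrap address is a 
> placeholder and Strimzi implements the check differently, so this is only 
> meant to show the condition that blocks the roll:
> {code:java}
> import java.util.List;
> import java.util.Properties;
> import java.util.Set;
> import java.util.stream.Collectors;
> import org.apache.kafka.clients.admin.Admin;
> import org.apache.kafka.clients.admin.AdminClientConfig;
> import org.apache.kafka.clients.admin.TopicDescription;
> import org.apache.kafka.common.Node;
> 
> // Sketch of the per-partition ISR check: a partition whose ISR does not
> // contain all of its assigned replicas blocks the next broker restart.
> public class IsrCheckSketch {
>     public static void main(String[] args) throws Exception {
>         Properties props = new Properties();
>         props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
>         try (Admin admin = Admin.create(props)) {
>             TopicDescription description = admin.describeTopics(List.of("kraft-test-topic"))
>                     .allTopicNames().get()
>                     .get("kraft-test-topic");
>             description.partitions().forEach(p -> {
>                 Set<Integer> isr = p.isr().stream().map(Node::id).collect(Collectors.toSet());
>                 Set<Integer> replicas = p.replicas().stream().map(Node::id).collect(Collectors.toSet());
>                 if (!isr.containsAll(replicas)) {
>                     System.out.printf("partition %d blocks the roll: replicas=%s isr=%s%n",
>                             p.partition(), replicas, isr);
>                 }
>             });
>         }
>     }
> }
> {code}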
> Based on the above, it seems that during the rollback, brokers in migration 
> mode cannot handle the KRaft controllers being removed from the system. Since 
> broker 0 is caught up in the partition, we suspect that the other brokers 
> (still in migration mode) do not respect the controller state in ZK, and do 
> not report changes in the ISR of the partitions they are leading.
> This means that if a replica becomes out of sync in the last restart (e.g. 
> due to a slow broker restart), we cannot restart the brokers while ensuring 
> zero downtime.


