[ 
https://issues.apache.org/jira/browse/KAFKA-18871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17931126#comment-17931126
 ] 

Paolo Patierno commented on KAFKA-18871:
----------------------------------------

I had a look at the operator log and it looks like the operator is doing what it is supposed to do.

The brokers need to be rolled and, of course, the operator tries to start from broker 0. The issue is that broker 1 is out of sync, as you can see here:
{code:java}
2025-02-25 11:45:31 DEBUG KafkaAvailability:101 - Reconciliation #10971(watch) Kafka(csm-op-test-kraft-rollback-f19bca6a/kraft-rollback-kafka): kraft-test-topic has min.insync.replicas=2.
2025-02-25 11:45:31 INFO  KafkaAvailability:135 - Reconciliation #10971(watch) Kafka(csm-op-test-kraft-rollback-f19bca6a/kraft-rollback-kafka): kraft-test-topic/2 will be under-replicated (ISR={0,2}, replicas=[0,1,2], min.insync.replicas=2) if broker 0 is restarted.
2025-02-25 11:45:31 DEBUG KafkaAvailability:86 - Reconciliation #10971(watch) Kafka(csm-op-test-kraft-rollback-f19bca6a/kraft-rollback-kafka): Restart pod 0 would remove it from ISR, stalling producers with acks=all
{code}
So the operator is preventing broker 0 from being rolled because otherwise you would have an under-replicated partition.
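
Just to make the rule explicit, here is a minimal sketch in plain Java (not the actual Strimzi KafkaAvailability code; class and method names are made up for illustration) of the check behind those log lines: rolling a broker is allowed only if every partition it hosts still has at least min.insync.replicas in-sync replicas once that broker drops out of the ISR.
{code:java}
import java.util.Set;

// Illustrative only: NOT the Strimzi KafkaAvailability implementation,
// just the core rule the operator is logging about above.
public class AvailabilityCheckSketch {

    // Returns true if the partition would still satisfy min.insync.replicas
    // after the given broker is restarted (i.e. temporarily drops out of the ISR).
    static boolean canRestartBroker(Set<Integer> isr, int minInsyncReplicas, int brokerToRestart) {
        int isrSizeAfterRestart = isr.contains(brokerToRestart) ? isr.size() - 1 : isr.size();
        return isrSizeAfterRestart >= minInsyncReplicas;
    }

    public static void main(String[] args) {
        // kraft-test-topic/2 from the log: ISR={0,2}, min.insync.replicas=2
        System.out.println(canRestartBroker(Set.of(0, 2), 2, 0));    // false -> broker 0 cannot be rolled yet
        // once broker 1 catches up, ISR={0,1,2} and the roll becomes safe
        System.out.println(canRestartBroker(Set.of(0, 1, 2), 2, 0)); // true
    }
}
{code}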

This can happen in general and is not strictly related to a migration rollback operation. It could even happen when you apply a broker configuration change that requires rolling all the brokers, but the operator can't roll them because of the possibility of under-replicated partitions.

At some point, AFAICS, broker 0 is rolled because broker 1 seems to catch up. All three brokers have the ZooKeeper connect configuration, as they are using the ZooKeeper ensemble again.
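
Just for reference, once the rollback to ZooKeeper mode is complete, the broker configuration looks roughly like this ({{zookeeper.connect}} points at the ensemble again, while the {{controller.*}} settings and the {{zookeeper.metadata.migration.enable}} flag are gone). Purely illustrative values, not taken from the attached logs:
{code}
# Illustrative broker.properties fragment after the rollback to ZooKeeper mode
# (placeholder values, not copied from the attached cluster)
broker.id=0
zookeeper.connect=zookeeper-0:2181,zookeeper-1:2181,zookeeper-2:2181
# no process.roles, no controller.quorum.voters, no controller.listener.names,
# and zookeeper.metadata.migration.enable is removed (it defaults to false)
{code}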

> KRaft migration rollback causes downtime
> ----------------------------------------
>
>                 Key: KAFKA-18871
>                 URL: https://issues.apache.org/jira/browse/KAFKA-18871
>             Project: Kafka
>          Issue Type: Bug
>          Components: kraft, migration
>    Affects Versions: 3.9.0
>            Reporter: Daniel Urban
>            Priority: Critical
>         Attachments: kraft-rollback-bug.zip
>
>
> When testing the KRaft migration rollback feature, we found the following scenario:
>  # Execute the KRaft migration on a 3-broker, 3-ZK-node cluster up to the last step, but do not finalize the migration.
>  ## In the test, we put a slow but continuous produce+consume load on the cluster, with a topic (partitions=3, RF=3, min ISR=2).
>  # Start the rollback procedure
>  # First we roll back the brokers from KRaft mode to migration mode (both 
> controller and ZK configs are set, process roles are removed, 
> {{zookeeper.metadata.migration.enable}} is true)
>  # Then we delete the KRaft controllers, delete the /controller znode
>  # Then we immediately start rolling the brokers 1 by 1 to ZK mode by 
> removing the {{zookeeper.metadata.migration.enable}} flag and the 
> controller.* configurations.
>  # At this point, when we restart the 1st broker (let's call it broker 0) into ZK mode, we find an issue that occurs ~1 out of 20 times:
> If broker 0 is not in the ISR for one of the partitions at this point, it can simply never become part of the ISR. Since we are aiming for zero downtime, we check the ISR state of the partitions between broker restarts, and our process gets blocked here. We have tried multiple workarounds, but it seems that there is no workaround that still ensures zero downtime.
> Some more details about the process:
>  * We are using Strimzi to drive this process, but have verified that Strimzi 
> follows the documented steps precisely.
>  * When we reach the error state, it doesn't matter which broker becomes the controller through the ZK znode: the brokers still in migration mode get stuck, and they flood the logs with the following error:
> {code:java}
> 2025-02-26 10:55:21,985 WARN [RaftManager id=0] Error connecting to node kraft-rollback-kafka-controller-pool-5.kraft-rollback-kafka-kafka-brokers.csm-op-test-kraft-rollback-e7798bef.svc.cluster.local:9090 (id: 5 rack: null) (org.apache.kafka.clients.NetworkClient) [kafka-raft-outbound-request-thread]
> java.net.UnknownHostException: kraft-rollback-kafka-controller-pool-5.kraft-rollback-kafka-kafka-brokers.csm-op-test-kraft-rollback-e7798bef.svc.cluster.local
>         at java.base/java.net.InetAddress$CachedAddresses.get(InetAddress.java:801)
>         at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1533)
>         at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1385)
>         at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1306)
>         at org.apache.kafka.clients.DefaultHostResolver.resolve(DefaultHostResolver.java:27)
>         at org.apache.kafka.clients.ClientUtils.resolve(ClientUtils.java:125)
>         at org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.resolveAddresses(ClusterConnectionStates.java:536)
>         at org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.currentAddress(ClusterConnectionStates.java:511)
>         at org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.access$200(ClusterConnectionStates.java:466)
>         at org.apache.kafka.clients.ClusterConnectionStates.currentAddress(ClusterConnectionStates.java:173)
>         at org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:1075)
>         at org.apache.kafka.clients.NetworkClient.ready(NetworkClient.java:321)
>         at org.apache.kafka.server.util.InterBrokerSendThread.sendRequests(InterBrokerSendThread.java:146)
>         at org.apache.kafka.server.util.InterBrokerSendThread.pollOnce(InterBrokerSendThread.java:109)
>         at org.apache.kafka.server.util.InterBrokerSendThread.doWork(InterBrokerSendThread.java:137)
>         at org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:136)
> {code}
>  * We manually verified the last offsets of the replicas, and broker 0 is caught up in the partition.
>  * Even after stopping the produce load, the issue persists.
>  * Even after removing the /controller znode manually (to re-trigger the election), regardless of which broker becomes the controller, the issue persists.
> Based on the above, it seems that during the rollback, brokers in migration 
> mode cannot handle the KRaft controllers being removed from the system. Since 
> broker 0 is caught up in the partition, we suspect that the other brokers 
> (still in migration mode) do not respect the controller state in ZK, and do 
> not report changes in the ISR of the partitions they are leading.
> This means that if a replica becomes out of sync in the last restart (e.g. 
> due to a slow broker restart), we cannot restart the brokers while ensuring 
> zero downtime.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
