[ https://issues.apache.org/jira/browse/KAFKA-19586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tal Asulin updated KAFKA-19586:
-------------------------------

Description:

After upgrading our Kafka clusters to *Kafka 3.9.0* with *KRaft mode enabled* in production, we started observing strange behavior during rolling restarts of broker nodes, something we had never seen before.

When a broker is {*}gracefully shut down by the KRaft controller{*}, it restarts immediately. Shortly afterward, while it is busy {*}replicating missing data{*}, the broker suddenly {*}freezes for approximately 20–50 seconds{*}. During this time it produces {*}no logs, no metrics{*}, and *no heartbeat messages* to the controllers (note the roughly 25-second gap between the two log lines below).

{code:java}
2025-07-30 09:21:27,224 INFO [Broker id=8] Skipped the become-follower state change for my-topic-215 with topic id Some(yO4CQayIRbyESrHHVPdOrQ) and partition state LeaderAndIsrPartitionState(topicName='my-topic', partitionIndex=215, controllerEpoch=-1, leader=4, leaderEpoch=54, isr=[4, 8], partitionEpoch=102, replicas=[4, 8], addingReplicas=[], removingReplicas=[], isNew=false, leaderRecoveryState=0) since it is already a follower with leader epoch 54. (state.change.logger) [kafka-8-metadata-loader-event-handler]
2025-07-30 09:21:51,887 INFO [Broker id=8] Transitioning 427 partition(s) to local followers. (state.change.logger) [kafka-8-metadata-loader-event-handler]
{code}

After this “hanging” period the broker *resumes normal operation* without emitting any error or warning messages, as if nothing had happened. However, because the broker sends no heartbeats to the KRaft controllers during this gap, it exceeds the [9-second session timeout|https://kafka.apache.org/documentation/#brokerconfigs_broker.session.timeout.ms] and is *fenced out of the cluster*, which leads to {*}partitions going offline and, ultimately, data loss{*}.

{code:java}
2025-07-30 09:21:40,325 INFO [QuorumController id=300] Fencing broker 8 because its session has timed out. (org.apache.kafka.controller.ReplicationControlManager) [quorum-controller-300-event-handler]
{code}

I have not been able to reproduce or simulate this behavior in a test environment, but I can confirm that it *consistently occurs in our production clusters* every time we perform a rolling restart of the brokers.

Our primary suspicion is that the broker process is *“choking” under the load*: it is replicating a large amount of data while simultaneously taking over leadership for many partitions, which may cause the JVM to stall and produce the observed freeze.
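To check the JVM-stall theory, one option is to enable GC and safepoint logging on the brokers and look for a pause that lines up with the 20–50 second gap. A minimal sketch, assuming JDK 11+ unified logging and the stock {{kafka-run-class.sh}} hook for {{KAFKA_JVM_PERFORMANCE_OPTS}}; the log path and rotation settings are only examples:

{code}
# Assumption: JDK 11+ unified logging; file path and rotation values are examples.
# kafka-run-class.sh passes KAFKA_JVM_PERFORMANCE_OPTS to the broker JVM, so this
# timestamps every GC and safepoint pause and lets them be correlated with the
# silent 20-50 s window in the broker log.
export KAFKA_JVM_PERFORMANCE_OPTS="$KAFKA_JVM_PERFORMANCE_OPTS \
  -Xlog:gc*,safepoint:file=/var/log/kafka/gc.log:time,uptime,level,tags:filecount=5,filesize=50M"
{code}

If the pause shows up as one long safepoint rather than a GC cycle, a thread dump taken during the freeze (for example with {{jstack}}) would help narrow down what the JVM is blocked on.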
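Separately, as a stop-gap while the root cause is investigated, the fencing itself could probably be pushed out by lengthening the broker session relative to the observed freeze. A sketch of the relevant settings; the values are illustrative, and since {{broker.session.timeout.ms}} is evaluated by the KRaft controller it may need to be set on the controller nodes as well (verify against the docs for your version):

{code}
# Sketch only; values are illustrative, not a recommendation.
# broker.heartbeat.interval.ms: how often the broker heartbeats to the controller (default 2000 ms).
# broker.session.timeout.ms: how long a broker lease lasts without a heartbeat before the
# controller fences the broker (default 9000 ms). Raising it above the observed 20-50 s
# freeze keeps the broker registered during the stall, at the cost of slower detection of
# genuinely dead brokers.
broker.heartbeat.interval.ms=2000
broker.session.timeout.ms=60000
{code}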
> Kafka broker freezes and gets fenced during rolling restart with KRaft mode
> ----------------------------------------------------------------------------
>
>                  Key: KAFKA-19586
>                  URL: https://issues.apache.org/jira/browse/KAFKA-19586
>              Project: Kafka
>           Issue Type: Bug
>           Components: core
>     Affects Versions: 3.9.0, 3.9.1, 4.0.0
>          Environment: Running Kafka over Kubernetes
>                       Kafka 3.9.0
>             Reporter: Tal Asulin
>             Priority: Blocker
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)