[ https://issues.apache.org/jira/browse/KAFKA-19586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tal Asulin updated KAFKA-19586:
-------------------------------
    Description: 
After upgrading our production Kafka clusters to *Kafka 3.9.0* with *KRaft mode 
enabled*, we started observing behavior during rolling restarts of broker nodes 
that we had never seen before.

When a broker is {*}gracefully shut down by the KRaft controller{*}, it 
immediately restarts. Shortly afterward, while it is busy {*}replicating 
missing data{*}, the broker suddenly {*}freezes for approximately 20–50 
seconds{*}. During this time, it produces {*}no logs, no metrics{*}, and *no 
heartbeat messages* to the controllers (see the timestamps below).

 
{code:java}
2025-07-30 09:21:27,224 INFO [Broker id=8] Skipped the become-follower state 
change for my-topic-215 with topic id Some(yO4CQayIRbyESrHHVPdOrQ) and 
partition state LeaderAndIsrPartitionState(topicName='my-topic', 
partitionIndex=215, controllerEpoch=-1, leader=4, leaderEpoch=54, isr=[4, 8], 
partitionEpoch=102, replicas=[4, 8], addingReplicas=[], removingReplicas=[], 
isNew=false, leaderRecoveryState=0) since it is already a follower with leader 
epoch 54. (state.change.logger) [kafka-8-metadata-loader-event-handler]

2025-07-30 09:21:51,887 INFO [Broker id=8] Transitioning 427 partition(s) to 
local followers. (state.change.logger) [kafka-8-metadata-loader-event-handler]  
     {code}
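Note the roughly 24-second gap between the two log lines above; during that 
window the process emits nothing at all. If it would help triage, we can capture 
periodic thread dumps across the next rolling restart so that a few samples fall 
inside the silent window. A rough sketch of what we have in mind (the pid 
lookup, paths and interval below are illustrative, not something we run today):
{code:bash}
# Sample the broker's threads once per second for two minutes around the
# restart, so that several dumps land inside the 20-50 s silent window.
BROKER_PID=$(pgrep -f kafka.Kafka | head -n 1)
for i in $(seq 1 120); do
  # If the JVM is stuck in a long safepoint, jcmd itself will block here,
  # which would also be a useful signal.
  timeout 5 jcmd "$BROKER_PID" Thread.print > "/tmp/broker-threads-$(date +%H%M%S).txt" 2>&1
  sleep 1
done
{code}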
 

After this “hanging” period, the broker *resumes normal operation* without 
emitting any error or warning messages — as if nothing happened. However, 
during this gap, because the broker fails to send heartbeats to the KRaft 
controllers, it gets *fenced out of the cluster* ([9s 
timeout|https://kafka.apache.org/documentation/#brokerconfigs_broker.session.timeout.ms]),
 which leads to {*}partitions going offline and, ultimately, data loss{*}.
{code:java}
2025-07-30 09:21:40,325 INFO [QuorumController id=300] Fencing broker 8 because 
its session has timed out. 
(org.apache.kafka.controller.ReplicationControlManager) 
[quorum-controller-300-event-handler]{code}
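For context, the fencing window is governed by the KRaft heartbeat settings 
below (defaults shown, matching the 9s timeout linked above). Widening 
broker.session.timeout.ms is only a workaround we are considering so the broker 
survives the stall without being fenced; it would not fix the stall itself:
{code}
# Broker sends a heartbeat to the active controller every 2 s (default).
broker.heartbeat.interval.ms=2000
# The controller fences a broker whose heartbeats stop for 9 s (default);
# this is the timeout that fenced broker 8 in the log above.
broker.session.timeout.ms=9000
# Possible workaround (illustrative value only): make the session outlast
# the observed 20-50 s freeze.
# broker.session.timeout.ms=60000
{code}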
 

I haven’t been able to reproduce or simulate this behavior in a test 
environment. However, I can confirm that it *consistently occurs in our 
production clusters* every time we perform a rolling restart of the brokers.

Our primary suspicion is that the Kafka broker process might be *“choking” 
under the load*: it is replicating a large amount of data while also taking 
over leadership for many partitions, and this may cause the JVM to stall (for 
example, in a long GC or safepoint pause), leading to the observed freeze.

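To confirm or rule out a JVM-level pause (a long GC or safepoint) during the 
silent window, we plan to enable unified JVM logging on the brokers for the 
next rolling restart. A sketch of the flags, assuming the stock start scripts 
that append KAFKA_OPTS to the broker JVM (the log path and rotation settings 
are placeholders):
{code:bash}
# JDK 11+ unified logging: record GC and safepoint activity with wall-clock
# timestamps so any 20-50 s pause shows up explicitly in the log.
export KAFKA_OPTS="$KAFKA_OPTS -Xlog:gc*,safepoint:file=/var/log/kafka/gc-safepoint.log:time,uptime,level,tags:filecount=5,filesize=50m"
{code}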

> Kafka broker freezes and gets fenced during rolling restart with KRaft mode
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-19586
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19586
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 3.9.0, 3.9.1, 4.0.0
>         Environment: Running Kafka over Kubernetes
> Kafka 3.9.0
>            Reporter: Tal Asulin
>            Priority: Blocker
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
