Kafka services are unstable in 3 node cluster

Sravani Wed, 12 Mar 2025 06:51:27 -0700

Hi Team,

We are facing an issue where Kafka services are unstable in a 3-node cluster
setup. The logs show leader election repeatedly failing and restarting.
The logs are below.


Please help us to understand the issue.

Feb 27 22:19:33 localhost kafka[835188]: [2025-02-27 20:19:33,344]
INFO [RaftManager
id=2] Did not receive fetch request from the majority of the voters within
3000ms. Current fetched voters are []. (org.apache.kafka.raft.LeaderState)
Feb 27 22:19:33 localhost kafka[835188]: [2025-02-27 20:19:33,345]
INFO [RaftManager
id=2] Completed transition to ResignedState(localId=2, epoch=29, voters=[1,
2, 3], electionTimeoutMs=1122, unackedVoters=[1, 3], preferredSuccessors=[1,
3]) from Leader(localId=2, epoch=29, epochStartOffset=39221,
highWatermark=Optional.empty, voterStates={1=ReplicaState(nodeId=1,
endOffset=Optional.empty, lastFetchTimestamp=-1, lastCaughtUpTimestamp=-1,
hasAcknowledgedLeader=true), 2=ReplicaState(nodeId=2,
endOffset=Optional[LogOffsetMetadata(offset=39222, metadata=Optional
[(segmentBaseOffset=0,relativePositionInSegment=2821418)])],
lastFetchTimestamp=-1, lastCaughtUpTimestamp=-1,
hasAcknowledgedLeader=true), 3=ReplicaState(nodeId=3,
endOffset=Optional.empty, lastFetchTimestamp=-1, lastCaughtUpTimestamp=-1,
hasAcknowledgedLeader=false)}) (org.apache.kafka.raft.QuorumState)
Feb 27 22:19:33 localhost kafka[835188]: [2025-02-27 20:19:33,352]
INFO [RaftManager
id=2] Completed transition to Unattached(epoch=30, voters=[1, 2, 3],
electionTimeoutMs=1399) from ResignedState(localId=2, epoch=29, voters=[1,
2, 3], electionTimeoutMs=1122, unackedVoters=[1, 3], preferredSuccessors=[1,
3]) (org.apache.kafka.raft.QuorumState)
Feb 27 22:19:33 localhost kafka[835188]: [2025-02-27 20:19:33,359]
INFO [MetadataLoader
id=2] initializeNewPublishers: the loader is still catching up because we
still don't know the high water mark yet.
(org.apache.kafka.image.loader.MetadataLoader)
Feb 27 22:20:13 localhost kafka[835188]: [2025-02-27 20:20:13,623]
INFO [RaftManager
id=2] Election has timed out, backing off for 200ms before becoming a
candidate again (org.apache.kafka.raft.KafkaRaftClient)
Feb 27 22:20:29 localhost kafka[835188]: [2025-02-27 20:20:29,005]
ERROR [BrokerServer
id=2] Fatal error during broker startup. Prepare to shutdown
(kafka.server.BrokerServer)
Feb 27 22:20:29 localhost kafka[835188]: java.lang.RuntimeException:
Received a fatal error while waiting for the controller to acknowledge that
we are caught up
Feb 27 22:20:29 localhost kafka[835188]: #011at
org.apache.kafka.server.util.FutureUtils.waitWithLogging(FutureUtils.java:72)
Feb 27 22:20:29 localhost kafka[835188]: #011at
kafka.server.BrokerServer.startup(BrokerServer.scala:508)
Feb 27 22:20:29 localhost kafka[835188]: #011at
kafka.server.KafkaRaftServer.$anonfun$startup$2(KafkaRaftServer.scala:97)
Feb 27 22:20:29 localhost kafka[835188]: #011at
kafka.server.KafkaRaftServer.$anonfun$startup$2$adapted(KafkaRaftServer.scala:97)
Feb 27 22:20:29 localhost kafka[835188]: #011at
scala.Option.foreach(Option.scala:437)
Feb 27 22:20:29 localhost kafka[835188]: #011at
kafka.server.KafkaRaftServer.startup(KafkaRaftServer.scala:97)
Feb 27 22:20:29 localhost kafka[835188]: #011at
kafka.Kafka$.main(Kafka.scala:112)
Feb 27 22:20:29 localhost kafka[835188]: #011at
kafka.Kafka.main(Kafka.scala)
Feb 27 22:20:29 localhost kafka[835188]: Caused by:
java.util.concurrent.CancellationException
Feb 27 22:20:29 localhost kafka[835188]: #011at
java.base/java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2478)
Feb 27 22:20:29 localhost kafka[835188]: #011at
kafka.server.BrokerLifecycleManager$ShutdownEvent.run(BrokerLifecycleManager.scala:638)
Feb 27 22:20:29 localhost kafka[835188]: #011at
org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:190)
Feb 27 22:20:29 localhost kafka[835188]: #011at
java.base/java.lang.Thread.run(Thread.java:840)
Feb 27 22:20:29 localhost kafka[835188]: [2025-02-27 20:20:29,006]
INFO [BrokerServer
id=2] Transition from STARTED to SHUTTING_DOWN (kafka.server.BrokerServer)
Feb 27 22:20:29 localhost kafka[835188]: [2025-02-27 20:20:29,010] ERROR
Cannot invoke "java.nio.channels.ServerSocketChannel.close()" because the
return value of "kafka.network.Acceptor.serverChannel()" is null
(kafka.network.DataPlaneAcceptor)
Feb 27 22:20:29 localhost kafka[835188]: java.lang.NullPointerException:
Cannot invoke "java.nio.channels.ServerSocketChannel.close()" because the
return value of "kafka.network.Acceptor.serverChannel()" is null
Feb 27 22:20:29 localhost kafka[835188]: #011at
kafka.network.Acceptor.$anonfun$closeAll$2(SocketServer.scala:712)
Feb 27 22:20:29 localhost kafka[835188]: #011at
kafka.utils.CoreUtils$.swallow(CoreUtils.scala:68)
Feb 27 22:20:29 localhost kafka[835188]: #011at
kafka.network.Acceptor.closeAll(SocketServer.scala:712)
Feb 27 22:20:29 localhost kafka[835188]: #011at
kafka.network.Acceptor.close(SocketServer.scala:679)
Feb 27 22:20:29 localhost kafka[835188]: #011at
kafka.network.SocketServer.$anonfun$stopProcessingRequests$4(Socket
Feb 27 22:22:44 localhost kafka[841117]: [2025-02-27 20:22:44,151]
WARN [RaftManager
id=2] Graceful shutdown timed out after 5000ms
(org.apache.kafka.raft.KafkaRaftClient)
Feb 27 22:22:44 localhost kafka[841117]: [2025-02-27 20:22:44,151]
ERROR [RaftManager
id=2] Graceful shutdown of RaftClient failed
(org.apache.kafka.raft.KafkaRaftClientDriver)
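
To make the sequence of quorum state changes in the excerpt above easier to follow, here is a minimal sketch that pulls out the "Completed transition to X(...) from Y(...)" pairs. It is only an illustration: the function name raft_transitions and the regular expression are hypothetical helpers written for this message and are not part of Kafka. It assumes the log text is passed in as a single string, and it collapses whitespace first so that entries wrapped across lines still match.

```python
import re

# Hypothetical pattern for the QuorumState "Completed transition" entries
# seen in the excerpt; not an official Kafka log grammar.
TRANSITION = re.compile(r"Completed transition to (\w+)\(.*? from (\w+)\(")


def raft_transitions(text):
    """Return (old_state, new_state) pairs in order of appearance."""
    # Collapse newlines and repeated spaces so wrapped entries become one line.
    flat = " ".join(text.split())
    return [(m.group(2), m.group(1)) for m in TRANSITION.finditer(flat)]


if __name__ == "__main__":
    import sys

    # Usage: python transitions.py < kafka.log
    for old, new in raft_transitions(sys.stdin.read()):
        print(f"{old} -> {new}")
```

Run over the excerpt above, this should report Leader -> ResignedState followed by ResignedState -> Unattached, which matches node 2 resigning leadership at epoch 29 and moving to epoch 30.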

Please prioritise this issue and let us know.

We have already created a ticket:
https://issues.apache.org/jira/browse/KAFKA-18958


Thanks,

Sravani
