Hello! We have a 3-broker Kafka cluster running in KRaft mode, with the brokers and KRaft controllers
on the same nodes.

CPU: 16
RAM: 32GB

We have 2241 topics and 107262 online partitions, with 23652 client connections.
The Kafka version is 3.6.1.
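
For context, the topic/partition figures above come from cluster metadata. Here is a rough sketch of how we pull them, assuming the confluent-kafka Python client and placeholder hostnames (not our real addresses or tooling):

from collections import Counter
from confluent_kafka.admin import AdminClient

# Placeholder bootstrap address; substitute a real broker of the cluster.
admin = AdminClient({"bootstrap.servers": "broker1:9092"})
md = admin.list_topics(timeout=10)          # full cluster metadata snapshot

leaders = Counter()
under_replicated = 0
total_partitions = 0
for topic in md.topics.values():
    for p in topic.partitions.values():
        total_partitions += 1
        leaders[p.leader] += 1
        # Partitions whose current ISR is smaller than the replica set
        if len(p.isrs) < len(p.replicas):
            under_replicated += 1

print(f"brokers={len(md.brokers)} topics={len(md.topics)} partitions={total_partitions}")
print(f"leader count per broker: {dict(leaders)}")
print(f"under-replicated partitions: {under_replicated}")

Normally the under-replicated count here is zero; the ISR-shrink messages in the logs below show that it was not during the incident.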

Yesterday we had trouble from 12:08 to 12:11.
All brokers logged many messages indicating inter-node connection problems.

Here are the logs from the 1st broker:

438]: [2024-11-29 13:08:35,199] INFO [Partition coication.in-42 broker=0] Shrinking ISR from 1,0,2 to 0,2. Leader: (highWatermark
438]: [2024-11-29 13:08:35,207] INFO [Partition cldkafka.out-24 broker=0] Shrinking ISR from 2,1,0 to 0. Leader: (highWatermark: 655805, endOffset:

438]: [2024-11-29 13:08:45,244] INFO [Partition communication.notificationmanager.sendnotification.in-42 broker=0] ISR updated to 0,2 and version updated to 57
438]: [2024-11-29 13:08:45,273] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions Set(communication.notificationmanager.sendnotification.in-
438]: [2024-11-29 13:08:45,273] INFO [Partition afka.cliaqtokafka.out-24 broker=0] ISR updated to 0 (under-min-isr) and version updated to 59 (kafk
438]: [2024-11-29 13:08:45,460] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions Set(colvir.cliaqtokafka.cliaqtokafka.out-24) (kafka.server
438]: [2024-11-29 13:08:53,352] INFO [GroupCoordinator 0]: Member consumer 143-52312bed-6b44-41de-b717-9ab7c86fdaa in group MIB3.0_PROD has failed, removing it from the group
438]: [2024-11-29 13:08:53,352] INFO [GroupCoordinator 0]: Preparing to rebalance group MIB3.0_PROD in state PreparingRebalance with old generation 2845 (__consum
438]: [2024-11-29 13:08:53,352] INFO [GroupCoordinator 0]: Group hi with generation 2846 is now empty (__consumer_offsets-5) (kafka.coordinator.group.Gro


438]: [2024-11-29 13:08:53,718] INFO [Partition __consumer_offsets-5 broker=0] Shrinking ISR from 0,2,1 to 0. Leader: (highWatermark: 32463865, endOffset: 32463866). Out of
438]: [2024-11-29 13:08:53,747] INFO [Partition __consumer_offsets-5 broker=0] ISR updated to 0 (under-min-isr) and version updated to 137 (kafka.cluster.Partition)
438]: [2024-11-29 13:08:53,756] WARN [GroupCoordinator 0]: Failed to write empty metadata for group hi: The coordinator is not available. (kafka.coordinator.group
438]: [2024-11-29 13:08:53,964] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions Set(__consumer_offsets-5) (kafka.server.ReplicaFetcherManager)
438]: [2024-11-29 13:08:53,964] INFO [GroupCoordinator 0]: Elected as the group coordinator for partition 5 in epoch 12 (kafka.coordinator.group.GroupCoordinator)
438]: [2024-11-29 13:08:53,964] INFO [GroupMetadataManager brokerId=0] Scheduling loading of offsets and group metadata from __consumer_offsets-5 for epoch 12 (kafka.coord
438]: [2024-11-29 13:08:53,965] INFO [GroupMetadataManager brokerId=0] Already loading offsets and group metadata from __consumer_offsets-5 (kafka.coordinator.group.Group
438]: [2024-11-29 13:09:03,047] INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Disconnecting from node 1 due to request timeout. (org.apache.kafka.clients.Netw
438]: [2024-11-29 13:09:03,047] INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Cancelled in-flight FETCH request with correlation id 34677641 due to node 1 being disconnected
438]: [2024-11-29 13:09:03,047] INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Client requested connection close from node 1 (org.apache.kafka.clients.NetworkC
438]: [2024-11-29 13:09:03,048] INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Error sending fetch request (sessionId=660052978, epoch=34677641) to node 1: (org
438]: java.io.IOException: Connection to 1 was disconnected before the response was read
438]:     at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:99)
438]:     at kafka.server.BrokerBlockingSender.sendRequest(BrokerBlockingSender.scala:113)
438]:     at kafka.server.RemoteLeaderEndPoint.fetch(RemoteLeaderEndPoint.scala:79)
438]:     at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:316)
438]:     at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:130)
438]:     at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:129)
438]:     at scala.Option.foreach(Option.scala:437)
438]:     at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
438]:     at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:112)
438]:     at kafka.server.ReplicaFetcherThread.doWork(ReplicaFetcherThread.scala:98)
438]:     at org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:130)
438]: [2024-11-29 13:09:03,050] WARN [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=0, maxWait=2005
438]: java.io.IOException: Connection to 1 was disconnected before the response was read
438]:     at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:99)
438]:     at kafka.server.BrokerBlockingSender.sendRequest(BrokerBlockingSender.scala:113)
438]:     at kafka.server.RemoteLeaderEndPoint.fetch(RemoteLeaderEndPoint.scala:79)
438]:     at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:316)
438]:     at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:130)
438]:     at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:129)
438]:     at scala.Option.foreach(Option.scala:437)
438]:     at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
438]:     at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:112)
438]:     at kafka.server.ReplicaFetcherThread.doWork(ReplicaFetcherThread.scala:98)
438]:     at org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:130)
438]: [2024-11-29 13:09:05,589] INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Disconnecting from node 2 due to request timeout. (org.apache.kafka.clients.Netw
438]: [2024-11-29 13:09:05,589] INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Cancelled in-flight FETCH request with correlation id 38762087 due to node 2 being disconnected
438]: [2024-11-29 13:09:05,589] INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Client requested connection close from node 2 (org.apache.kafka.clients.NetworkC
438]: [2024-11-29 13:09:05,590] INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Error sending fetch request (sessionId=1014113766, epoch=38762087) to node 2: (org
438]: java.io.IOException: Connection to 2 was disconnected before the response was read
438]:     at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:99)



The same logs appeared on the remaining 2 broker nodes, so it seems the Kafka brokers lost
connectivity between each other, yet SSH and other traffic to/from the nodes kept working,
and on the network side there were no problems. What else could have caused the loss of
connection between all the brokers? Kafka itself kept working, we didn't restart it, and the
problem resolved itself.
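
To rule the network in or out at the TCP level, we can do plain socket connects to the Kafka listener ports from each node (SSH working only proves port 22 is fine). A minimal sketch, assuming the broker/controller listeners are on 9092/9093 and using placeholder hostnames, neither of which are necessarily our real values:

import socket

NODES = ["broker1", "broker2", "broker3"]   # placeholder hostnames
PORTS = [9092, 9093]                        # assumed broker / controller listener ports

for host in NODES:
    for port in PORTS:
        try:
            # Only checks TCP reachability, not whether the broker answers requests in time
            with socket.create_connection((host, port), timeout=5):
                print(f"{host}:{port} TCP connect OK")
        except OSError as e:
            print(f"{host}:{port} FAILED: {e}")

If connects like these succeed while the brokers still hit request timeouts like the ones above, that would point away from basic connectivity and toward the brokers being slow to answer, which is exactly what we are trying to understand.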

There is no high load on the brokers in terms of CPU, RAM, or I/O, but the first broker has a
load average twice as high as the other 2 nodes (and the first broker is not the KRaft leader!).
Still, I don't see any problems related to Kafka/KRaft in the logs, only connection issues.




