Anthony Lazam created KAFKA-7876: ------------------------------------ Summary: Broker suddenly got disconnected Key: KAFKA-7876 URL: https://issues.apache.org/jira/browse/KAFKA-7876 Project: Kafka Issue Type: Bug Components: controller, network Affects Versions: 2.1.0 Reporter: Anthony Lazam Attachments: kafka-issue.png
We have 3 node cluster setup. There are scenarios that one of the broker suddenly got disconnected from the cluster but no underlying system issue is found. The node that got dc'ed wasn't able to release the partition it holds as the leader, hence clients (spring-boot) was unable to send/receive data from the issued broker. We noticed that it always happen to the active controller count. Environment details: Provider: AWS Kernel: 3.10.0-693.21.1.el7.x86_64 OS: CentOS Linux release 7.5.1804 (Core) Scala version: 2.11 Kafka version: 2.1.0 Kafka config: {code:java} ############################# Socket Server Settings ############################# num.network.threads=3 num.io.threads=8 socket.send.buffer.bytes=102400 socket.receive.buffer.bytes=102400 socket.request.max.bytes=104857600 ############################# Log Basics ############################# num.partitions=1 num.recovery.threads.per.data.dir=1 ############################# Internal Topic Settings ############################# offsets.topic.replication.factor=3 transaction.state.log.replication.factor=3 transaction.state.log.min.isr=2 ############################# Log Retention Policy ############################# log.retention.hours=168 log.segment.bytes=1073741824 log.retention.check.interval.ms=300000 ############################# Group Coordinator Settings ############################# group.initial.rebalance.delay.ms=0 ############################# Zookeeper ############################# zookeeper.connection.timeout.ms=6000 broker.id=1 zookeeper.connect=zk1:2181,zk2:2181,zk3:2181 log.dirs=/data/kafka-node advertised.listeners=PLAINTEXT://node1:9092 {code} Broker disconnected controller log: {code:java} [2019-01-26 05:03:52,512] TRACE [Controller id=2] Checking need to trigger auto leader balancing (kafka.controller.KafkaController) [2019-01-26 05:03:52,513] DEBUG [Controller id=2] Preferred replicas by broker Map(TOPICS->MAP) (kafka.controller.KafkaController) [2019-01-26 05:03:52,513] DEBUG [Controller id=2] Topics not in preferred replica for broker 2 Map() (kafka.controller.KafkaController) [2019-01-26 05:03:52,513] TRACE [Controller id=2] Leader imbalance ratio for broker 2 is 0.0 (kafka.controller.KafkaController) [2019-01-26 05:03:52,513] DEBUG [Controller id=2] Topics not in preferred replica for broker 1 Map() (kafka.controller.KafkaController) [2019-01-26 05:03:52,513] TRACE [Controller id=2] Leader imbalance ratio for broker 1 is 0.0 (kafka.controller.KafkaController) [2019-01-26 05:03:52,513] DEBUG [Controller id=2] Topics not in preferred replica for broker 3 Map() (kafka.controller.KafkaController) [2019-01-26 05:03:52,513] TRACE [Controller id=2] Leader imbalance ratio for broker 3 is 0.0 (kafka.controller.KafkaController) [2019-01-26 05:08:52,513] TRACE [Controller id=2] Checking need to trigger auto leader balancing (kafka.controller.KafkaController) {code} Broker working server.log: {code:java} [2019-01-26 05:02:05,564] INFO [ReplicaFetcher replicaId=3, leaderId=2, fetcherId=0] Error sending fetch request (sessionId=1637095899, epoch=21379644) to node 2: java.io.IOException: Connection to 2 was disconnected before the response was read. (org.apache.kafka.clients.FetchSessionHandler) [2019-01-26 05:02:05,573] WARN [ReplicaFetcher replicaId=3, leaderId=2, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=3, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={PlayerGameRounds-8=(offset=2171960, logStartOffset=1483356, maxBytes=1048576, currentLeaderEpoch=Optional[2])}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessio nId=1637095899, epoch=21379644)) (kafka.server.ReplicaFetcherThread) java.io.IOException: Connection to 2 was disconnected before the response was read at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97) at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97) at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190) at kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241) at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130) at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129) at scala.Option.foreach(Option.scala:257) at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129) at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) [2019-01-26 05:02:35,723] WARN Attempting to send response via channel for which there is no open connection, connection id node3:9092-node2:59988-1550 (kafka.network.Processor) [2019-01-26 05:02:35,731] WARN Attempting to send response via channel for which there is no open connection, connection id node3:9092-node2:59986-1550 (kafka.network.Processor) [2019-01-26 05:02:35,797] WARN Attempting to send response via channel for which there is no open connection, connection id node3:9092-node2:59494-1549 (kafka.network.Processor) [2019-01-26 05:02:35,816] WARN Attempting to send response via channel for which there is no open connection, connection id node3:9092-node2:53268-1530 (kafka.network.Processor) [2019-01-26 05:02:37,603] INFO [ReplicaFetcher replicaId=3, leaderId=2, fetcherId=0] Error sending fetch request (sessionId=1637095899, epoch=INITIAL) to node 2: java.io.IOException: Connection to 2 was disconnected before the response was read. (org.apache.kafka.clients.FetchSessionHandler) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)