[ https://issues.apache.org/jira/browse/KAFKA-3916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dave Powell updated KAFKA-3916:
-------------------------------
Description:

We recently upgraded from 0.8.2.1 to 0.9.0.1. Since then, several times per day, the controllers in our clusters have their connection to all brokers disconnected, and then successfully reconnected a few hundred ms later. Each time this occurs we see a brief spike in our 99th percentile produce and consume times, reaching several hundred ms.

Here is an example of what we're seeing in the controller.log:

{{[2016-06-28 14:15:35,416] WARN [Controller-151-to-broker-160-send-thread], Controller 151 epoch 106 fails to send request {…} to broker Node(160, broker.160.hostname, 9092). Reconnecting to broker. (kafka.controller.RequestSendThread)
java.io.IOException: Connection to 160 was disconnected before the response was read
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:87)
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:84)
    at scala.Option.foreach(Option.scala:236)
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:84)
    at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:80)
    at kafka.utils.NetworkClientBlockingOps$.recurse$1(NetworkClientBlockingOps.scala:129)
    at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollUntilFound$extension(NetworkClientBlockingOps.scala:139)
    at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:80)
    at kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:180)
    at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:171)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
...
one each for all brokers (including the controller) ...
[2016-06-28 14:15:35,721] INFO [Controller-151-to-broker-160-send-thread], Controller 151 connected to Node(160, broker.160.hostname, 9092) for sending state change requests (kafka.controller.RequestSendThread)
… one each for all brokers (including the controller) ...}}


> Connection from controller to broker disconnects
> ------------------------------------------------
>
>                 Key: KAFKA-3916
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3916
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.9.0.1
>            Reporter: Dave Powell
>
> We recently upgraded from 0.8.2.1 to 0.9.0.1. Since then, several times per day, the controllers in our clusters have their connection to all brokers disconnected, and then successfully reconnected a few hundred ms later. Each time this occurs we see a brief spike in our 99th percentile produce and consume times, reaching several hundred ms.
> Here is an example of what we're seeing in the controller.log:
> {{[2016-06-28 14:15:35,416] WARN [Controller-151-to-broker-160-send-thread], Controller 151 epoch 106 fails to send request {…} to broker Node(160, broker.160.hostname, 9092). Reconnecting to broker.
> (kafka.controller.RequestSendThread)
> java.io.IOException: Connection to 160 was disconnected before the response was read
> at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:87)
> at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:84)
> at scala.Option.foreach(Option.scala:236)
> at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:84)
> at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:80)
> at kafka.utils.NetworkClientBlockingOps$.recurse$1(NetworkClientBlockingOps.scala:129)
> at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollUntilFound$extension(NetworkClientBlockingOps.scala:139)
> at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:80)
> at kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:180)
> at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:171)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
> ... one each for all brokers (including the controller) ...
> [2016-06-28 14:15:35,721] INFO [Controller-151-to-broker-160-send-thread], Controller 151 connected to Node(160, broker.160.hostname, 9092) for sending state change requests (kafka.controller.RequestSendThread)
> … one each for all brokers (including the controller) ...}}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
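The two log lines quoted in the report bound the reconnect window: the WARN at 14:15:35,416 marks the disconnect and the INFO at 14:15:35,721 the reconnect, a gap of roughly 305 ms, consistent with the reported "a few hundred ms". A quick sketch (plain Python, timestamps copied from the excerpt; not part of Kafka) computes that gap from the log4j-style timestamps:

```python
from datetime import datetime

# Timestamps copied verbatim from the controller.log excerpt above.
disconnect = "2016-06-28 14:15:35,416"  # WARN: connection to broker disconnected
reconnect = "2016-06-28 14:15:35,721"   # INFO: controller reconnected

# Log4j default pattern uses a comma before the milliseconds; %f accepts it
# as a fractional-seconds field.
fmt = "%Y-%m-%d %H:%M:%S,%f"
gap_ms = (datetime.strptime(reconnect, fmt)
          - datetime.strptime(disconnect, fmt)).total_seconds() * 1000
print(f"controller reconnect gap: {gap_ms:.0f} ms")  # prints 305 ms
```

The same parsing could be run over a full controller.log to count how many such WARN/INFO pairs occur per day and how long each outage lasts.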