Hi all. I'm running a two-node cluster that has been rock solid for almost a year and a half. We experienced an outage of one of the two brokers this morning, and from the logs I'm not sure what happened or how to prevent it.
The Kafka version is 0.8.1.1 with Scala 2.10. The Java version is OpenJDK 1.8.0_65.

Everything was running fine, then:

[2016-04-13 11:01:28,306] WARN Reconnect due to socket error: Received -1 when reading from channel, socket has likely been closed. (kafka.consumer.SimpleConsumer)
[2016-04-13 11:01:28,306] WARN Reconnect due to socket error: Received -1 when reading from channel, socket has likely been closed. (kafka.consumer.SimpleConsumer)
[2016-04-13 11:01:28,306] WARN Reconnect due to socket error: Received -1 when reading from channel, socket has likely been closed. (kafka.consumer.SimpleConsumer)
[2016-04-13 11:01:28,306] WARN Reconnect due to socket error: Received -1 when reading from channel, socket has likely been closed. (kafka.consumer.SimpleConsumer)
[2016-04-13 11:01:28,334] WARN Reconnect due to socket error: Received -1 when reading from channel, socket has likely been closed. (kafka.consumer.SimpleConsumer)
[2016-04-13 11:01:28,334] WARN Reconnect due to socket error: Received -1 when reading from channel, socket has likely been closed. (kafka.consumer.SimpleConsumer)
[2016-04-13 11:01:28,334] WARN Reconnect due to socket error: Received -1 when reading from channel, socket has likely been closed. (kafka.consumer.SimpleConsumer)
[2016-04-13 11:01:28,334] WARN Reconnect due to socket error: Received -1 when reading from channel, socket has likely been closed. (kafka.consumer.SimpleConsumer)
[2016-04-13 11:01:28,352] ERROR [ReplicaFetcherThread-1-0], Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 9644043; ClientId: ReplicaFetcherThread-1-0; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [snip of every topic and partition on the broker listed here]
java.net.ConnectException: Connection refused
    at sun.nio.ch.Net.connect0(Native Method)
    at sun.nio.ch.Net.connect(Net.java:454)
    at sun.nio.ch.Net.connect(Net.java:446)
    at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
    at kafka.network.BlockingChannel.connect(BlockingChannel.scala:57)
    at kafka.consumer.SimpleConsumer.connect(SimpleConsumer.scala:44)
    at kafka.consumer.SimpleConsumer.reconnect(SimpleConsumer.scala:57)
    at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:79)
    at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:71)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:109)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:108)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
    at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
    at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:96)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:88)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)

The log then repeats that ERROR and exception 5406 times between 2016-04-13 11:01:28,352 and 2016-04-13 11:01:31,994.

Then I get this message twice:

[2016-04-13 11:01:31,997] INFO [ReplicaFetcherManager on broker 1] Removed fetcher for partitions [snip list of all my topics and partitions listed]

Then this:

[2016-04-13 11:01:32,061] INFO [ReplicaFetcherThread-1-0], Shutting down (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,061] INFO [ReplicaFetcherThread-1-0], Shutting down (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,113] INFO New leader is 1 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
[2016-04-13 11:01:32,113] INFO New leader is 1 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
[2016-04-13 11:01:32,192] INFO [ReplicaFetcherThread-1-0], Shutdown completed (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,192] INFO [ReplicaFetcherThread-1-0], Shutdown completed (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,192] INFO [ReplicaFetcherThread-1-0], Stopped (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,192] INFO [ReplicaFetcherThread-1-0], Stopped (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,192] INFO [ReplicaFetcherThread-0-0], Shutting down (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,192] INFO [ReplicaFetcherThread-0-0], Shutting down (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,194] INFO [ReplicaFetcherThread-0-0], Stopped (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,194] INFO [ReplicaFetcherThread-0-0], Stopped (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,194] INFO [ReplicaFetcherThread-0-0], Shutdown completed (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,194] INFO [ReplicaFetcherThread-0-0], Shutdown completed (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,194] INFO [ReplicaFetcherThread-3-0], Shutting down (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,194] INFO [ReplicaFetcherThread-3-0], Shutting down (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,392] INFO [ReplicaFetcherThread-3-0], Stopped (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,392] INFO [ReplicaFetcherThread-3-0], Stopped (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,392] INFO [ReplicaFetcherThread-3-0], Shutdown completed (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,392] INFO [ReplicaFetcherThread-3-0], Shutdown completed (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,392] INFO [ReplicaFetcherThread-2-0], Shutting down (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,392] INFO [ReplicaFetcherThread-2-0], Shutting down (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,395] INFO [ReplicaFetcherThread-2-0], Stopped (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,395] INFO [ReplicaFetcherThread-2-0], Stopped (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,395] INFO [ReplicaFetcherThread-2-0], Shutdown completed (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,395] INFO [ReplicaFetcherThread-2-0], Shutdown completed (kafka.server.ReplicaFetcherThread)

At this point, no more errors were written to the log file, but all the consumers were still trying to consume from this broker and were getting Connection Refused exceptions. It wasn't until I cycled the broker that things got back to normal.

Can anyone tell me what happened? Or why the consumers didn't recognize that there was a problem with this broker and start consuming from the other one?

Can I provide any more details? :)

Thank you so much for your time!
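
P.S. For anyone who wants to tally a similar burst in their own broker log, here is a minimal sketch of counting entries of one level inside a time window. It assumes the log4j layout shown above ("[timestamp] LEVEL message"); the function name count_level_in_window and the abbreviated sample lines are made up for illustration and are not taken from my actual log or from Kafka itself.

```python
import re
from datetime import datetime

# Assumed layout, as in the excerpts above: "[YYYY-MM-DD HH:MM:SS,mmm] LEVEL message"
LINE_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\] (\w+) ")
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def count_level_in_window(lines, level, start, end):
    """Count log entries of `level` whose timestamp lies in [start, end]."""
    lo = datetime.strptime(start, TS_FORMAT)
    hi = datetime.strptime(end, TS_FORMAT)
    count = 0
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            # Stack-trace frames and other continuation lines carry no timestamp.
            continue
        timestamp = datetime.strptime(match.group(1), TS_FORMAT)
        if match.group(2) == level and lo <= timestamp <= hi:
            count += 1
    return count


# Hypothetical, abbreviated sample lines in the same layout.
sample = [
    "[2016-04-13 11:01:28,334] WARN Reconnect due to socket error (kafka.consumer.SimpleConsumer)",
    "[2016-04-13 11:01:28,352] ERROR [ReplicaFetcherThread-1-0], Error in fetch",
    "java.net.ConnectException: Connection refused",
    "    at sun.nio.ch.Net.connect0(Native Method)",
    "[2016-04-13 11:01:31,994] ERROR [ReplicaFetcherThread-1-0], Error in fetch",
    "[2016-04-13 11:01:32,061] INFO [ReplicaFetcherThread-1-0], Shutting down",
]

print(count_level_in_window(
    sample, "ERROR", "2016-04-13 11:01:28,352", "2016-04-13 11:01:31,994"))  # prints 2
```

To run it against a real file, pass an open file handle as `lines`; the window bounds are inclusive.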