Sorry if this sounds basic, but can you ping the broker host, or telnet to the broker port?

On Wed, Apr 13, 2016 at 9:55 AM, Chris Neal <cwn...@gmail.com> wrote:
> Hi all.
>
> I'm running a two-node cluster that has been rock solid for almost a year and a half. We experienced an outage of one of the two brokers this morning, and from the logs I'm not sure what happened or how to prevent it.
>
> The Kafka version is 0.8.1.1 with Scala 2.10. The Java version is OpenJDK 1.8.0_65.
>
> Everything was running fine, then:
>
> [2016-04-13 11:01:28,306] WARN Reconnect due to socket error: Received -1 when reading from channel, socket has likely been closed. (kafka.consumer.SimpleConsumer)
> [2016-04-13 11:01:28,306] WARN Reconnect due to socket error: Received -1 when reading from channel, socket has likely been closed. (kafka.consumer.SimpleConsumer)
> [2016-04-13 11:01:28,306] WARN Reconnect due to socket error: Received -1 when reading from channel, socket has likely been closed. (kafka.consumer.SimpleConsumer)
> [2016-04-13 11:01:28,306] WARN Reconnect due to socket error: Received -1 when reading from channel, socket has likely been closed. (kafka.consumer.SimpleConsumer)
> [2016-04-13 11:01:28,334] WARN Reconnect due to socket error: Received -1 when reading from channel, socket has likely been closed. (kafka.consumer.SimpleConsumer)
> [2016-04-13 11:01:28,334] WARN Reconnect due to socket error: Received -1 when reading from channel, socket has likely been closed. (kafka.consumer.SimpleConsumer)
> [2016-04-13 11:01:28,334] WARN Reconnect due to socket error: Received -1 when reading from channel, socket has likely been closed. (kafka.consumer.SimpleConsumer)
> [2016-04-13 11:01:28,334] WARN Reconnect due to socket error: Received -1 when reading from channel, socket has likely been closed. (kafka.consumer.SimpleConsumer)
>
> [2016-04-13 11:01:28,352] ERROR [ReplicaFetcherThread-1-0], Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 9644043; ClientId: ReplicaFetcherThread-1-0; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [snip of every topic and partition on the broker listed here]
> java.net.ConnectException: Connection refused
>     at sun.nio.ch.Net.connect0(Native Method)
>     at sun.nio.ch.Net.connect(Net.java:454)
>     at sun.nio.ch.Net.connect(Net.java:446)
>     at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
>     at kafka.network.BlockingChannel.connect(BlockingChannel.scala:57)
>     at kafka.consumer.SimpleConsumer.connect(SimpleConsumer.scala:44)
>     at kafka.consumer.SimpleConsumer.reconnect(SimpleConsumer.scala:57)
>     at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:79)
>     at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:71)
>     at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:109)
>     at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
>     at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
>     at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
>     at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:108)
>     at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
>     at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
>     at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
>     at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
>     at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:96)
>     at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:88)
>     at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
>
> The logs then spam that ERROR and exception 5406 times between 2016-04-13 11:01:28,352 and 2016-04-13 11:01:31,994.
>
> Then I get this message twice:
> [2016-04-13 11:01:31,997] INFO [ReplicaFetcherManager on broker 1] Removed fetcher for partitions [snip list of all my topics and partitions listed]
>
> Then this:
> [2016-04-13 11:01:32,061] INFO [ReplicaFetcherThread-1-0], Shutting down (kafka.server.ReplicaFetcherThread)
> [2016-04-13 11:01:32,061] INFO [ReplicaFetcherThread-1-0], Shutting down (kafka.server.ReplicaFetcherThread)
> [2016-04-13 11:01:32,113] INFO New leader is 1 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2016-04-13 11:01:32,113] INFO New leader is 1 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2016-04-13 11:01:32,192] INFO [ReplicaFetcherThread-1-0], Shutdown completed (kafka.server.ReplicaFetcherThread)
> [2016-04-13 11:01:32,192] INFO [ReplicaFetcherThread-1-0], Shutdown completed (kafka.server.ReplicaFetcherThread)
> [2016-04-13 11:01:32,192] INFO [ReplicaFetcherThread-1-0], Stopped (kafka.server.ReplicaFetcherThread)
> [2016-04-13 11:01:32,192] INFO [ReplicaFetcherThread-1-0], Stopped (kafka.server.ReplicaFetcherThread)
> [2016-04-13 11:01:32,192] INFO [ReplicaFetcherThread-0-0], Shutting down (kafka.server.ReplicaFetcherThread)
> [2016-04-13 11:01:32,192] INFO [ReplicaFetcherThread-0-0], Shutting down (kafka.server.ReplicaFetcherThread)
> [2016-04-13 11:01:32,194] INFO [ReplicaFetcherThread-0-0], Stopped (kafka.server.ReplicaFetcherThread)
> [2016-04-13 11:01:32,194] INFO [ReplicaFetcherThread-0-0], Stopped (kafka.server.ReplicaFetcherThread)
> [2016-04-13 11:01:32,194] INFO [ReplicaFetcherThread-0-0], Shutdown completed (kafka.server.ReplicaFetcherThread)
> [2016-04-13 11:01:32,194] INFO [ReplicaFetcherThread-0-0], Shutdown completed (kafka.server.ReplicaFetcherThread)
> [2016-04-13 11:01:32,194] INFO [ReplicaFetcherThread-3-0], Shutting down (kafka.server.ReplicaFetcherThread)
> [2016-04-13 11:01:32,194] INFO [ReplicaFetcherThread-3-0], Shutting down (kafka.server.ReplicaFetcherThread)
> [2016-04-13 11:01:32,392] INFO [ReplicaFetcherThread-3-0], Stopped (kafka.server.ReplicaFetcherThread)
> [2016-04-13 11:01:32,392] INFO [ReplicaFetcherThread-3-0], Stopped (kafka.server.ReplicaFetcherThread)
> [2016-04-13 11:01:32,392] INFO [ReplicaFetcherThread-3-0], Shutdown completed (kafka.server.ReplicaFetcherThread)
> [2016-04-13 11:01:32,392] INFO [ReplicaFetcherThread-3-0], Shutdown completed (kafka.server.ReplicaFetcherThread)
> [2016-04-13 11:01:32,392] INFO [ReplicaFetcherThread-2-0], Shutting down (kafka.server.ReplicaFetcherThread)
> [2016-04-13 11:01:32,392] INFO [ReplicaFetcherThread-2-0], Shutting down (kafka.server.ReplicaFetcherThread)
> [2016-04-13 11:01:32,395] INFO [ReplicaFetcherThread-2-0], Stopped (kafka.server.ReplicaFetcherThread)
> [2016-04-13 11:01:32,395] INFO [ReplicaFetcherThread-2-0], Stopped (kafka.server.ReplicaFetcherThread)
> [2016-04-13 11:01:32,395] INFO [ReplicaFetcherThread-2-0], Shutdown completed (kafka.server.ReplicaFetcherThread)
> [2016-04-13 11:01:32,395] INFO [ReplicaFetcherThread-2-0], Shutdown completed (kafka.server.ReplicaFetcherThread)
>
> At this point, no more errors are written to the log file, but all the consumers are still trying to consume from this broker and are getting Connection Refused exceptions. It wasn't until I cycled the broker that things got back to normal.
>
> Can anyone tell me what happened? Or why the consumers didn't recognize that there was a problem with this broker and start consuming from the other one?
>
> Can I provide any more details? :)
>
> Thank you so much for your time!

--
Radha Krishna, Proddaturi
253-234-5657
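
For the telnet-style check asked about at the top of this reply, here is a minimal sketch in Python that could be run from the other broker and from a consumer machine. The function name `can_connect` and the `host:port` command-line format are made up for illustration; they are not part of Kafka or of anything in the original post. The check only tells you whether something is accepting TCP connections on the port. A "Connection refused" like the one in the stack trace usually means the host answered but nothing was listening there.

```python
import socket
import sys


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Roughly what `telnet host port` tells you: whether something is
    accepting connections on that port. It does not speak the Kafka protocol.
    """
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical usage: python check_broker.py broker0:9092 broker1:9092
    # (the host names are placeholders; substitute your own brokers)
    for target in sys.argv[1:]:
        host, _, port = target.rpartition(":")
        state = "reachable" if can_connect(host, int(port)) else "NOT reachable"
        print(f"{target} is {state}")
```

If the port is reachable from one machine but not from another, that points at the network path or a firewall rather than at the broker process itself.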