:) Not lame. Valid question! Part of the problem is that the exception doesn't tell me where the "Connection refused" is coming from. No IP address, hostname, or application name appears in the error, so I have no idea which system is refusing the connection!
I was able to ssh to the broker server, and the other broker in the cluster was still able to communicate with the problematic one, so there was definitely network connectivity at some level.

Chris

On Wed, Apr 13, 2016 at 7:38 PM, R Krishna <krishna...@gmail.com> wrote:

> Sorry, if this sounds lame, but can you ping or telnet?
>
> On Wed, Apr 13, 2016 at 9:55 AM, Chris Neal <cwn...@gmail.com> wrote:
>
> > Hi all.
> >
> > I'm running a two node cluster that has been rock solid for almost a year
> > and a half. We experienced an outage of one of the two brokers this
> > morning, and from the logs, I'm not sure what happened, and how to prevent
> > it.
> >
> > The Kafka version is 0.8.1.1 with Scala 2.10. Java version is Open JDK
> > version 1.8.0_65
> >
> > Everything running fine, then:
> >
> > [2016-04-13 11:01:28,306] WARN Reconnect due to socket error: Received -1 when reading from channel, socket has likely been closed. (kafka.consumer.SimpleConsumer)
> > [2016-04-13 11:01:28,306] WARN Reconnect due to socket error: Received -1 when reading from channel, socket has likely been closed. (kafka.consumer.SimpleConsumer)
> > [2016-04-13 11:01:28,306] WARN Reconnect due to socket error: Received -1 when reading from channel, socket has likely been closed. (kafka.consumer.SimpleConsumer)
> > [2016-04-13 11:01:28,306] WARN Reconnect due to socket error: Received -1 when reading from channel, socket has likely been closed. (kafka.consumer.SimpleConsumer)
> > [2016-04-13 11:01:28,334] WARN Reconnect due to socket error: Received -1 when reading from channel, socket has likely been closed. (kafka.consumer.SimpleConsumer)
> > [2016-04-13 11:01:28,334] WARN Reconnect due to socket error: Received -1 when reading from channel, socket has likely been closed. (kafka.consumer.SimpleConsumer)
> > [2016-04-13 11:01:28,334] WARN Reconnect due to socket error: Received -1 when reading from channel, socket has likely been closed. (kafka.consumer.SimpleConsumer)
> > [2016-04-13 11:01:28,334] WARN Reconnect due to socket error: Received -1 when reading from channel, socket has likely been closed. (kafka.consumer.SimpleConsumer)
> >
> > [2016-04-13 11:01:28,352] ERROR [ReplicaFetcherThread-1-0], Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 9644043; ClientId: ReplicaFetcherThread-1-0; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo:* [snip of every topic and partition on the broker listed here]*
> > java.net.ConnectException: Connection refused
> >         at sun.nio.ch.Net.connect0(Native Method)
> >         at sun.nio.ch.Net.connect(Net.java:454)
> >         at sun.nio.ch.Net.connect(Net.java:446)
> >         at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
> >         at kafka.network.BlockingChannel.connect(BlockingChannel.scala:57)
> >         at kafka.consumer.SimpleConsumer.connect(SimpleConsumer.scala:44)
> >         at kafka.consumer.SimpleConsumer.reconnect(SimpleConsumer.scala:57)
> >         at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:79)
> >         at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:71)
> >         at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:109)
> >         at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
> >         at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
> >         at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
> >         at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:108)
> >         at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
> >         at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
> >         at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
> >         at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
> >         at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:96)
> >         at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:88)
> >         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
> >
> > The logs then spam that ERROR and Exception 5406 times between:
> > 2016-04-13 11:01:28,352 and 2016-04-13 11:01:31,994
> >
> > Then I get this message twice:
> > [2016-04-13 11:01:31,997] INFO [ReplicaFetcherManager on broker 1] Removed fetcher for partitions [snip list of all my topics and partitions listed]
> >
> > Then this:
> > [2016-04-13 11:01:32,061] INFO [ReplicaFetcherThread-1-0], Shutting down (kafka.server.ReplicaFetcherThread)
> > [2016-04-13 11:01:32,061] INFO [ReplicaFetcherThread-1-0], Shutting down (kafka.server.ReplicaFetcherThread)
> > [2016-04-13 11:01:32,113] INFO New leader is 1 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> > [2016-04-13 11:01:32,113] INFO New leader is 1 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> > [2016-04-13 11:01:32,192] INFO [ReplicaFetcherThread-1-0], Shutdown completed (kafka.server.ReplicaFetcherThread)
> > [2016-04-13 11:01:32,192] INFO [ReplicaFetcherThread-1-0], Shutdown completed (kafka.server.ReplicaFetcherThread)
> > [2016-04-13 11:01:32,192] INFO [ReplicaFetcherThread-1-0], Stopped (kafka.server.ReplicaFetcherThread)
> > [2016-04-13 11:01:32,192] INFO [ReplicaFetcherThread-1-0], Stopped (kafka.server.ReplicaFetcherThread)
> > [2016-04-13 11:01:32,192] INFO [ReplicaFetcherThread-0-0], Shutting down (kafka.server.ReplicaFetcherThread)
> > [2016-04-13 11:01:32,192] INFO [ReplicaFetcherThread-0-0], Shutting down (kafka.server.ReplicaFetcherThread)
> > [2016-04-13 11:01:32,194] INFO [ReplicaFetcherThread-0-0], Stopped (kafka.server.ReplicaFetcherThread)
> > [2016-04-13 11:01:32,194] INFO [ReplicaFetcherThread-0-0], Stopped (kafka.server.ReplicaFetcherThread)
> > [2016-04-13 11:01:32,194] INFO [ReplicaFetcherThread-0-0], Shutdown completed (kafka.server.ReplicaFetcherThread)
> > [2016-04-13 11:01:32,194] INFO [ReplicaFetcherThread-0-0], Shutdown completed (kafka.server.ReplicaFetcherThread)
> > [2016-04-13 11:01:32,194] INFO [ReplicaFetcherThread-3-0], Shutting down (kafka.server.ReplicaFetcherThread)
> > [2016-04-13 11:01:32,194] INFO [ReplicaFetcherThread-3-0], Shutting down (kafka.server.ReplicaFetcherThread)
> > [2016-04-13 11:01:32,392] INFO [ReplicaFetcherThread-3-0], Stopped (kafka.server.ReplicaFetcherThread)
> > [2016-04-13 11:01:32,392] INFO [ReplicaFetcherThread-3-0], Stopped (kafka.server.ReplicaFetcherThread)
> > [2016-04-13 11:01:32,392] INFO [ReplicaFetcherThread-3-0], Shutdown completed (kafka.server.ReplicaFetcherThread)
> > [2016-04-13 11:01:32,392] INFO [ReplicaFetcherThread-3-0], Shutdown completed (kafka.server.ReplicaFetcherThread)
> > [2016-04-13 11:01:32,392] INFO [ReplicaFetcherThread-2-0], Shutting down (kafka.server.ReplicaFetcherThread)
> > [2016-04-13 11:01:32,392] INFO [ReplicaFetcherThread-2-0], Shutting down (kafka.server.ReplicaFetcherThread)
> > [2016-04-13 11:01:32,395] INFO [ReplicaFetcherThread-2-0], Stopped (kafka.server.ReplicaFetcherThread)
> > [2016-04-13 11:01:32,395] INFO [ReplicaFetcherThread-2-0], Stopped (kafka.server.ReplicaFetcherThread)
> > [2016-04-13 11:01:32,395] INFO [ReplicaFetcherThread-2-0], Shutdown completed (kafka.server.ReplicaFetcherThread)
> > [2016-04-13 11:01:32,395] INFO [ReplicaFetcherThread-2-0], Shutdown completed (kafka.server.ReplicaFetcherThread)
> >
> >
> > At this point, there are no more errors to the log file, but all the
> > consumers are still trying to consume from this broker, and are getting
> > Connection Refused exceptions. It isn't until I cycled the broker that
> > things got back to normal.
> >
> > Can anyone tell me what happened? Or why consumers didn't recognize that
> > there was a problem with this broker and start consuming from the other
> > one?
> >
> > Can I provide any more details? :)
> >
> > Thank you so much for your time!
> >
>
>
> --
> Radha Krishna, Proddaturi
> 253-234-5657
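A note on the ping/telnet suggestion in this thread: a successful ssh only shows that the host is reachable on the ssh port, whereas "Connection refused" generally means the host answered but nothing was accepting connections on the port being dialed. The following is a minimal sketch of a telnet-style check that reports which host:port it tried, since the exception above does not. The hostnames and the port numbers (9092 for the broker, 22 for ssh) are placeholders assumed for illustration, not values taken from this cluster.

```python
import socket


def probe(host, port, timeout=3.0):
    """Attempt a plain TCP connect and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <detail>".
    """
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Host reachable, but no listener accepted on this port.
        return "refused"
    except socket.timeout:
        # No answer at all, e.g. packets dropped along the way.
        return "timeout"
    except OSError as exc:
        # Anything else, such as a DNS failure or an unreachable network.
        return f"error: {exc}"


if __name__ == "__main__":
    # Hypothetical endpoints; substitute the real broker hosts and ports.
    targets = [
        ("broker0.example.com", 22),    # ssh, assumed port
        ("broker0.example.com", 9092),  # Kafka listener, assumed port
    ]
    for host, port in targets:
        print(f"{host}:{port} -> {probe(host, port)}")
```

If the ssh port reports "open" while the broker port reports "refused", that would be consistent with the machine being up and the broker process no longer listening, though it does not by itself explain why.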