Hi all.

I'm running a two-node cluster that has been rock solid for almost a year
and a half.  We experienced an outage of one of the two brokers this
morning, and from the logs I can't tell what happened or how to prevent
it from happening again.

The Kafka version is 0.8.1.1 with Scala 2.10.  The Java version is OpenJDK
1.8.0_65.

Everything was running fine, then:

[2016-04-13 11:01:28,306] WARN Reconnect due to socket error: Received -1
when reading from channel, socket has likely been closed.
(kafka.consumer.SimpleConsumer)
[2016-04-13 11:01:28,306] WARN Reconnect due to socket error: Received -1
when reading from channel, socket has likely been closed.
(kafka.consumer.SimpleConsumer)
[2016-04-13 11:01:28,306] WARN Reconnect due to socket error: Received -1
when reading from channel, socket has likely been closed.
(kafka.consumer.SimpleConsumer)
[2016-04-13 11:01:28,306] WARN Reconnect due to socket error: Received -1
when reading from channel, socket has likely been closed.
(kafka.consumer.SimpleConsumer)
[2016-04-13 11:01:28,334] WARN Reconnect due to socket error: Received -1
when reading from channel, socket has likely been closed.
(kafka.consumer.SimpleConsumer)
[2016-04-13 11:01:28,334] WARN Reconnect due to socket error: Received -1
when reading from channel, socket has likely been closed.
(kafka.consumer.SimpleConsumer)
[2016-04-13 11:01:28,334] WARN Reconnect due to socket error: Received -1
when reading from channel, socket has likely been closed.
(kafka.consumer.SimpleConsumer)
[2016-04-13 11:01:28,334] WARN Reconnect due to socket error: Received -1
when reading from channel, socket has likely been closed.
(kafka.consumer.SimpleConsumer)

[2016-04-13 11:01:28,352] ERROR [ReplicaFetcherThread-1-0], Error in fetch
Name: FetchRequest; Version: 0; CorrelationId: 9644043; ClientId:
ReplicaFetcherThread-1-0; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1 bytes;
RequestInfo: [snip of every topic and partition on the broker listed here]
java.net.ConnectException: Connection refused
        at sun.nio.ch.Net.connect0(Native Method)
        at sun.nio.ch.Net.connect(Net.java:454)
        at sun.nio.ch.Net.connect(Net.java:446)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
        at kafka.network.BlockingChannel.connect(BlockingChannel.scala:57)
        at kafka.consumer.SimpleConsumer.connect(SimpleConsumer.scala:44)
        at kafka.consumer.SimpleConsumer.reconnect(SimpleConsumer.scala:57)
        at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:79)
        at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:71)
        at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:109)
        at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
        at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
        at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
        at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:108)
        at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
        at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
        at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
        at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:96)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:88)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)

The log then repeats that ERROR and its exception 5406 times between
2016-04-13 11:01:28,352 and 2016-04-13 11:01:31,994.
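
For anyone who wants to reproduce a count like that from their own broker
log, here is a rough sketch in Python. It assumes the log4j-style line
format shown above ("[timestamp] LEVEL message"); the function name
count_entries and the sample lines are only illustrative and are not part
of Kafka.

```python
import re
from datetime import datetime

# Matches the start of a log entry, e.g. "[2016-04-13 11:01:28,352] ERROR ".
LINE_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\] (\w+) ")
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def count_entries(lines, level, start, end):
    """Count entries of the given level with start <= timestamp <= end."""
    lo = datetime.strptime(start, TS_FORMAT)
    hi = datetime.strptime(end, TS_FORMAT)
    count = 0
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            # Continuation line: a wrapped message or a stack frame.
            continue
        ts = datetime.strptime(m.group(1), TS_FORMAT)
        if m.group(2) == level and lo <= ts <= hi:
            count += 1
    return count


# Illustrative sample; in practice, pass an open log file object instead.
sample = [
    "[2016-04-13 11:01:28,334] WARN Reconnect due to socket error: Received -1",
    "[2016-04-13 11:01:28,352] ERROR [ReplicaFetcherThread-1-0], Error in fetch",
    "java.net.ConnectException: Connection refused",
    "[2016-04-13 11:01:31,997] INFO [ReplicaFetcherManager on broker 1] Removed",
]
print(count_entries(sample, "ERROR",
                    "2016-04-13 11:01:28,352", "2016-04-13 11:01:31,994"))
```

Only lines that begin with a bracketed timestamp are counted, so the stack
trace lines that follow each ERROR do not inflate the total.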

Then I get this message twice:
[2016-04-13 11:01:31,997] INFO [ReplicaFetcherManager on broker 1] Removed
fetcher for partitions [snip list of all my topics and partitions listed]

Then this:
[2016-04-13 11:01:32,061] INFO [ReplicaFetcherThread-1-0], Shutting down
(kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,061] INFO [ReplicaFetcherThread-1-0], Shutting down
(kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,113] INFO New leader is 1
(kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
[2016-04-13 11:01:32,113] INFO New leader is 1
(kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
[2016-04-13 11:01:32,192] INFO [ReplicaFetcherThread-1-0], Shutdown
completed (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,192] INFO [ReplicaFetcherThread-1-0], Shutdown
completed (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,192] INFO [ReplicaFetcherThread-1-0], Stopped
 (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,192] INFO [ReplicaFetcherThread-1-0], Stopped
 (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,192] INFO [ReplicaFetcherThread-0-0], Shutting down
(kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,192] INFO [ReplicaFetcherThread-0-0], Shutting down
(kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,194] INFO [ReplicaFetcherThread-0-0], Stopped
 (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,194] INFO [ReplicaFetcherThread-0-0], Stopped
 (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,194] INFO [ReplicaFetcherThread-0-0], Shutdown
completed (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,194] INFO [ReplicaFetcherThread-0-0], Shutdown
completed (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,194] INFO [ReplicaFetcherThread-3-0], Shutting down
(kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,194] INFO [ReplicaFetcherThread-3-0], Shutting down
(kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,392] INFO [ReplicaFetcherThread-3-0], Stopped
 (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,392] INFO [ReplicaFetcherThread-3-0], Stopped
 (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,392] INFO [ReplicaFetcherThread-3-0], Shutdown
completed (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,392] INFO [ReplicaFetcherThread-3-0], Shutdown
completed (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,392] INFO [ReplicaFetcherThread-2-0], Shutting down
(kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,392] INFO [ReplicaFetcherThread-2-0], Shutting down
(kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,395] INFO [ReplicaFetcherThread-2-0], Stopped
 (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,395] INFO [ReplicaFetcherThread-2-0], Stopped
 (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,395] INFO [ReplicaFetcherThread-2-0], Shutdown
completed (kafka.server.ReplicaFetcherThread)
[2016-04-13 11:01:32,395] INFO [ReplicaFetcherThread-2-0], Shutdown
completed (kafka.server.ReplicaFetcherThread)


At this point, no more errors are written to the log file, but all the
consumers are still trying to consume from this broker and are getting
Connection Refused exceptions.  It wasn't until I restarted the broker that
things got back to normal.

Can anyone tell me what happened?  Or why consumers didn't recognize that
there was a problem with this broker and start consuming from the other one?

Can I provide any more details? :)

Thank you so much for your time!
