You seem to be seeing several different symptoms. Maybe we can start with the leader flapping issue. Are there any findings in the controller log?

Jiangjie (Becket) Qin

On 4/16/15, 12:09 PM, "Kyle Banker" <kyleban...@gmail.com> wrote:

>Hi,
>
>I've run into a pretty serious production issue with Kafka 0.8.2, and I'm
>wondering what my options are.
>
>
>ReplicaFetcherThread Error
>
>I have a broker on a 9-node cluster that went down for a couple of hours.
>When it came back up, it started spewing constant errors of the following
>form:
>
>INFO Reconnect due to socket error:
>java.nio.channels.ClosedChannelException (kafka.consumer.SimpleConsumer)
>[2015-04-09 22:38:54,580] WARN [ReplicaFetcherThread-0-7], Error in fetch
>Name: FetchRequest; Version: 0; CorrelationId: 767; ClientId:
>ReplicaFetcherThread-0-7; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1 bytes;
>RequestInfo: [REDACTED] Possible cause: java.io.EOFException: Received -1
>when reading from channel, socket has likely been closed.
>(kafka.server.ReplicaFetcherThread)
>
>
>Massive Logging
>
>This produced around 300GB of new logs in a 24-hour period and rendered
>the broker completely unresponsive.
>
>This broker hosts about 500 partitions spanning 40 or so topics (all
>topics have a replication factor of 3). One topic contains messages up to
>100MB in size. The remaining topics have messages no larger than 10MB.
>
>It appears that I've hit this bug:
>https://issues.apache.org/jira/browse/KAFKA-1196
>
>
>"Leader Flapping"
>
>I can get the broker to come online without logging massively by reducing
>both max.message.bytes and replica.fetch.max.bytes to ~10MB. It then
>starts resyncing all but the largest topic.
>
>Unfortunately, it also starts "leader flapping." That is, it continuously
>acquires and relinquishes partition leadership. There is nothing of note
>in the logs while this is happening, but the consumer offset checker
>clearly shows this. The behavior significantly reduces cluster write
>throughput (since producers are constantly failing).
>
>The only solution I have is to leave the broker off. Is this a known
>"catch-22" situation? Is there anything that can be done to fix it?
>
>Many thanks in advance.
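
A side note on the "resyncing all but the largest topic" observation in the quoted message: as far as I understand 0.8.x, a follower can only copy a message that fits within replica.fetch.max.bytes, which is why the broker documentation recommends keeping that setting at least as large as the biggest allowed message. The sketch below just redoes that comparison with the sizes from the report. The helper `can_replicate` and the topic names are hypothetical, made up for illustration, and are not part of Kafka.

```python
# Sketch only: the helper and topic names are illustrative, not Kafka APIs.
MB = 1024 * 1024


def can_replicate(largest_message_bytes: int, replica_fetch_max_bytes: int) -> bool:
    """Assumed rule: a follower can only fetch a message that fits in one
    per-partition fetch, i.e. one no larger than replica.fetch.max.bytes."""
    return largest_message_bytes <= replica_fetch_max_bytes


# Sizes from the report: one topic with messages up to 100MB, the rest no
# larger than 10MB, and replica.fetch.max.bytes lowered to roughly 10MB.
largest_message_per_topic = {
    "largest-topic": 100 * MB,  # hypothetical name
    "other-topics": 10 * MB,    # hypothetical name
}
replica_fetch_max_bytes = 10 * MB

# Topics whose largest message no longer fits in a replica fetch.
stuck = [
    name
    for name, size in largest_message_per_topic.items()
    if not can_replicate(size, replica_fetch_max_bytes)
]
print(stuck)
```

Under that assumption, only the 100MB topic falls outside the limit, which matches the behaviour described above. It says nothing about the leader flapping itself.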