Re: ReplicaFetcherThread Error, Massive Logging, and Leader Flapping

Jiangjie Qin Thu, 16 Apr 2015 13:20:23 -0700

It seems you are seeing several different symptoms.
Maybe we can start with the leader flapping issue. Are there any findings in
the controller log?


Jiangjie (Becket) Qin
 


On 4/16/15, 12:09 PM, "Kyle Banker" <kyleban...@gmail.com> wrote:

>Hi,
>
>I've run into a pretty serious production issue with Kafka 0.8.2, and I'm
>wondering what my options are.
>
>
>ReplicaFetcherThread Error
>
>I have a broker on a 9-node cluster that went down for a couple of hours.
>When it came back up, it started spewing constant errors of the following
>form:
>
>INFO Reconnect due to socket error: java.nio.channels.ClosedChannelException (kafka.consumer.SimpleConsumer)
>[2015-04-09 22:38:54,580] WARN [ReplicaFetcherThread-0-7], Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 767; ClientId: ReplicaFetcherThread-0-7; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [REDACTED] Possible cause: java.io.EOFException: Received -1 when reading from channel, socket has likely been closed. (kafka.server.ReplicaFetcherThread)
>
>
>Massive Logging
>
>This produced around 300GB of new logs in a 24-hour period and rendered
>the broker completely unresponsive.
>
>This broker hosts about 500 partitions spanning 40 or so topics (all
>topics have a replication factor of 3). One topic contains messages up
>to 100MB in size. The remaining topics have messages no larger than 10MB.
>
>It appears that I've hit this bug:
>https://issues.apache.org/jira/browse/KAFKA-1196
>
>
>"Leader Flapping"
>
>I can get the broker to come online without logging massively by reducing
>both max.message.bytes and replica.fetch.max.bytes to ~10MB. It then
>starts resyncing all but the largest topic.
>
>Unfortunately, it also starts "leader flapping." That is, it continuously
>acquires and relinquishes partition leadership. There is nothing of note
>in the logs while this is happening, but the consumer offset checker
>clearly shows this. The behavior significantly reduces cluster write
>throughput (since producers are constantly failing).
>
>The only solution I have is to leave the broker off. Is this a known
>"catch-22" situation? Is there anything that can be done to fix it?
>
>Many thanks in advance.
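
The quoted message ties the logging problem to the size of replica.fetch.max.bytes, so a rough back-of-the-envelope sketch may help frame it. It assumes (this is not confirmed anywhere in the thread) that the total size of a single fetch response has to fit in a signed 32-bit integer, and that the worst case is every partition in the request returning a full replica.fetch.max.bytes. The function name and the MB constant are hypothetical helpers, not Kafka APIs.

```python
# Rough sizing sketch; not Kafka code. Assumes a fetch response's total
# size must fit in a signed 32-bit integer (an assumption, see lead-in).

INT32_MAX = 2**31 - 1
MB = 1024 * 1024


def max_partitions_per_fetch(fetch_max_bytes: int) -> int:
    """Largest number of partitions whose worst-case combined fetch size
    (partitions * fetch_max_bytes) still fits in a signed 32-bit integer."""
    if fetch_max_bytes <= 0:
        raise ValueError("fetch_max_bytes must be positive")
    return INT32_MAX // fetch_max_bytes


if __name__ == "__main__":
    # The two sizes mentioned in the quoted message: ~100MB and ~10MB.
    for label, size in (("100MB", 100 * MB), ("10MB", 10 * MB)):
        print(label, max_partitions_per_fetch(size))
```

Under these assumptions, a 100MB fetch size leaves room for only 20 partitions per fetch request, while 10MB leaves room for 204. How many partitions one replica fetcher actually requests from a single source broker depends on how leadership is spread across the cluster, which the thread does not state.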
