Another kind of error message shows up in the Kafka state-change log after
the leadership rebalance:

2015-01-15 00:01:39,895 WARN  kafka.utils.Logging$class:83
[kafka-request-handler-0] [warn] Broker 8 received invalid
LeaderAndIsr request with correlation id 221 from controller 0 epoch
19 with an older leader epoch 18 for partition [mapcommandaudit,4],
current leader epoch is 18



On Thu, Jan 15, 2015 at 11:55 AM, Allen Wang <aw...@netflix.com> wrote:

> We are using the Scala producer. On the producer side, we have seen a lot
> of error messages during the period of the incoming-message drop:
>
> Produce request with correlation id 31616255 failed due to
> [trace_annotation,10]: kafka.common.NotLeaderForPartitionException
>
> And a few of these (far fewer than the NotLeaderForPartitionException errors):
>
> 2015-01-15 17:57:54,412 WARN KafkaSink-kafka_0_8_logtrace_asg-Sender-2
> DefaultEventHandler - Failed to send producer request with correlation id
> 31554484 to broker 20 with
> data for partitions [trace_annotation,9]
> java.net.SocketTimeoutException
>
> What's interesting is that the broker cluster recovered from the message
> drop only after we restarted the consumers. Also during that time, we
> observed that the garbage collection time for the brokers increased about
> fivefold. The AllBrokersFetchRequestRateAndTimeMs_9X metric on the consumer
> side also increased from a few hundred ms to several seconds.
>
> What we don't know is whether the garbage collection time increase is the
> cause or the effect of the problem. It seems that after the rebalance, some
> resources in the brokers were tied up and were only released after the
> consumers were restarted.
>
>
> On Thu, Jan 15, 2015 at 8:15 AM, Joel Koshy <jjkosh...@gmail.com> wrote:
>
>> > Is leadership rebalance a safe operation?
>>
>> Yes - we use it routinely. For any partition, there should only be a
>> brief (order of seconds) period of rejected messages as leaders move.
>> When that happens the client should refresh metadata and discover the
>> new leader. Are you using the Java producer? Do you see any errors in
>> the producer logs?
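>>
>> If it is the old Scala producer, the settings below are roughly the knobs
>> that control how quickly it retries and re-fetches metadata after a
>> NotLeaderForPartitionException. This is only a minimal sketch; the broker
>> list and values are placeholders, not a recommendation:
>>
>> import java.util.Properties
>> import kafka.producer.{KeyedMessage, Producer, ProducerConfig}
>>
>> val props = new Properties()
>> // Placeholder bootstrap brokers.
>> props.put("metadata.broker.list", "broker1:9092,broker2:9092")
>> props.put("serializer.class", "kafka.serializer.StringEncoder")
>> // Wait for the partition leader to acknowledge each request.
>> props.put("request.required.acks", "1")
>> // Retry failed sends a few times, backing off between attempts so a
>> // metadata refresh can pick up the newly elected leader.
>> props.put("message.send.max.retries", "3")
>> props.put("retry.backoff.ms", "100")
>> // Periodic metadata refresh; a failed send also triggers a refresh.
>> props.put("topic.metadata.refresh.interval.ms", "600000")
>>
>> val producer = new Producer[String, String](new ProducerConfig(props))
>> producer.send(new KeyedMessage[String, String]("trace_annotation", "key", "msg"))
>> producer.close()
>>
>> With retries and backoff along these lines, a leader move that completes in
>> a few seconds should be absorbed by the producer without dropping messages.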
>>
>> On Wed, Jan 14, 2015 at 06:36:27PM -0800, Allen Wang wrote:
>> > Hello,
>> >
>> > We did a manual leadership rebalance (using
>> > PreferredReplicaLeaderElectionCommand; see the invocation sketch after
>> > the stack trace below) under heavy load and found that there was a
>> > significant drop in incoming messages to the broker cluster for more than
>> > an hour. Looking at the broker log, we found a lot of errors like this:
>> >
>> > 2015-01-15 00:00:03,330 ERROR kafka.utils.Logging$class:103
>> > [kafka-processor-7101-0] [error] Closing socket for /10.213.156.41
>> > because of error
>> > java.io.IOException: Connection reset by peer
>> >       at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>> >       at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>> >       at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>> >       at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>> >       at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>> >       at kafka.utils.Utils$.read(Utils.scala:375)
>> >       at
>> kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
>> >       at kafka.network.Processor.read(SocketServer.scala:347)
>> >       at kafka.network.Processor.run(SocketServer.scala:245)
>> >       at java.lang.Thread.run(Thread.java:745)
>> >
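>> > For reference, the election was kicked off through
>> > PreferredReplicaLeaderElectionCommand, roughly as sketched below. The
>> > ZooKeeper connect string is a placeholder; this is the same entry point
>> > the bin/kafka-preferred-replica-election.sh wrapper calls, and omitting
>> > --path-to-json-file runs the election for all partitions:
>> >
>> > import kafka.admin.PreferredReplicaLeaderElectionCommand
>> >
>> > // Trigger the preferred-replica election for every partition in the
>> > // cluster via the admin command's main entry point.
>> > PreferredReplicaLeaderElectionCommand.main(
>> >   Array("--zookeeper", "zk1:2181,zk2:2181,zk3:2181/kafka"))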
>> >
>> > Is leadership rebalance a safe operation?
>> >
>> > Thanks.
>>
>>
>
