We are using the Scala producer. On the producer side, we saw a lot of error
messages like this during the period when incoming messages dropped:

Produce request with correlation id 31616255 failed due to
[trace_annotation,10]: kafka.common.NotLeaderForPartitionException

And a few of these (far fewer than the NotLeaderForPartitionException errors):

2015-01-15 17:57:54,412 WARN KafkaSink-kafka_0_8_logtrace_asg-Sender-2
DefaultEventHandler - Failed to send producer request with correlation id
31554484 to broker 20 with
data for partitions [trace_annotation,9]
java.net.SocketTimeoutException
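
For reference, these are roughly the 0.8.x Scala producer settings that govern how these errors are handled: a failed send triggers a metadata refresh and a retry, and once the retries are exhausted the message is dropped. A minimal sketch (the broker list, topic, key and all values below are illustrative, not our actual configuration):

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

object ProducerRetrySketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("metadata.broker.list", "broker1:9092,broker2:9092")
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    props.put("request.required.acks", "1")
    // A failed send (e.g. NotLeaderForPartitionException) triggers a metadata
    // refresh and a retry, up to this many attempts before the message is dropped.
    props.put("message.send.max.retries", "5")
    // Back-off before refreshing metadata and retrying, in milliseconds.
    props.put("retry.backoff.ms", "200")
    // Socket-level timeout; exceeding it surfaces as SocketTimeoutException.
    props.put("request.timeout.ms", "10000")

    val producer = new Producer[String, String](new ProducerConfig(props))
    producer.send(new KeyedMessage[String, String]("trace_annotation", "some-key", "some-value"))
    producer.close()
  }
}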

What's interesting is that the broker cluster recovered from the message
drop only after we restarted the consumers. Also during that time, we
observed that the garbage collection time on the brokers increased
fivefold. The AllBrokersFetchRequestRateAndTimeMs_9X metric on the consumer
side also increased from a few hundred ms to several seconds.

What we don't know is whether the garbage collection time increase is the
cause or the effect of the problem. It seems that after the rebalance, some
resources in the brokers were tied up and were only released after the
consumers were restarted.


On Thu, Jan 15, 2015 at 8:15 AM, Joel Koshy <jjkosh...@gmail.com> wrote:

> > Is leadership rebalance a safe operation?
>
> Yes - we use it routinely. For any partition, there should only be a
> brief (order of seconds) period of rejected messages as leaders move.
> When that happens the client should refresh metadata and discover the
> new leader. Are you using the Java producer? Do you see any errors in
> the producer logs?
>
> On Wed, Jan 14, 2015 at 06:36:27PM -0800, Allen Wang wrote:
> > Hello,
> >
> > We did a manual leadership rebalance (using
> > PreferredReplicaLeaderElectionCommand) under heavy load and found that
> > there is a significant drop of incoming messages to the broker cluster
> for
> > more than an hour. Looking at broker log, we found a lot of errors like
> > this:
> >
> > 2015-01-15 00:00:03,330 ERROR kafka.utils.Logging$class:103
> > [kafka-processor-7101-0] [error] Closing socket for /10.213.156.41
> > because of error
> > java.io.IOException: Connection reset by peer
> >       at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> >       at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> >       at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> >       at sun.nio.ch.IOUtil.read(IOUtil.java:197)
> >       at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
> >       at kafka.utils.Utils$.read(Utils.scala:375)
> >       at
> kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
> >       at kafka.network.Processor.read(SocketServer.scala:347)
> >       at kafka.network.Processor.run(SocketServer.scala:245)
> >       at java.lang.Thread.run(Thread.java:745)
> >
> >
> > Is leadership rebalance a safe operation?
> >
> > Thanks.
>
>
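
For reference, the rebalance in the original message was triggered with
PreferredReplicaLeaderElectionCommand; a minimal sketch of an equivalent
programmatic invocation (the ZooKeeper connect string is illustrative):

import kafka.admin.PreferredReplicaLeaderElectionCommand

object TriggerPreferredLeaderElection {
  def main(args: Array[String]): Unit = {
    // Same as running bin/kafka-preferred-replica-election.sh --zookeeper <connect string>;
    // moves partition leadership back to the preferred (first) replica of each partition.
    PreferredReplicaLeaderElectionCommand.main(
      Array("--zookeeper", "zk1:2181,zk2:2181,zk3:2181"))
  }
}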
