Another kind of error message is found in the Kafka state change log after the leadership rebalance:
2015-01-15 00:01:39,895 WARN kafka.utils.Logging$class:83 [kafka-request-handler-0] [warn] Broker 8 received invalid LeaderAndIsr request with correlation id 221 from controller 0 epoch 19 with an older leader epoch 18 for partition [mapcommandaudit,4], current leader epoch is 18

On Thu, Jan 15, 2015 at 11:55 AM, Allen Wang <aw...@netflix.com> wrote:

> We are using the Scala producer. On the producer side, we have seen a lot
> of error messages during the time of the incoming message drop:
>
> Produce request with correlation id 31616255 failed due to
> [trace_annotation,10]: kafka.common.NotLeaderForPartitionException
>
> And a few (far fewer than the NotLeaderForPartitionException) of those:
>
> 2015-01-15 17:57:54,412 WARN KafkaSink-kafka_0_8_logtrace_asg-Sender-2
> DefaultEventHandler - Failed to send producer request with correlation id
> 31554484 to broker 20 with data for partitions [trace_annotation,9]
> java.net.SocketTimeoutException
>
> What's interesting is that the broker cluster recovered from the message
> drop only after we restarted the consumers. Also during that time, we
> observed that the garbage collection time for the brokers increased
> fivefold. The AllBrokersFetchRequestRateAndTimeMs_9X metric on the
> consumer side also increased from a few hundred ms to several seconds.
>
> What we don't know is whether the garbage collection time increase is the
> cause or the effect of the problem. It seems that after the rebalance,
> some resources in the brokers were tied up and were only released after
> the consumers were restarted.
>
>
> On Thu, Jan 15, 2015 at 8:15 AM, Joel Koshy <jjkosh...@gmail.com> wrote:
>
>> > Is leadership rebalance a safe operation?
>>
>> Yes - we use it routinely. For any partition, there should only be a
>> brief (order of seconds) period of rejected messages as leaders move.
>> When that happens the client should refresh metadata and discover the
>> new leader. Are you using the Java producer? Do you see any errors in
>> the producer logs?
>>
>> On Wed, Jan 14, 2015 at 06:36:27PM -0800, Allen Wang wrote:
>> > Hello,
>> >
>> > We did a manual leadership rebalance (using
>> > PreferredReplicaLeaderElectionCommand) under heavy load and found that
>> > there was a significant drop in incoming messages to the broker
>> > cluster for more than an hour. Looking at the broker log, we found a
>> > lot of errors like this:
>> >
>> > 2015-01-15 00:00:03,330 ERROR kafka.utils.Logging$class:103
>> > [kafka-processor-7101-0] [error] Closing socket for /10.213.156.41
>> > because of error
>> > java.io.IOException: Connection reset by peer
>> >     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>> >     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>> >     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>> >     at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>> >     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>> >     at kafka.utils.Utils$.read(Utils.scala:375)
>> >     at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
>> >     at kafka.network.Processor.read(SocketServer.scala:347)
>> >     at kafka.network.Processor.run(SocketServer.scala:245)
>> >     at java.lang.Thread.run(Thread.java:745)
>> >
>> >
>> > Is leadership rebalance a safe operation?
>> >
>> > Thanks.
>>
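For reference, the producer-side behaviour Joel describes (the client refreshing metadata and discovering the new leader after the election) is controlled by a handful of settings on the 0.8 Scala producer. The sketch below is only an illustration of those knobs, not the configuration used in this thread; the broker list, retry values, key, and message are placeholders:

    import java.util.Properties
    import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

    object ProducerRetrySketch {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        // Placeholder broker list -- not the cluster discussed above.
        props.put("metadata.broker.list", "broker1:9092,broker2:9092")
        props.put("serializer.class", "kafka.serializer.StringEncoder")
        // Require an ack from the partition leader.
        props.put("request.required.acks", "1")
        // On failures such as NotLeaderForPartitionException the producer
        // refreshes metadata and retries, which is how it finds the new leader.
        props.put("message.send.max.retries", "5")
        props.put("retry.backoff.ms", "500")
        // Periodic metadata refresh, independent of send failures.
        props.put("topic.metadata.refresh.interval.ms", "60000")

        val producer = new Producer[String, String](new ProducerConfig(props))
        // Placeholder key/value; topic name taken from the logs above.
        producer.send(new KeyedMessage[String, String]("trace_annotation", "key", "value"))
        producer.close()
      }
    }

With retries and backoff enabled, the brief burst of NotLeaderForPartitionException expected during a preferred replica election should be absorbed by the producer, so an hour-long drop like the one described above is probably not explained by the election alone.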