Broker rejoin with big replica lag

Andrew Otto Wed, 05 Feb 2014 09:06:41 -0800

Hi all!

I recently had a problem where one of my two brokers would not reboot 
due to a hardware failure.  The broker was down for almost a week before the 
required part came in and our datacenter tech fixed it.  During that time, 
the live broker was able to handle all messages for all topics and partitions 
(which is awesome!).  The repaired broker is now back, and is trying to catch up 
on the messages that it missed while it was down.  The lower volume topics are 
all caught up, but I have one high volume topic (around 40K msgs/sec) that is 
taking much longer.  I just took a few samples of Replica-MaxLag to see how 
long it would take to catch up.  Currently, it is behind about 12.5 million 
messages and is catching up at a rate of about 1600 msgs/sec.  At that rate, 
it’ll take around 9 days before the replica is caught up to the leader.
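
For reference, a minimal sketch of how an estimate like this can be derived from
Replica-MaxLag samples.  The function names and the 60-second sampling interval
are hypothetical, and it assumes the catch-up rate is the net rate at which the
lag shrinks (replica fetch rate minus the leader's incoming rate):

```python
def net_rate_from_samples(lag_start, lag_end, interval_seconds):
    """Net catch-up rate (msgs/sec) from two Replica-MaxLag samples."""
    return (lag_start - lag_end) / interval_seconds


def catch_up_eta_seconds(lag_messages, net_catch_up_rate):
    """Seconds until the lag reaches zero at a constant net catch-up rate."""
    if net_catch_up_rate <= 0:
        raise ValueError("replica is not gaining on the leader")
    return lag_messages / net_catch_up_rate


# Hypothetical pair of samples taken 60 seconds apart.
rate = net_rate_from_samples(12_500_000, 12_404_000, 60)
eta = catch_up_eta_seconds(12_500_000, rate)
print(f"rate={rate:.0f} msgs/sec, eta={eta / 3600:.1f} hours")
```

Plugging in 12.5 million messages and 1600 msgs/sec gives roughly 2.2 hours.  A
nine-day catch-up at 1600 msgs/sec would instead correspond to a lag on the
order of 1.2 billion messages, so the sampled figures are worth double-checking.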


Is there any way to speed this up?

Or, alternatively, I don’t actually care about this topic’s history.  It is a 
new topic, and I know that it doesn't yet have any consumers.  I’d be fine with 
instructing both brokers to drop old logs and just start from the top of the 
log.  I could do this by manually deleting the topic (its Kafka data files and 
its ZooKeeper entries), but to do so properly with 0.8.0 I think I’d have to shut down the 
whole cluster, correct?  I’d rather not do this, as another topic does have a 
consumer and I don’t want to lose messages for it.

Thanks!
-Andrew Otto
