Optimal value for leader.imbalance.check.interval.seconds on a large cluster?

Tieman,Brian Tue, 25 Apr 2017 17:33:36 -0700

We have a 50-node Kafka cluster that manages thousands of partitions.  We’re 
interested in optimizing the automatic rebalancing of partition leadership 
during rolling restarts of the cluster.  We have an automated process that 
restarts brokers one by one.  After each restart the process verifies that all 
partition replicas on the broker are back in sync, and then verifies that 
leadership of the broker's preferred partitions has moved back to it, before 
moving on to restart the next broker in the cluster.
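
For reference, the restart procedure described above could be sketched roughly 
as follows.  This is only an illustration, not our actual tooling: the three 
callables (restart, replicas_in_sync, preferred_leaders_restored) are 
hypothetical placeholders for whatever cluster checks are in use, and the poll 
interval and timeout values are arbitrary assumptions.

```python
import time
from typing import Callable, Iterable


def rolling_restart(
    broker_ids: Iterable[int],
    restart: Callable[[int], None],                      # hypothetical: restarts one broker
    replicas_in_sync: Callable[[int], bool],             # hypothetical: all replicas back in the ISR?
    preferred_leaders_restored: Callable[[int], bool],   # hypothetical: leadership moved back?
    poll_s: float = 5.0,        # assumed polling interval
    timeout_s: float = 900.0,   # assumed per-check timeout
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Restart brokers one at a time, gating on both checks before moving on."""
    for broker_id in broker_ids:
        restart(broker_id)
        # First wait for replicas to catch up, then for leadership to return.
        for check in (replicas_in_sync, preferred_leaders_restored):
            deadline = clock() + timeout_s
            while not check(broker_id):
                if clock() >= deadline:
                    raise TimeoutError(
                        f"broker {broker_id}: {check.__name__} did not pass in time"
                    )
                sleep(poll_s)
```

The second wait in the loop is the one that depends on 
leader.imbalance.check.interval.seconds.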


The default value for leader.imbalance.check.interval.seconds is 300s.  That 
implies that, on average, we’re waiting 150s after the broker is fully in sync 
before auto leadership election triggers and moves preferred leaders back.  
Across 50 brokers that is about 7,500s, so this could add roughly 2 hours to a 
full cluster restart.  We’d like to decrease 
leader.imbalance.check.interval.seconds as much as possible without causing 
instability in the cluster.
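
As a rough check of that estimate, here is a small back-of-the-envelope 
calculation.  It assumes each broker finishes catching up at a random point in 
the check cycle, so the mean wait is half the interval; the function name and 
the candidate intervals are purely illustrative, not recommendations.

```python
def expected_restart_overhead_seconds(brokers: int, check_interval_s: float) -> float:
    """Expected total wait for leader rebalancing across a rolling restart.

    Assumes each broker waits, on average, half of the check interval.
    """
    return brokers * check_interval_s / 2.0


if __name__ == "__main__":
    # Candidate intervals chosen only to illustrate the scaling.
    for interval in (300, 60, 30, 10):
        total = expected_restart_overhead_seconds(50, interval)
        print(f"{interval:>4}s interval: {total:>6.0f}s total, {total / 3600:.2f} h")
```

The overhead scales linearly with the interval, so dropping from 300s to 30s 
would cut the added time by a factor of ten under the same assumption.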

Is there documentation somewhere on how to tune 
leader.imbalance.check.interval.seconds for larger clusters?  We don’t 
understand what impact this value has on the cluster as a whole.  I’m assuming 
setting it to 1s is ludicrous ☺ but at the same time we’d like to trim it down 
as much as we reasonably can.

Thanks all!



