Good day all,

We're running a good-sized Kafka cluster on 0.8.1, and during our peak traffic times replication falls behind. I've been doing some reading about parameters for tuning replication, but I'd love some real-world experience and insight.

Some general questions:

* Does Kafka 'like' lots of small partitions for replication, or larger ones? i.e., if I'm passing 1 Gbps into a topic, will replication be happier if that's one partition or many partitions?
* How can we 'up' the priority of replication over other actions?
* What is the most effective way to monitor replication lag? On brokers with hundreds of partitions, the JMX data gets very muddled and plentiful. I'm trying to find something we can graph/dashboard to say 'replication is in X state'. When we look at it in aggregate, we assume that 'big numbers are further behind', but then we sometimes find negative numbers as well?

We are looking to make sure our cluster is well balanced, but have run into the problem that we can't move a partition until all of its replicas are in the ISR, and the box is so overloaded that it never catches up, so we can't take any load off. Lather, rinse, repeat.

Ultimately, we need to add even more hardware to the busy clusters, but that takes some time, so I'm hoping we can get some ideas about what we can tune and improve.

Thanks,
Todd.