Good day all,

We're running a good-sized Kafka cluster on 0.8.1, and during our peak
traffic times replication falls behind.  I've been doing some reading
about parameters for tuning replication, but I'd love some real-world
experience and insight.

Some general questions:

* Does Kafka 'like' lots of small partitions for replication, or larger
ones?  i.e., if I'm passing 1 Gbps into a topic, will replication be
happier if that's one partition, or many partitions?

* How can we 'up' the priority of replication over other actions?

* What is the most effective way to monitor the replication lag?  On
brokers with hundreds of partitions, the JMX data gets very muddled and
plentiful.  I'm trying to find something we can graph/dashboard to say
'replication is in X state'.  When we look at it in aggregate, we assume
that 'big numbers are further behind', but we sometimes find negative
numbers as well.  What do those mean?
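
To make the dashboard question concrete, this is roughly the kind of rollup
I have in mind, as a Python sketch.  The summarize_lag function, the
10,000-message threshold, and the choice to clamp negative readings to zero
are all assumptions for illustration, not anything Kafka provides; pulling
the per-partition numbers out of JMX is left out entirely.

```python
from typing import Dict, Optional, Tuple

TopicPartition = Tuple[str, int]


def summarize_lag(lags: Dict[TopicPartition, int], threshold: int = 10000) -> dict:
    """Roll per-partition follower lag (in messages) up into a few graphable numbers.

    `lags` maps (topic, partition) to the lag reading collected for it.
    `threshold` is a hypothetical cut-off for calling a partition "lagging".
    """
    # Assumption: negative readings are treated as "caught up" and clamped to
    # zero, but counted separately so they stay visible on the dashboard.
    clamped = {tp: max(0, lag) for tp, lag in lags.items()}
    worst: Optional[TopicPartition] = (
        max(clamped, key=lambda tp: clamped[tp]) if clamped else None
    )
    return {
        "partitions": len(clamped),
        "negative_readings": sum(1 for lag in lags.values() if lag < 0),
        "lagging_partitions": sum(1 for lag in clamped.values() if lag > threshold),
        "max_lag": clamped[worst] if worst is not None else 0,
        "worst_partition": worst,
        "total_lag": sum(clamped.values()),
    }


if __name__ == "__main__":
    # Made-up sample readings, purely to show the shape of the output.
    sample = {("events", 0): 120, ("events", 1): -35, ("events", 2): 250000}
    print(summarize_lag(sample))
```

The idea would be to graph max_lag and lagging_partitions per broker rather
than every partition individually, and to alert on negative_readings until
we understand where they come from.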

We are looking to make sure our cluster is well balanced, but have run into
a catch-22: we can't move a partition until all of its replicas are back in
the ISR, but the box is so overloaded that it never catches up, so we can't
take any load off.  Lather, rinse, repeat.

Ultimately, we need to add even more hardware to the busy clusters, but
that takes some time, so I'm hoping we can get some ideas about what we can
tune and improve in the meantime.

Thanks,

Todd.
