The ops experts can share more details, but here are some comments:
> 
> * Does Kafka 'like' lots of small partitions for replication, or larger
> ones?  ie: if I'm passing 1Gbps into a topic, will replication be happier
> if that's one partition, or many partitions?

Since you also have to account for the NIC utilization by replica
fetches, it is better to split a heavy topic into many partitions so
that the produce and replication traffic is spread across brokers.
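
As a rough sketch of the arithmetic (the function name and the
assumptions are mine for illustration: leaders spread evenly over
brokers, traffic spread evenly over partitions, consumer reads and
inbound follower traffic ignored):

```python
def leader_nic_load_gbps(topic_in_gbps, partitions, brokers, replication_factor):
    """Estimate NIC load on the busiest leader broker for one topic.

    Hypothetical helper, not part of Kafka. Returns a tuple
    (produce_in_gbps, replica_fetch_out_gbps).
    """
    per_partition = topic_in_gbps / partitions
    # The busiest broker leads ceil(partitions / brokers) partitions.
    leaders_on_busiest = -(-partitions // brokers)
    produce_in = per_partition * leaders_on_busiest
    # Each leader also serves (replication_factor - 1) follower fetch streams.
    replica_out = produce_in * (replication_factor - 1)
    return produce_in, replica_out


# One partition: a single broker takes all the produce traffic and
# serves every follower fetch for it.
print(leader_nic_load_gbps(1.0, 1, 4, 3))
# Eight partitions on four brokers: the same traffic is shared.
print(leader_nic_load_gbps(1.0, 8, 4, 3))
```

With these made-up numbers, the single-partition case puts 1 Gbps in
and 2 Gbps out on one broker, while eight partitions bring the busiest
broker down to 0.25 Gbps in and 0.5 Gbps out.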

> * How can we 'up' the priority of replication over other actions?

If you do the above, this should not be necessary, but you could
increase the number of replica fetcher threads (num.replica.fetchers).
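
For example, in the broker config (the value below is only an
illustration, not a recommendation; the default is 1):

```
# server.properties on each broker
num.replica.fetchers=4
```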

> * What is the most effective way to monitor the replication lag?  On
> brokers with hundreds of partitions, the JMX data starts getting very
> muddled and plentiful.  I'm trying to find something we can graph/dashboard
> to say 'replication is in X state'.  When we look at it in aggregate, we
> assume that 'big numbers are further behind', but then sometimes find
> negative numbers as well?

The easiest mbean to look at is the under-replicated partition count.
This is at the broker level, so it is coarse-grained. If it is > 0, you
can use various tools to do mbean queries to figure out which
partitions are lagging behind. Another thing you can look at is the ISR
shrink/expand rate. If you see a lot of churn, you may need to tune the
settings that affect ISR maintenance (replica.lag.time.max.ms,
replica.lag.max.messages).
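
To turn that into a single 'replication is in X state' value for a
dashboard, something along these lines could work. This is only a
sketch: it assumes you already collect the per-broker
UnderReplicatedPartitions gauge and the ISR shrink rate with whatever
JMX tool you use, and the function name, state names, and churn
threshold are made up for illustration.

```python
def replication_state(urp_by_broker, isr_shrinks_per_sec=0.0, churn_threshold=1.0):
    """Collapse per-broker replication metrics into one coarse state.

    urp_by_broker: dict of broker id -> under-replicated partition count.
    isr_shrinks_per_sec: cluster-wide ISR shrink rate (already collected).
    churn_threshold: hypothetical cutoff for "a lot of churn".
    Returns (state, brokers_reporting_under_replication).
    """
    lagging = sorted(b for b, count in urp_by_broker.items() if count > 0)
    if lagging:
        # At least one broker leads a partition with out-of-sync replicas.
        return "UNDER_REPLICATED", lagging
    if isr_shrinks_per_sec > churn_threshold:
        # Fully replicated right now, but replicas keep dropping out of the ISR.
        return "ISR_CHURN", lagging
    return "OK", lagging


print(replication_state({"broker1": 0, "broker2": 3}))
print(replication_state({"broker1": 0, "broker2": 0}, isr_shrinks_per_sec=2.5))
```

The list of brokers in the result tells you where to run the more
detailed per-partition mbean queries.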


-- 
Joel
