I think your answers are pretty spot-on, Joel. The under-replicated
partition count is the metric we monitor to make sure the cluster is
healthy. It tells us when a broker is down (the count is elevated on
every broker except the one that went down), or when a broker is
struggling (low counts fluctuating across a few hosts).
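For illustration, here is a minimal Java sketch of that check (this is
not our actual tooling; the broker hostnames and JMX port 9999 are
placeholders) that polls the UnderReplicatedPartitions gauge on each
broker over JMX:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Minimal sketch, not production monitoring: poll each broker's
// UnderReplicatedPartitions gauge. Hosts and JMX port are placeholders.
public class UnderReplicatedCheck {
    public static void main(String[] args) throws Exception {
        String[] brokers = {"broker1:9999", "broker2:9999", "broker3:9999"};
        ObjectName gauge = new ObjectName(
            "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions");
        for (String broker : brokers) {
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + broker + "/jmxrmi");
            try (JMXConnector conn = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbsc = conn.getMBeanServerConnection();
                // Elevated on every broker but one -> that broker is down;
                // low, fluctuating counts -> a broker may be struggling.
                System.out.println(broker + " under-replicated partitions: "
                    + mbsc.getAttribute(gauge, "Value"));
            }
        }
    }
}

Graphing the per-broker values side by side makes the two failure
signatures described above easy to spot.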
As far as lots of small partitions vs. a few large partitions, we prefer
the former: it means we can spread the load out over the brokers more
evenly.

-Todd

On Tue, Nov 4, 2014 at 10:07 AM, Joel Koshy <jjkosh...@gmail.com> wrote:

> Ops experts can share more details, but here are some comments:
>
> > * Does Kafka 'like' lots of small partitions for replication, or
> > larger ones? i.e., if I'm passing 1 Gbps into a topic, will
> > replication be happier if that's one partition, or many partitions?
>
> Since you also have to account for the NIC utilization by replica
> fetches, it is better to split a heavy topic into many partitions.
>
> > * How can we 'up' the priority of replication over other actions?
>
> If you do the above, this should not be necessary, but you could
> increase the number of replica fetchers (num.replica.fetchers).
>
> > * What is the most effective way to monitor the replication lag? On
> > brokers with hundreds of partitions, the JMX data starts getting very
> > muddled and plentiful. I'm trying to find something we can
> > graph/dashboard to say 'replication is in X state'. When we look at
> > it in aggregate, we assume that 'big numbers are further behind', but
> > then sometimes find negative numbers as well?
>
> The easiest mbean to look at is the under-replicated partition count.
> This is at the broker level, so it is coarse-grained. If it is > 0, you
> can use various tools to do mbean queries to figure out which partition
> is lagging behind. Another thing you can look at is the ISR
> shrink/expand rate. If you see a lot of churn, you may need to tune the
> settings that affect ISR maintenance (replica.lag.time.max.ms,
> replica.lag.max.messages).
>
> --
> Joel
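To make Joel's "mbean queries" suggestion above concrete, here is a
rough sketch, assuming 0.8.2-style metric names (the naming scheme
differs across versions) and a placeholder broker1:9999. It lists
per-partition replica-fetch lag on one broker and prints the ISR churn
meters Joel mentions:

import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Rough sketch of the kind of mbean queries Joel describes, not an
// official utility. MBean names assume the 0.8.2 naming scheme and may
// differ on other versions; broker1:9999 is a placeholder.
public class ReplicationLagProbe {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi");
        try (JMXConnector conn = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = conn.getMBeanServerConnection();

            // Per-partition lag of this broker's replica fetchers.
            Set<ObjectName> lagGauges = mbsc.queryNames(new ObjectName(
                "kafka.server:type=FetcherLagMetrics,name=ConsumerLag,*"),
                null);
            for (ObjectName gauge : lagGauges) {
                System.out.println(gauge.getKeyProperty("topic") + "-"
                    + gauge.getKeyProperty("partition") + " lag: "
                    + mbsc.getAttribute(gauge, "Value"));
            }

            // ISR churn: high shrink/expand rates suggest the ISR
            // maintenance settings need tuning.
            for (String name : new String[]
                    {"IsrShrinksPerSec", "IsrExpandsPerSec"}) {
                ObjectName meter = new ObjectName(
                    "kafka.server:type=ReplicaManager,name=" + name);
                System.out.println(name + " (1m rate): "
                    + mbsc.getAttribute(meter, "OneMinuteRate"));
            }
        }
    }
}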
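And for reference, the broker-side settings Joel names, as they would
appear in server.properties. The lag values below are just the 0.8.x
defaults and the fetcher count is bumped purely for illustration; none
of this is tuning advice:

# More replica fetcher threads = more parallel replication (default: 1).
num.replica.fetchers=4

# ISR maintenance: a follower that falls behind by more than this much
# time or this many messages is dropped from the ISR.
replica.lag.time.max.ms=10000
replica.lag.max.messages=4000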