Joel,

Thanks for your input - it fits what I was thinking, so it's good confirmation.

> The easiest mbean to look at is the underreplicated partition count.
> This is at the broker-level so it is coarse-grained. If it is > 0 you
> can use various tools to do mbean queries to figure out which
> partition is lagging behind. Another thing you can look at is the ISR
> shrink/expand rate. If you see a lot of churn you may need to tune the
> settings that affect ISR maintenance (replica.lag.time.max.ms,
> replica.lag.max.messages).

and Todd Palino said:

> Under Replicated Count is the metric that we monitor to make sure the
> cluster is healthy. We report/alert on under replicated partitions.

What I'm trying to do is get away from event-driven alerts to the NOC/ops people and give them something qualitative (replication is {ok|a little behind|behind|really behind|really really behind|oh no we're doomed}) so we know how to respond appropriately.

I don't really want ops folks getting called at 2am on a Saturday because a single replica is behind by a few thousand messages. However, I *do* want someone called if we're a billion messages behind.

If I look at 'KAFKA|kafka.server|FetcherLagMetrics|ReplicaFetcherThread-.*:Value', can I use that as my measure of badness/behindness?

In a similar vein, at what point do you/Todd/others wake someone up? How many replicas out of sync, and by how much? What is the major concern point, vs 'meh, it'll catch up soon'?

I know it's likely different between environments, but as I'm new to this, I'd love to know how others see things.

Thanks!
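For illustration, here is a minimal Python sketch of the qualitative scale described above. It assumes the per-fetcher lag values (messages behind) have already been pulled from the FetcherLagMetrics mbeans by some other tool. The threshold numbers, the paging cutoff, and the names `classify_lag` and `should_page` are all hypothetical; only the two rough endpoints (a few thousand messages is fine, a billion is not) come from the message itself.

```python
from bisect import bisect_right

# Severity labels, taken from the scale proposed in the message.
LEVELS = [
    "ok",
    "a little behind",
    "behind",
    "really behind",
    "really really behind",
    "oh no we're doomed",
]

# ASSUMPTION: illustrative boundaries in messages behind. Only "a few
# thousand is fine" and "a billion is not" come from the discussion;
# the rest are placeholders to tune per environment.
THRESHOLDS = [10_000, 1_000_000, 100_000_000, 1_000_000_000, 10_000_000_000]

# ASSUMPTION: the level at which someone gets paged.
PAGE_AT = "really really behind"


def classify_lag(fetcher_lags):
    """Map {fetcher metric name: messages behind} to a severity label.

    The worst (largest) lag across all fetchers decides the label.
    """
    worst = max(fetcher_lags.values(), default=0)
    return LEVELS[bisect_right(THRESHOLDS, worst)]


def should_page(level):
    """Return True if the label is at or beyond the paging cutoff."""
    return LEVELS.index(level) >= LEVELS.index(PAGE_AT)


if __name__ == "__main__":
    # Hypothetical sample readings, not real cluster data.
    sample = {"ReplicaFetcherThread-0-1": 3_000, "ReplicaFetcherThread-0-2": 2_000_000}
    level = classify_lag(sample)
    print(level, "page" if should_page(level) else "no page")
```

Using the worst single lag is a deliberately simple choice; counting how many replicas exceed a given level, or requiring the lag to persist over several samples, would be natural refinements.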