We have our alert threshold for under replicated partitions (URP) set at anything over 2. We picked that number because we have a cluster that tends to take very high traffic for short periods of time, and 2 gets us around the false positives (with a careful balance of the partitions in the cluster). We're also holding ourselves to a fairly strict standard, so whenever we see URP for any reason, we investigate what's going on and resolve it so it doesn't happen again.
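
A minimal sketch of that kind of threshold check, assuming the URP count for each broker has already been collected elsewhere (for example by polling the broker's under replicated partition mbean). The function name, the constant, and the broker names are hypothetical and are not taken from any actual alerting setup; treating the threshold as per-broker is also an assumption.

```python
URP_THRESHOLD = 2  # alert on anything over 2, as described above


def brokers_to_alert(urp_by_broker, threshold=URP_THRESHOLD):
    """Return the brokers whose URP count exceeds the threshold.

    urp_by_broker: dict mapping a broker name to its current
    under replicated partition count (assumed to be polled elsewhere).
    """
    return sorted(
        broker for broker, count in urp_by_broker.items() if count > threshold
    )


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample = {"broker-1": 0, "broker-2": 2, "broker-3": 5}
    print(brokers_to_alert(sample))
```

A count of exactly 2 does not alert here, which is what lets short traffic bursts pass without a false positive.
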
Technically, we're supposed to be called for any URP alert. In reality, we don't have any in normal operation unless we have a problem like a down broker. If replicas are falling behind due to network congestion (or other resource exhaustion), we balance things out, expand the cluster, or find our problem producer or consumer and fix them.

-Todd

On Tue, Nov 4, 2014 at 12:13 PM, Todd S <t...@borked.ca> wrote:

> Joel,
>
> Thanks for your input - it fits what I was thinking, so it's good
> confirmation.
>
> > The easiest mbean to look at is the underreplicated partition count.
> > This is at the broker-level so it is coarse-grained. If it is > 0 you
> > can use various tools to do mbean queries to figure out which
> > partition is lagging behind. Another thing you can look at is the ISR
> > shrink/expand rate. If you see a lot of churn you may need to tune the
> > settings that affect ISR maintenance (replica.lag.time.max.ms,
> > replica.lag.max.messages).
>
> and Todd Palino said:
>
> > Under Replicated Count is the metric that we monitor to make sure the
> > cluster is healthy.
>
> We report/alert on under replicated partitions. what i'm trying to do
> is get away from event driven alerts to the NOC/ops people, and give
> them something qualitative (replication is {ok|a little
> behind|behind|really behind|really really behind|oh no we're doomed}
> so we know how to respond appropriately. I don't really want ops
> folks getting called at 2am on a Saturday because a single replica is
> behind by a few thousand messages .. however I *do* want someone
> called if we're a billion messages behind.
>
> If I look at
> 'KAFKA|kafka.server|FetcherLagMetrics|ReplicaFetcherThread-.*:Value'
> , can I use that as my measure of badness/behindness?
>
>
> In a similar vein, at what point do you/Todd/others wake someone up?
> How many replicas out of sync, by how much? What is the major concern
> point, vs 'meh, it'll catch up soon'? I know it's likely different
> between different environments, but as I'm new to this, I'd love to
> know how others see things.
>
> Thanks!
>
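
One way the qualitative scale asked about in the quoted message could be sketched, assuming the largest replica fetcher lag (in messages) has already been read from the FetcherLagMetrics values. The cutoffs are placeholders chosen only to reflect the "few thousand" versus "a billion" contrast in the question; they are not recommendations from anyone on this thread, and `LAG_LEVELS`, `classify_lag`, and `should_page` are hypothetical names.

```python
# Hypothetical upper bounds in messages; tune these per environment.
LAG_LEVELS = [
    (10_000, "ok"),
    (1_000_000, "a little behind"),
    (100_000_000, "behind"),
    (1_000_000_000, "really behind"),
]

WORST_LABEL = "page someone"


def classify_lag(max_lag_messages):
    """Map the worst replica lag (in messages) to a qualitative label."""
    for upper_bound, label in LAG_LEVELS:
        if max_lag_messages < upper_bound:
            return label
    return WORST_LABEL


def should_page(max_lag_messages):
    """Only the worst level wakes someone up; everything else is a dashboard state."""
    return classify_lag(max_lag_messages) == WORST_LABEL
```

With these placeholder cutoffs, a replica a few thousand messages behind stays at "ok", while a lag of a billion messages or more is the only state that pages.
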