Joel,

Thanks for your input - it fits what I was thinking, so it's good confirmation.

> The easiest mbean to look at is the underreplicated partition count.
> This is at the broker-level so it is coarse-grained. If it is > 0 you
> can use various tools to do mbean queries to figure out which
> partition is lagging behind. Another thing you can look at is the ISR
> shrink/expand rate. If you see a lot of churn you may need to tune the
> settings that affect ISR maintenance (replica.lag.time.max.ms,
> replica.lag.max.messages).

and Todd Palino said:

> Under Replicated Count is the metric that we monitor to make sure the
> cluster is healthy.

We report/alert on under-replicated partitions. What I'm trying to do
is get away from event-driven alerts to the NOC/ops people and give
them something qualitative (replication is {ok|a little
behind|behind|really behind|really really behind|oh no we're doomed})
so we know how to respond appropriately. I don't really want ops
folks getting called at 2am on a Saturday because a single replica is
behind by a few thousand messages. However, I *do* want someone
called if we're a billion messages behind.
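
To make that concrete, here is a rough sketch of the kind of bucketing I have in mind. The thresholds, the names, and the paging cutoff are all made up for illustration and are not recommendations. It assumes the worst per-replica lag (in messages) has already been collected from somewhere.

```python
from bisect import bisect_right

# Hypothetical boundaries, in messages behind. These numbers are placeholders,
# not tuned values; each environment would need its own.
BOUNDS = [1_000, 100_000, 10_000_000, 100_000_000, 1_000_000_000]
LABELS = [
    "ok",
    "a little behind",
    "behind",
    "really behind",
    "really really behind",
    "oh no we're doomed",
]


def replication_status(max_lag):
    """Map the worst replica lag (in messages) to a qualitative label."""
    if max_lag < 0:
        raise ValueError("lag cannot be negative")
    # bisect_right counts how many boundaries the lag has reached or passed.
    return LABELS[bisect_right(BOUNDS, max_lag)]


def should_page(max_lag, page_at="really really behind"):
    """Return True only when the status is at or beyond the paging level."""
    return LABELS.index(replication_status(max_lag)) >= LABELS.index(page_at)
```

With these placeholder numbers, a replica a few thousand messages behind is just "a little behind" and nobody gets called, while a billion messages behind lands in the last bucket and pages someone.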

If I look at
'KAFKA|kafka.server|FetcherLagMetrics|ReplicaFetcherThread-.*:Value',
can I use that as my measure of badness/behindness?


In a similar vein, at what point do you/Todd/others wake someone up?
How many replicas out of sync, by how much?  What is the major concern
point, vs 'meh, it'll catch up soon'?  I know it's likely different
between different environments, but as I'm new to this, I'd love to
know how others see things.

Thanks!
