To illustrate my point, I will use the "allTopicsOwnedPartitionsCount" gauge
from ZookeeperConsumerConnector as an example. It captures the number of
partitions of a topic that have been assigned an owner in the consumer group.
Let's say that I have a topic with 9 partitions. This metric should
normally report the value 9, so I can set up an alert that fires
if allTopicsOwnedPartitionsCount < 9.

Here are the drawbacks of this kind of metric.
1) If our metrics reporting/aggregation system has data loss and causes the
value to be reported as zero, we can't really distinguish whether it's a real
error or just data loss. So we can get false positives/alarms from data loss.
2) If we change the number of partitions (e.g. from 9 to 18), we need to
remember to change the alert rule to "allTopicsOwnedPartitionsCount < 18".
This kind of coupling is a maintenance nightmare.

A more explicit metric is "NoOwnerPartitionsCount". It should be zero
normally. If it is not zero, we should be alerted. This way, we won't get
false alarms from data loss, and the alert rule no longer depends on the
number of partitions.
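
As a rough sketch of the difference (plain Python, not Kafka code; the
function names and the assumption that a lost sample shows up as 0 are mine,
purely for illustration):

```python
from typing import Optional


def reported_value(sample: Optional[int]) -> int:
    """Hypothetical lossy pipeline: a lost sample (None) is reported as 0."""
    return 0 if sample is None else sample


def owned_count_alert(reported: int, expected_partitions: int) -> bool:
    """Alert rule: allTopicsOwnedPartitionsCount < expected_partitions."""
    return reported < expected_partitions


def no_owner_alert(reported: int) -> bool:
    """Alert rule: NoOwnerPartitionsCount != 0."""
    return reported != 0


# Healthy consumer group, but the sample is lost in the metrics pipeline.
print(owned_count_alert(reported_value(None), 9))  # True: false alarm
print(no_owner_alert(reported_value(None)))        # False: no false alarm

# Real problem: 2 of 9 partitions have no owner.
print(owned_count_alert(reported_value(7), 9))     # True
print(no_owner_alert(reported_value(2)))           # True

# Topic grown to 18 partitions, 5 of them unowned, threshold still at 9.
print(owned_count_alert(reported_value(13), 9))    # False: problem missed
print(no_owner_alert(reported_value(5)))           # True
```

The second rule needs no threshold, so there is nothing to update when the
partition count changes.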

We don't have to change/fix this particular example since a new consumer is
being worked on. But in the new consumer, please consider more explicit error
signals.

Thanks,
Steven
