To illustrate my point, I will use the "allTopicsOwnedPartitionsCount" gauge from ZookeeperConsumerConnector as an example. It captures the number of partitions of a topic that have been assigned an owner in the consumer group. Let's say I have a topic with 9 partitions. This metric should normally report the value 9, so I can set up an alert that fires if allTopicsOwnedPartitionsCount < 9.
Here are the drawbacks of this kind of metric.

1) If our metrics reporting/aggregation system loses data and the value ends up reported as zero, we can't distinguish a real error from data loss, so we can get false alarms from data loss.
2) If we change the number of partitions (e.g. from 9 to 18), we need to remember to change the alert rule to "allTopicsOwnedPartitionsCount < 18". This kind of coupling is a maintenance nightmare.

A more explicit metric is "NoOwnerPartitionsCount". It should normally be zero; if it is not zero, we should be alerted. This way, we won't get false alarms from data loss, and the alert rule no longer depends on the number of partitions.

We don't have to change/fix this particular example since a new consumer is being worked on. But in the new consumer, please consider more explicit error signals.

Thanks,
Steven
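
P.S. To make the difference concrete, here is a rough Python sketch of the two alert rules. The function names, and the convention that a lost data point shows up as 0, are assumptions for illustration only; they are not part of Kafka or of any particular monitoring system.

```python
# Hypothetical alert rules; names and the "lost data reads as 0"
# convention are assumptions made for this sketch.

EXPECTED_PARTITIONS = 9  # has to be kept in sync with the topic by hand


def owned_count_alert(owned_partitions, expected=EXPECTED_PARTITIONS):
    """Fire when fewer partitions than expected have an owner."""
    return owned_partitions < expected


def no_owner_alert(no_owner_partitions):
    """Fire when any partition lacks an owner; no threshold to maintain."""
    return no_owner_partitions != 0


# Healthy group on a 9-partition topic: neither rule fires.
assert not owned_count_alert(9)
assert not no_owner_alert(0)

# Real problem, 2 partitions without an owner: both rules fire.
assert owned_count_alert(7)
assert no_owner_alert(2)

# Metrics pipeline drops the data point and it reads as 0.
assert owned_count_alert(0)    # false alarm
assert not no_owner_alert(0)   # stays quiet

# Topic grown from 9 to 18 partitions, alert rule not updated,
# 8 partitions currently without an owner (10 owned).
assert not owned_count_alert(10)  # stale threshold misses the problem
assert no_owner_alert(8)          # still fires
```

The second rule has no constant tied to the partition count, which is the kind of explicit error signal I have in mind.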