Hello BookKeeper Community,

I am looking to add an alert on my BookKeeper cluster's metrics so that I know when the number of underreplicated ledgers goes above 0. (If there is already a good way to do that, I'm open to suggestions. I don't want to run any commands against the cluster, though. I'd like to alert on real-time metrics.)

I looked at using the numUnderReplicatedLedger metric in the Auditor class, as this is exactly the value that I'd like to use to trigger alerts. However, this metric is implemented as an OpStatsLogger. I am using the Prometheus metrics provider. In that implementation, the OpStatsLogger is a Counter, which can only ever increase, instead of a gauge, which can increase and decrease. (The BookKeeper Stats module has an OpStatsLogger as well as a Gauge.)

Given that the number of underreplicated ledgers goes up and down, I think a gauge would more appropriately capture the nature of the underreplicated ledger metric. Further, alerting semantics are trivial when using a gauge (send me an alert if there are more than x underreplicated ledgers after y minutes). Otherwise, I'm left to alert only on a spike in the counter, which will be unreliable.

I concede that my perspective is limited to Prometheus. Is there a reason this metric makes sense as an OpStatsLogger for other metrics providers? If this metric ought to stay the same to avoid a breaking change, would it be acceptable to add an extra metric?

If a change is accepted, I'm happy to implement it. I think metrics should provide actionable insight into the current state of a BookKeeper cluster, and in this case, I think a gauge would better capture the thing being monitored: underreplicated ledgers.

Thanks!
Michael Marshall
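
P.S. To make the counter versus gauge distinction concrete, here is a minimal, self-contained sketch. It does not use the BookKeeper Stats API or the Prometheus provider: SimpleGauge, SimpleCounter, and replay are hypothetical stand-ins invented for illustration, and the audit results (3, 1, 0) are made-up values. It only shows why a threshold alert is straightforward on a gauge and not on an ever-increasing total.

```java
import java.util.concurrent.atomic.AtomicLong;

public class UnderReplicatedGaugeSketch {

    /** Hypothetical stand-in for a gauge: each read returns the current value. */
    interface SimpleGauge {
        long sample();
    }

    /** Hypothetical stand-in for a counter: the total can only grow. */
    static final class SimpleCounter {
        private final AtomicLong total = new AtomicLong();

        void add(long delta) {
            if (delta < 0) {
                throw new IllegalArgumentException("a counter cannot decrease");
            }
            total.addAndGet(delta);
        }

        long get() {
            return total.get();
        }
    }

    /**
     * Feeds a sequence of (made-up) audit results into both metric styles and
     * returns {final gauge value, final counter value}.
     */
    static long[] replay(long[] auditResults) {
        AtomicLong underReplicated = new AtomicLong();
        SimpleGauge gauge = underReplicated::get;
        SimpleCounter counter = new SimpleCounter();

        for (long found : auditResults) {
            underReplicated.set(found); // gauge tracks the current count
            counter.add(found);         // counter accumulates every observation
            System.out.println("gauge=" + gauge.sample() + " counter=" + counter.get());
        }
        return new long[] {gauge.sample(), counter.get()};
    }

    public static void main(String[] args) {
        // Three hypothetical audit runs: 3 underreplicated ledgers, then 1, then 0.
        replay(new long[] {3, 1, 0});
    }
}
```

After the last audit run the gauge reads 0, so a rule like "alert if the value is above 0 for y minutes" clears on its own. The accumulated total stays at 4 even though the cluster is healthy again, which is why I would have to alert on its rate of change instead.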