Hello Bookkeeper Community,

I am looking to add an alert on my bookkeeper's metrics so I can know when
the number of underreplicated ledgers goes above 0. (If there is already a
good way to do that, I'm open to suggestions. I don't want to run any
commands against the cluster, though; I'd like to alert on real-time
metrics.)

I looked at using the numUnderReplicatedLedger metric in the Auditor class,
as this is exactly the value that I'd like to use to trigger alerts.
However, this metric is implemented as an OpStatsLogger. I am using the
prometheus metrics provider. In that implementation, the OpStatsLogger is a
Counter, which can only ever increase, instead of a gauge, which can
increase and decrease. (The Bookkeeper Stats module has an OpStatsLogger as
well as a Gauge.) Given that the number of underreplicated ledgers goes up
and down, I think a gauge would more appropriately capture the nature of
the underreplicated ledger metric. Further, alerting semantics are trivial
when using a gauge (send me an alert if there are more than x
underreplicated ledgers for y minutes). Otherwise, all I can alert on is
a spike in the counter, which would be unreliable.
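
To make the distinction concrete, here is a minimal sketch in Java. It is
only an illustration under assumptions: SimpleGauge is a hypothetical
stand-in loosely modeled on the stats module's Gauge, and
UnderReplicationTracker is a made-up class, not the Auditor's actual code.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a gauge: it reports the current value on demand.
interface SimpleGauge<T extends Number> {
    T getDefaultValue();
    T getSample();
}

// Hypothetical tracker (not the Auditor's real implementation).
class UnderReplicationTracker {
    private final Set<Long> underReplicated = ConcurrentHashMap.newKeySet();
    // Counter-style value: only ever increases, even after ledgers recover.
    private final AtomicLong totalDetected = new AtomicLong();

    void markUnderReplicated(long ledgerId) {
        if (underReplicated.add(ledgerId)) {
            totalDetected.incrementAndGet();
        }
    }

    void markReplicated(long ledgerId) {
        underReplicated.remove(ledgerId);
    }

    long totalDetected() {
        return totalDetected.get();
    }

    // Gauge-style value: goes up and down with the current state.
    SimpleGauge<Integer> gauge() {
        return new SimpleGauge<Integer>() {
            @Override
            public Integer getDefaultValue() {
                return 0;
            }

            @Override
            public Integer getSample() {
                return underReplicated.size();
            }
        };
    }
}

class GaugeSketch {
    public static void main(String[] args) {
        UnderReplicationTracker tracker = new UnderReplicationTracker();
        SimpleGauge<Integer> gauge = tracker.gauge();

        tracker.markUnderReplicated(1L);
        tracker.markUnderReplicated(2L);
        tracker.markReplicated(1L);

        // The gauge reflects what is under-replicated right now; the
        // counter only says how many detections have ever happened.
        System.out.println("gauge=" + gauge.getSample()
                + " counter=" + tracker.totalDetected());
    }
}
```

With a value like the gauge above, the alert condition is a plain
threshold on the current sample, whereas the counter never returns to 0
once a ledger has been detected.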

I concede that my perspective is limited to Prometheus. Is there a reason
this metric makes sense as an OpStatsLogger for other metrics providers? If
this metric ought to stay the same to prevent breaking changes, would it be
acceptable to add an extra metric? If a change is accepted, I'm happy to
implement it.

I think metrics should provide actionable insight into the current state of
a bookkeeper cluster, and in this case, I think a gauge would better
capture the thing being monitored: underreplicated ledgers.

Thanks!
Michael Marshall
