+1 (non-binding)

It's very useful -- I've left a minor comment on the implementation PR.

On 2025/09/10 18:29:00 PengHui Li wrote:
> Hi Team,
>
> This is the official VOTE thread for PIP-441: Add Broker-Level Metrics for
> Skipped Non-Recoverable Data
>
> Currently, when Pulsar's autoSkipNonRecoverableData feature skips
> corrupted data to maintain topic availability, there is no visibility into
> when and how frequently this occurs. This creates operational blind spots
> where administrators cannot be alerted when data loss happens, have no
> audit trail for compliance requirements, and cannot distinguish between
> healthy systems and those silently losing data.
>
> Without these metrics, operators cannot determine whether issues are
> systematic (entire ledgers lost) or localized (partial corruption
> scenarios).
>
> Proposed Solution: This PIP proposes adding two new broker-level metrics to
> the BrokerOperabilityMetrics class:
>
> 1. pulsar_broker_non_recoverable_ledgers_skipped_total:
>    A counter incremented in ManagedLedgerImpl.skipNonRecoverableLedger()
>    each time an entire ledger is skipped due to complete unrecoverability.
> 2. pulsar_broker_non_recoverable_entries_skipped_total:
>    A counter incremented in ManagedCursorImpl.skipNonRecoverableEntries()
>    by the number of entries skipped when only partial ledger corruption
>    occurs.
>
> The broker-level approach avoids adding a high-cardinality burden to the
> metrics system that would occur with topic-level metrics in large clusters.
> Operators can use these broker-level metrics for alerting and monitoring
> trends, then leverage existing broker logs for detailed forensic analysis
> of specific affected topics.
>
> The full proposal is available for review here:
> https://github.com/apache/pulsar/pull/24716
>
> The discussion mailing list:
> https://lists.apache.org/thread/b638towc7o4qb8dsozys4c14s00yflfj
>
> Pushed out the implementation PR:
> https://github.com/apache/pulsar/pull/24726
>
> Regards,
> Penghui

Regards,
Yike
