codelipenghui commented on code in PR #24716:
URL: https://github.com/apache/pulsar/pull/24716#discussion_r2338081414

##########
pip/pip-441.md:
##########
@@ -0,0 +1,115 @@
+# PIP-441: Add Broker-Level Metrics for Skipped Non-Recoverable Data
+
+# Background knowledge
+
+Pulsar's `autoSkipNonRecoverableData` feature allows brokers to skip corrupted data during disaster recovery to maintain topic availability. The system uses two skip strategies:
+
+1. **Ledger-level skipping**: Skips entire ledgers when completely unrecoverable
+2. **Entry-level skipping**: Skips specific entries within a ledger when only partially corrupted
+
+The entry level skipping was introduced in PIP-327 and refined in [PR #17753](https://github.com/apache/pulsar/pull/17753) to handle scenarios like ledger corruption, bookie failures, and partial data loss.
+
+# Motivation
+
+Currently, there is no visibility into when and how frequently non-recoverable data is being skipped, creating operational challenges:
+
+- **No alerting capability** when data loss occurs
+- **No audit trail** for compliance and data integrity requirements
+- **Cannot distinguish** between healthy systems and those silently skipping data
+- **Cannot determine** if data loss is wholesale (ledgers) or partial (entries)
+- **Limited capacity planning** without understanding failure patterns
+
+# Goals
+
+## In Scope
+
+Add two broker-level metrics:
+- `pulsar_broker_non_recoverable_ledgers_skipped_total` - Count of ledgers skipped
+- `pulsar_broker_non_recoverable_entries_skipped_total` - Count of entries skipped
+
+## Out of Scope
+
+- Topic/subscription-level metrics (would burden metrics system with high cardinality)

Review Comment:
Yes, that makes sense.
