Denovo1998 commented on code in PR #24716:
URL: https://github.com/apache/pulsar/pull/24716#discussion_r2352615620
##########
pip/pip-441.md:
##########
@@ -0,0 +1,115 @@
+# PIP-441: Add Broker-Level Metrics for Skipped Non-Recoverable Data
+
+# Background knowledge
+
+Pulsar's `autoSkipNonRecoverableData` feature allows brokers to skip corrupted data during disaster recovery to maintain topic availability. The system uses two skip strategies:
+
+1. **Ledger-level skipping**: Skips entire ledgers when completely unrecoverable
+2. **Entry-level skipping**: Skips specific entries within a ledger when only partially corrupted
+
+The entry level skipping was introduced in PIP-327 and refined in [PR #17753](https://github.com/apache/pulsar/pull/17753) to handle scenarios like ledger corruption, bookie failures, and partial data loss.
+
+# Motivation
+
+Currently, there is no visibility into when and how frequently non-recoverable data is being skipped, creating operational challenges:
+
+- **No alerting capability** when data loss occurs
+- **No audit trail** for compliance and data integrity requirements
+- **Cannot distinguish** between healthy systems and those silently skipping data
+- **Cannot determine** if data loss is wholesale (ledgers) or partial (entries)
+- **Limited capacity planning** without understanding failure patterns
+
+# Goals
+
+## In Scope
+
+Add two broker-level metrics:
+- `pulsar_broker_non_recoverable_ledgers_skipped_total` - Count of ledgers skipped
+- `pulsar_broker_non_recoverable_entries_skipped_total` - Count of entries skipped
+
+## Out of Scope
+
+- Topic/subscription-level metrics (would burden metrics system with high cardinality)
+- Historical tracking of specific ledgers/entries skipped
+- Changes to existing `autoSkipNonRecoverableData` functionality
+
+# High Level Design
+
+Two broker-level counters will be added to `BrokerOperabilityMetrics`:
+
+- **Ledger counter**: Incremented in `ManagedLedgerImpl.skipNonRecoverableLedger()`
+- **Entry counter**: Incremented in `ManagedCursorImpl.skipNonRecoverableEntries()`
+
+Both metrics are exposed via the existing Prometheus `/metrics` endpoint.
+
+# Detailed Design
+
+## Implementation Details
+
+**BrokerOperabilityMetrics Changes:**
+```java
+private final LongAdder nonRecoverableLedgersSkippedCount;
+private final LongAdder nonRecoverableEntriesSkippedCount;
+
+public void recordNonRecoverableLedgerSkipped() {
+    this.nonRecoverableLedgersSkippedCount.increment();
+}
+
+public void recordNonRecoverableEntriesSkipped(long entriesCount) {
+    this.nonRecoverableEntriesSkippedCount.add(entriesCount);
+}
+```
+
+**Integration Points:**
+- `ManagedLedgerImpl.skipNonRecoverableLedger()` → calls `recordNonRecoverableLedgerSkipped()`
+- `ManagedCursorImpl.skipNonRecoverableEntries()` → calls `recordNonRecoverableEntriesSkipped(count)`
+
+**OpenTelemetry Support:**
+- `pulsar.broker.non_recoverable_ledger.skip.count`
+- `pulsar.broker.non_recoverable_entries.skip.count`
+
+## Public-facing Changes
+
+### Metrics
+
+| Metric Name | Description | Type |
+|-------------|-------------|------|
+| `pulsar_broker_non_recoverable_ledgers_skipped_total` | Count of ledgers skipped when `autoSkipNonRecoverableData` enabled | Counter |
+| `pulsar_broker_non_recoverable_entries_skipped_total` | Count of entries skipped when `autoSkipNonRecoverableData` enabled | Counter |
+
+**Labels:** `broker`, `cluster`
+
+# Monitoring
+
+**Use Cases:**
+- **Alerting**: Get notified when data loss occurs
+- **SLA Monitoring**: Track data durability metrics
+- **Root Cause Analysis**: Compare metrics to understand if issues are systematic (ledger-level) or localized (entry-level)
+- **Investigation**: Use metrics for alerting, then check broker logs for specific topic details

Review Comment:
   Broker logs can pinpoint which specific topic is having an issue. Is this log currently recorded in the code? If so, where is it located?


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
