BewareMyPower opened a new issue, #25326: URL: https://github.com/apache/pulsar/issues/25326
### Search before reporting - [x] I searched in the [issues](https://github.com/apache/pulsar/issues) and found nothing similar. ### Motivation The current `AutoClusterFailoverBuilder` exposes: - `failoverDelay(long, TimeUnit)` - `switchBackDelay(long, TimeUnit)` - `checkInterval(long, TimeUnit)` It uses periodic checks plus timestamps (`failedTimestamp` / `recoverTimestamp`) to decide when to switch. These configs are ambiguous when the implementation actually makes decisions based on periodic health probes. The current behavior depends on `checkInterval`, and the delay values are effectively converted into consecutive probe counts. That makes the real semantics harder to understand from the API alone. For example, with the current API: - `checkInterval = 400 ms` - `failoverDelay = 1000 ms` Users may expect failover after roughly 1000 ms of unavailability. In practice, the implementation only evaluates state on each probe, so the actual switching behavior depends on the probe cadence and on when the first failed observation is recorded. This becomes even harder to reason about when availability is unstable between checks. A concrete example: - `t0`: currently using the primary cluster - `t0 + 400 ms`: the probe detects the primary is unavailable, so the implementation records the first failed timestamp - `t0 + 800 ms`: the probe still sees the primary unavailable, but only 400 ms have elapsed since the first failed observation - `t0 + 1200 ms`: the probe still sees the primary unavailable, but only 800 ms have elapsed since the first failed observation - `t0 + 1600 ms`: the probe still sees the primary unavailable, and now the elapsed time since the first failed observation exceeds 1000 ms, so failover is attempted From an operator's point of view, the service remained unavailable for about 1600 ms before failover was triggered, even though `failoverDelay` was configured as 1000 ms. An unstable-network example is also harder to explain with delay-based configuration: - `t0`: primary becomes unavailable - `t0 + 400 ms`: probe sees primary unavailable - `t0 + 800 ms`: probe sees primary available again - `t0 + 1200 ms`: probe sees primary available - `t0 + 1600 ms`: probe sees primary unavailable again The underlying issue is that the delay-based contract is harder to understand than the actual probe-based decision model. The `switchBackDelay` config has the similar issue. ### Solution Consider evolving the Java-facing `AutoClusterFailover` API to expose the configuration in terms of consecutive probe thresholds: For example: - `failoverThreshold`: number of consecutive failed probes required before switching away from the current cluster - `switchBackThreshold`: number of consecutive successful probes to the primary required before switching back from a secondary Builder APIs would become: `failoverThreshold(int threshold)` and `switchBackThreshold(int threshold)` This matches the real decision model used by health probes and removes the mental conversion from durations to probe counts. There is also precedent in Pulsar for a threshold-based failover API: `SameAuthParamsLookupAutoClusterFailover` already exposes `failoverThreshold` and `recoverThreshold`. **Note: I might introduce a new `ServiceInfoProvider` interface to replace the current `ServiceUrlProvider` interface, so a new API won't break anything.** ### Alternatives Keep the existing delay-based APIs and document more precisely that: - delays are evaluated only on probe boundaries - failover and switch back are determined by periodic observations rather than continuous monitoring - the elapsed wall-clock time seen by users can overshoot the configured delay depending on `checkInterval` This is workable and preserves full Java compatibility, but it still leaves the contract harder to understand. Another option is to add threshold-based APIs in Java and C++ while keeping the existing delay-based APIs for backward compatibility. That is safer for adoption, but it adds redundant configuration and API complexity. ### Anything else? _No response_ ### Are you willing to submit a PR? - [x] I'm willing to submit a PR! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
