[I] Enhancement: clarify `AutoClusterFailover` failover semantics and consider explicit probe-threshold APIs [pulsar]

via GitHub Mon, 16 Mar 2026 05:27:39 -0700


BewareMyPower opened a new issue, #25326:
URL: https://github.com/apache/pulsar/issues/25326


   ### Search before reporting
   
   - [x] I searched in the [issues](https://github.com/apache/pulsar/issues) 
and found nothing similar.
   
   
   ### Motivation
   
   The current `AutoClusterFailoverBuilder` exposes:
   
   - `failoverDelay(long, TimeUnit)`
   - `switchBackDelay(long, TimeUnit)`
   - `checkInterval(long, TimeUnit)`
   
   It uses periodic checks plus timestamps (`failedTimestamp` / 
`recoverTimestamp`) to decide when to switch.
   
   These configs are ambiguous when the implementation actually makes decisions 
based on periodic health probes.
   
   The current behavior depends on `checkInterval`, and the delay values are 
effectively converted into consecutive probe counts. That makes the real 
semantics harder to understand from the API alone.
   
   For example, with the current API:
   
   - `checkInterval = 400 ms`
   - `failoverDelay = 1000 ms`
   
   Users may expect failover after roughly 1000 ms of unavailability. In 
practice, the implementation only evaluates state on each probe, so the actual 
switching behavior depends on the probe cadence and on when the first failed 
observation is recorded. This becomes even harder to reason about when 
availability is unstable between checks.
   
   A concrete example:
   
   - `t0`: currently using the primary cluster
   - `t0 + 400 ms`: the probe detects the primary is unavailable, so the 
implementation records the first failed timestamp
   - `t0 + 800 ms`: the probe still sees the primary unavailable, but only 400 
ms have elapsed since the first failed observation
   - `t0 + 1200 ms`: the probe still sees the primary unavailable, but only 800 
ms have elapsed since the first failed observation
   - `t0 + 1600 ms`: the probe still sees the primary unavailable, and now the 
elapsed time since the first failed observation exceeds 1000 ms, so failover is 
attempted
   
   From an operator's point of view, the service remained unavailable for about 
1600 ms before failover was triggered, even though `failoverDelay` was 
configured as 1000 ms.
   
   An unstable-network example is also harder to explain with delay-based 
configuration:
   
   - `t0`: primary becomes unavailable
   - `t0 + 400 ms`: probe sees primary unavailable
   - `t0 + 800 ms`: probe sees primary available again
   - `t0 + 1200 ms`: probe sees primary available
   - `t0 + 1600 ms`: probe sees primary unavailable again
   
   The underlying issue is that the delay-based contract is harder to 
understand than the actual probe-based decision model.
   
   The `switchBackDelay` config has the similar issue.
   
   ### Solution
   
   Consider evolving the Java-facing `AutoClusterFailover` API to expose the 
configuration in terms of consecutive probe thresholds:
   
   For example:
   
   - `failoverThreshold`: number of consecutive failed probes required before 
switching away from the current cluster
   - `switchBackThreshold`: number of consecutive successful probes to the 
primary required before switching back from a secondary
   
   Builder APIs would become: `failoverThreshold(int threshold)` and 
`switchBackThreshold(int threshold)`
   
   This matches the real decision model used by health probes and removes the 
mental conversion from durations to probe counts.
   
   There is also precedent in Pulsar for a threshold-based failover API: 
`SameAuthParamsLookupAutoClusterFailover` already exposes `failoverThreshold` 
and `recoverThreshold`.
   
   **Note: I might introduce a new `ServiceInfoProvider` interface to replace 
the current `ServiceUrlProvider` interface, so a new API won't break anything.**
   
   ### Alternatives
   
   Keep the existing delay-based APIs and document more precisely that:
   
   - delays are evaluated only on probe boundaries
   - failover and switch back are determined by periodic observations rather 
than continuous monitoring
   - the elapsed wall-clock time seen by users can overshoot the configured 
delay depending on `checkInterval`
   
   This is workable and preserves full Java compatibility, but it still leaves 
the contract harder to understand.
   
   Another option is to add threshold-based APIs in Java and C++ while keeping 
the existing delay-based APIs for backward compatibility. That is safer for 
adoption, but it adds redundant configuration and API complexity.
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [x] I'm willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[I] Enhancement: clarify `AutoClusterFailover` failover semantics and consider explicit probe-threshold APIs [pulsar]

Reply via email to