Hi all,

First, thanks to Mick for summarizing the Slack discussion and helping clarify the remaining concerns — that was very helpful.
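To make the failure mode concrete before I get to my question, here is a tiny sketch of the blockFor arithmetic under discussion. This is illustrative pseudocode of my own, not Cassandra source: the function names (`quorum`, `block_for`) and the flag are hypothetical, and I'm assuming RF=3, LOCAL_QUORUM, and a single pending replacement replica.

```python
# Hypothetical sketch of the blockFor arithmetic in CASSANDRA-20993.
# Not Cassandra source; names and structure are illustrative only.

def quorum(rf: int) -> int:
    """Classic quorum size: floor(RF / 2) + 1."""
    return rf // 2 + 1

def block_for(rf: int, pending: int, exclude_pending: bool) -> int:
    """Responses the coordinator waits for at (LOCAL_)QUORUM.

    Today the pending replica (e.g. a replacement node) is added on
    top of the quorum; the proposal stops doing that for replacements.
    """
    base = quorum(rf)
    return base if exclude_pending else base + pending

# RF=3, one pending replacement node:
print(block_for(3, 1, exclude_pending=False))  # 3 -> current behavior
print(block_for(3, 1, exclude_pending=True))   # 2 -> proposed behavior
```

The client asked for LOCAL_QUORUM (two acks at RF=3), but today's behavior makes the write block on three, including the bootstrapping replacement; that is the timeout source the proposal removes.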
One point I wanted to follow up on: I understand that for trunk there were mentions of *other potential changes* that might also address issues observed during node replacement. I want to make sure I’m not missing anything important here.

So far, I haven’t seen a concrete alternative that directly addresses the *LOCAL_QUORUM write timeouts / Unavailable errors* caused by inflating blockFor with pending replacements. The current proposal fixes this specific failure mode in a targeted way, without changing semantics or weakening correctness guarantees. If there is a specific alternative approach being considered for trunk (e.g., a different invariant or coordinator-side rule), I’d really appreciate it being spelled out explicitly so we can evaluate the two side by side. Absent a concrete alternative that solves the same problem, my preference would be to move forward with the current change.

At this point, it would be very helpful to get *more eyes on the ticket and the PR* so we can converge on a decision:

- JIRA: https://issues.apache.org/jira/browse/CASSANDRA-20993
- PR (4.1): https://github.com/apache/cassandra/pull/4494

Any reviews or directional feedback—especially on trunk vs. feature-flag handling—would be greatly appreciated.

Thanks,
Runtian

On Sun, Dec 21, 2025 at 5:12 AM Mick <[email protected]> wrote:

> FYI there's a healthy Slack thread discussing this, found here:
> https://the-asf.slack.com/archives/CK23JSY2K/p1762834946972609
>
> From that, the concerns left (iiuc) are:
> - cases where a replacing node is removed and we return the original
>   being-replaced node to the cluster,
> - cases where multiple nodes are replacing and gossip leaves coordinators
>   seeing different states of some, or no, nodes as JOINING/NORMAL.
>
> Given these concerns, and the possibility that operators are doing things
> in unexpected ways, the feature flag is warranted on non-trunk branches.
>
> But does trunk need the flag?
> It sounds like neither concern exists in trunk, but there was a desire to
> do it with more changes in trunk, which I'm not grokking…?
>
> > On 3 Dec 2025, at 18:32, Runtian Liu <[email protected]> wrote:
> >
> > Hi all,
> >
> > Just bumping this thread in case it was missed the first time.
> >
> > I’ve updated CASSANDRA-20993 with a detailed Correctness / Safety
> > section that explains why excluding the pending replacement node from
> > blockFor during node replacement does not weaken read-after-write
> > guarantees for any combination of write CL and read CL. The key point
> > is that the effective number of natural replicas that must acknowledge
> > a write (and be consulted for a read) is unchanged; we only stop
> > inflating blockFor with the pending replacement.
> >
> > For example, in the common RF=3, QUORUM write + QUORUM read case, the
> > proof shows that during a C → D replacement:
> >
> > • Every successful QUORUM write is still guaranteed to be stored on a
> >   quorum of naturals (e.g., A and B), and
> > • Every QUORUM read—both before and after the replacement
> >   completes—must intersect {A, B}, so it always sees the latest value.
> >
> > The more general argument in the ticket covers all CL pairs and shows
> > that the standard condition W_eff + R_eff > RF holds (or not) exactly
> > as before; the change only removes unnecessary write timeouts when the
> > pending replacement is slow.
> >
> > If you have concerns about the correctness argument, or think there
> > are corner cases I’m missing (e.g., particular CL combinations or
> > topology transitions), I’d really appreciate feedback on the JIRA or
> > in this thread.
> >
> > Thanks,
> > Runtian
> >
> > On Tue, Nov 25, 2025 at 4:44 PM Runtian Liu <[email protected]> wrote:
> >
> > Hi everyone,
> >
> > I’d like to start a discussion about adjusting how Cassandra
> > calculates blockFor during node replacements.
> > The JIRA tracking this proposal is here:
> > https://issues.apache.org/jira/browse/CASSANDRA-20993
> >
> > Problem Background
> >
> > Today, during a replacement, the pending replica is always included
> > when determining the required acknowledgments. For example, with RF=3
> > and LOCAL_QUORUM, the coordinator waits for three responses instead of
> > two. Since replacement nodes are often bootstrapping and slow to
> > respond, this can result in write timeouts or increased write
> > latency—even though the client only requested acknowledgments from the
> > natural replicas.
> >
> > This behavior effectively breaks the client contract by requiring more
> > responses than the specified consistency level.
> >
> > Proposed Change
> >
> > For replacement scenarios only, exclude pending replicas from blockFor
> > and require acknowledgments solely from natural replicas. Pending
> > nodes will still receive writes, but their responses will not count
> > toward satisfying the consistency level.
> >
> > Responses from the node being replaced would also be ignored. Although
> > it is uncommon for a replaced node to become reachable again, adding
> > this safeguard avoids ambiguity and ensures correctness if that
> > situation occurs.
> >
> > This change would be disabled by default and controlled via a feature
> > flag to avoid affecting existing deployments.
> >
> > In my view, this behavior is effectively a bug because the coordinator
> > waits for more acknowledgments than the client requested, leading to
> > avoidable failures or latency. Since the issue affects correctness
> > from the client perspective rather than introducing new semantics, it
> > would be valuable to include this fix in the 4.x branches as well,
> > with the behavior disabled by default where needed.
> > Motivation
> >
> > This change:
> >
> > • Prevents unnecessary write timeouts during replacements
> > • Reduces write latency by eliminating dependence on a busy pending
> >   replica
> > • Aligns server behavior with client expectations
> >
> > Current Status
> >
> > A PR for 4.1 is available here for review:
> > https://github.com/apache/cassandra/pull/4494
> >
> > Feedback is welcome on both the implementation and the approach.
> >
> > Next Steps
> >
> > I’d appreciate input on:
> >
> > • Any correctness concerns for replacement scenarios
> > • Whether a feature-flagged approach is acceptable
> >
> > Thanks in advance for your feedback,
> > Runtian
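P.S. For anyone skimming the RF=3, QUORUM write + QUORUM read example quoted above: the intersection claim can be brute-force checked in a few lines. This is a toy sketch of my own (not Cassandra code), assuming naturals {A, B, C} during a C → D replacement and quorum size floor(RF/2) + 1 = 2.

```python
from itertools import combinations

naturals = {"A", "B", "C"}   # RF=3 natural replicas during a C -> D replacement
q = len(naturals) // 2 + 1   # QUORUM = floor(RF / 2) + 1 = 2

# With the pending replacement excluded from blockFor, every successful
# QUORUM write is acked by some 2-subset of the naturals, and every QUORUM
# read consults some 2-subset. Any two 2-subsets of a 3-set overlap, so a
# quorum read always sees the latest quorum write.
for write_quorum in combinations(sorted(naturals), q):
    for read_quorum in combinations(sorted(naturals), q):
        assert set(write_quorum) & set(read_quorum), "quorums must intersect"

print("all", sum(1 for _ in combinations(naturals, q)) ** 2, "pairs intersect")
# -> all 9 pairs intersect
```

The same exhaustive check generalizes to the other CL pairs in the ticket: the condition W_eff + R_eff > RF is what forces the overlap, and the proposal leaves W_eff and R_eff over the naturals unchanged.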
