Hi all,

First, thanks to Mick for summarizing the Slack discussion and helping clarify the remaining concerns — that was very helpful.
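To make the failure mode concrete before I get to my question, here is a tiny sketch of the blockFor arithmetic under discussion. This is illustrative pseudocode of my own, not Cassandra source: the function names (`quorum`, `block_for`) and the flag are hypothetical, and I'm assuming RF=3, LOCAL_QUORUM, and a single pending replacement replica.

```python
# Hypothetical sketch of the blockFor arithmetic in CASSANDRA-20993.
# Not Cassandra source; names and structure are illustrative only.

def quorum(rf: int) -> int:
    """Classic quorum size: floor(RF / 2) + 1."""
    return rf // 2 + 1

def block_for(rf: int, pending: int, exclude_pending: bool) -> int:
    """Responses the coordinator waits for at (LOCAL_)QUORUM.

    Today the pending replica (e.g. a replacement node) is added on
    top of the quorum; the proposal stops doing that for replacements.
    """
    base = quorum(rf)
    return base if exclude_pending else base + pending

# RF=3, one pending replacement node:
print(block_for(3, 1, exclude_pending=False))  # 3 -> current behavior
print(block_for(3, 1, exclude_pending=True))   # 2 -> proposed behavior
```

The client asked for LOCAL_QUORUM (two acks at RF=3), but today's behavior makes the write block on three, including the bootstrapping replacement; that is the timeout source the proposal removes.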
One point I wanted to follow up on: I understand that for trunk there were mentions of *other potential changes* that might also address issues observed during node replacement. I want to make sure I’m not missing anything important here.

So far, I haven’t seen a concrete alternative that directly addresses the *LOCAL_QUORUM write timeouts / Unavailable errors* caused by inflating blockFor with pending replacements. The current proposal fixes this specific failure mode in a targeted way, without changing semantics or weakening correctness guarantees. If there is a specific alternative approach being considered for trunk (e.g., a different invariant or coordinator-side rule), I’d really appreciate it being spelled out explicitly so we can evaluate the two side by side. Absent a concrete alternative that solves the same problem, my preference would be to move forward with the current change.

At this point, it would be very helpful to get *more eyes on the ticket and the PR* so we can converge on a decision:

- JIRA: https://issues.apache.org/jira/browse/CASSANDRA-20993
- PR (4.1): https://github.com/apache/cassandra/pull/4494

Any reviews or directional feedback—especially on trunk vs. feature-flag handling—would be greatly appreciated.

Thanks,
Runtian

On Sun, Dec 21, 2025 at 5:12 AM Mick <[email protected]> wrote:

> FYI there's a healthy Slack thread discussing this, found here:
> https://the-asf.slack.com/archives/CK23JSY2K/p1762834946972609
>
> From that, the concerns left (iiuc) are:
> - cases where a replacing node is removed and we return the original
>   being-replaced node to the cluster,
> - cases where multiple nodes are replacing and gossip leaves coordinators
>   seeing different states of some, or no, nodes as JOINING/NORMAL.
>
> Given these concerns, and the possibility that operators are doing things
> in unexpected ways, the feature flag is warranted on non-trunk branches.
>
> But does trunk need the flag?
> It sounds like neither concern exists in trunk, but there was a desire to
> do it with more changes in trunk, which I'm not grokking…?
>
> > On 3 Dec 2025, at 18:32, Runtian Liu <[email protected]> wrote:
> >
> > Hi all,
> >
> > Just bumping this thread in case it was missed the first time.
> >
> > I’ve updated CASSANDRA-20993 with a detailed Correctness / Safety
> > section that explains why excluding the pending replacement node from
> > blockFor during node replacement does not weaken read-after-write
> > guarantees for any combination of write CL and read CL. The key point
> > is that the effective number of natural replicas that must acknowledge
> > a write (and be consulted for a read) is unchanged; we only stop
> > inflating blockFor with the pending replacement.
> >
> > For example, in the common RF=3, QUORUM write + QUORUM read case, the
> > proof shows that during a C → D replacement:
> >
> > • Every successful QUORUM write is still guaranteed to be stored on a
> >   quorum of naturals (e.g., A and B), and
> > • Every QUORUM read—both before and after the replacement
> >   completes—must intersect {A, B}, so it always sees the latest value.
> >
> > The more general argument in the ticket covers all CL pairs and shows
> > that the standard condition W_eff + R_eff > RF holds (or not) exactly
> > as before; the change only removes unnecessary write timeouts when the
> > pending replacement is slow.
> >
> > If you have concerns about the correctness argument, or think there
> > are corner cases I’m missing (e.g., particular CL combinations or
> > topology transitions), I’d really appreciate feedback on the JIRA or
> > in this thread.
> >
> > Thanks,
> > Runtian
> >
> > On Tue, Nov 25, 2025 at 4:44 PM Runtian Liu <[email protected]> wrote:
> >
> > Hi everyone,
> >
> > I’d like to start a discussion about adjusting how Cassandra
> > calculates blockFor during node replacements.
> > The JIRA tracking this proposal is here:
> > https://issues.apache.org/jira/browse/CASSANDRA-20993
> >
> > Problem Background
> >
> > Today, during a replacement, the pending replica is always included
> > when determining the required acknowledgments. For example, with RF=3
> > and LOCAL_QUORUM, the coordinator waits for three responses instead of
> > two. Since replacement nodes are often bootstrapping and slow to
> > respond, this can result in write timeouts or increased write
> > latency—even though the client only requested acknowledgments from the
> > natural replicas.
> >
> > This behavior effectively breaks the client contract by requiring more
> > responses than the specified consistency level.
> >
> > Proposed Change
> >
> > For replacement scenarios only, exclude pending replicas from blockFor
> > and require acknowledgments solely from natural replicas. Pending
> > nodes will still receive writes, but their responses will not count
> > toward satisfying the consistency level.
> >
> > Responses from the node being replaced would also be ignored. Although
> > it is uncommon for a replaced node to become reachable again, adding
> > this safeguard avoids ambiguity and ensures correctness if that
> > situation occurs.
> >
> > This change would be disabled by default and controlled via a feature
> > flag to avoid affecting existing deployments.
> >
> > In my view, this behavior is effectively a bug because the coordinator
> > waits for more acknowledgments than the client requested, leading to
> > avoidable failures or latency. Since the issue affects correctness
> > from the client perspective rather than introducing new semantics, it
> > would be valuable to include this fix in the 4.x branches as well,
> > with the behavior disabled by default where needed.
> > Motivation
> >
> > This change:
> >
> > • Prevents unnecessary write timeouts during replacements
> > • Reduces write latency by eliminating dependence on a busy pending
> >   replica
> > • Aligns server behavior with client expectations
> >
> > Current Status
> >
> > A PR for 4.1 is available here for review:
> > https://github.com/apache/cassandra/pull/4494
> >
> > Feedback is welcome on both the implementation and the approach.
> >
> > Next Steps
> >
> > I’d appreciate input on:
> >
> > • Any correctness concerns for replacement scenarios
> > • Whether a feature-flagged approach is acceptable
> >
> > Thanks in advance for your feedback,
> > Runtian
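P.S. For anyone skimming the RF=3, QUORUM write + QUORUM read example quoted above: the intersection claim can be brute-force checked in a few lines. This is a toy sketch of my own (not Cassandra code), assuming naturals {A, B, C} during a C → D replacement and quorum size floor(RF/2) + 1 = 2.

```python
from itertools import combinations

naturals = {"A", "B", "C"}   # RF=3 natural replicas during a C -> D replacement
q = len(naturals) // 2 + 1   # QUORUM = floor(RF / 2) + 1 = 2

# With the pending replacement excluded from blockFor, every successful
# QUORUM write is acked by some 2-subset of the naturals, and every QUORUM
# read consults some 2-subset. Any two 2-subsets of a 3-set overlap, so a
# quorum read always sees the latest quorum write.
for write_quorum in combinations(sorted(naturals), q):
    for read_quorum in combinations(sorted(naturals), q):
        assert set(write_quorum) & set(read_quorum), "quorums must intersect"

print("all", sum(1 for _ in combinations(naturals, q)) ** 2, "pairs intersect")
# -> all 9 pairs intersect
```

The same exhaustive check generalizes to the other CL pairs in the ticket: the condition W_eff + R_eff > RF is what forces the overlap, and the proposal leaves W_eff and R_eff over the naturals unchanged.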
