[jira] [Commented] (CASSANDRA-20993) Unnecessary write timeouts during node replacement due to conservative pending node accounting in consistency level calculations

Runtian Liu (Jira) Wed, 03 Dec 2025 09:31:09 -0800


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-20993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18042577#comment-18042577
 ]


Runtian Liu commented on CASSANDRA-20993:
-----------------------------------------

h4. 1. Notation and Model

For a given token range:
 * *Natural replicas (before replacement):* {{N_old = \{A, B, C}}}

 * *Pending replacement:* {{D}} (replacing C)

 * *Natural replicas (after replacement):* {{N_new = \{A, B, D}}}

For any successful write {{{}W{}}}:
 * Let {{W_nat}} be the set of *natural replicas* that stored {{W}} at commit 
time.

 * For a given write CL, let {{W_eff}} be the *minimum size* of {{W_nat}} 
required for success.

 * For a given read CL, let {{R_eff}} be the *minimum number* of natural 
replicas whose responses the coordinator must receive before completing the 
read.

Standard quorum theory says:
{quote}A (write CL, read CL) pair guarantees read-after-write (RYW) if
{{{}W_eff + R_eff > RF{}}}.
{quote}
This is the usual Cassandra argument, independent of pending nodes.
----
h4. 2. Behavior Change Introduced by 20993 (Replacement Only)

During node replacement:
 * Writes are still *sent* to all natural replicas and to the pending 
replacement ({{{}A, B, C?, D{}}}).

 * *Old behavior:* write {{blockFor}} is inflated to include the pending 
replica (e.g., QUORUM effectively requires {{{}2 naturals + D{}}}).

 * *New behavior (20993):*

 ** {{blockFor}} is computed *only over natural replicas* (according to the 
current ring).

 ** Responses from the node being replaced ({{{}C{}}}) are *ignored* if they 
arrive.

 ** Pending replacement {{D}} *does not count* towards {{{}blockFor{}}}, 
although it still receives the write if possible.

Crucially:
{quote}For every consistency level, the *set of natural replicas that must ACK* 
for a write to succeed is unchanged.
Only the extra requirement on the pending node D is removed.
{quote}
Reads are unchanged: they always use the natural replica set from the ring view 
that the coordinator holds ({{{}N_old{}}} before, {{N_new}} after).
----
h4. 3. General Safety for All (Write CL, Read CL) Pairs

For each write CL:
 * Define {{W_eff}} as in a normal (no-replacement) cluster:

 ** CL=ANY → {{W_eff}} may be 0 naturals (hint only).

 ** CL=ONE / LOCAL_ONE → {{{}W_eff = 1{}}}.

 ** CL=TWO → {{{}W_eff = 2{}}}.

 ** CL=THREE / ALL → {{{}W_eff = 3{}}}.

 ** CL=QUORUM / LOCAL_QUORUM → {{{}W_eff = 2{}}}.

 ** CL=EACH_QUORUM → quorum per DC (similar reasoning per DC).

Under 20993:
 * A write succeeds *iff* at least {{W_eff}} natural replicas in {{{A,B,C}}} 
ACK.

 * This is exactly the same requirement as in steady state (without 
replacement).

 * The fact that D is no longer required does *not* reduce {{W_eff}} or change 
which naturals must store the write.

For each read CL:
 * {{R_eff}} is also unchanged (reads never depended on pending replicas).

Therefore, for any (write CL, read CL) pair:
 * The RYW condition {{W_eff + R_eff > RF}} holds for *exactly the same pairs* 
as before.

 * If a pair guaranteed RYW before 20993, it still does.

 * If a pair did not guarantee RYW before (e.g., ONE + QUORUM with RF=3), it 
still does not.

The change only removes timeouts caused by waiting for the bootstrapping 
replacement node; it does not create new “successful but unsafe” writes.
----
h4. 4. Detailed Proof for QUORUM Write + QUORUM Read

We now give an explicit set-intersection proof for the most important case:
{quote}*Write CL = QUORUM (or LOCAL_QUORUM)*
*Read CL = QUORUM (or LOCAL_QUORUM)*
*RF = 3, replacement C → D*
{quote}
h5. 4.1 Invariant for QUORUM Writes During Replacement

Under 20993, for RF=3:
 * Natural set before replacement: {{{}N_old = \{A, B, C}{}}}.

 * QUORUM over {{N_old}} means: {{{}blockFor = 2{}}}.

During replacement:
 * Coordinator sends writes to {{{}A{}}}, {{{}B{}}}, {{D}} (and possibly {{C}} 
if still in ring).

 * For a QUORUM write to succeed, the coordinator must receive *2 ACKs from 
naturals* in {{{}N_old{}}}.

Implementation-wise, this is typically {{{}{A, B}{}}}. Thus we obtain:
{quote}*Invariant W_quorum:* For every successful QUORUM (or LOCAL_QUORUM) 
write during replacement, both *A and B* store the latest value.
{quote}
We do *not* rely on C or D having the write for success.
h5. 4.2 QUORUM Read Before Replacement Completes

Before the replacement is committed in the ring:
 * Natural set for reads is still {{{}N_old = \{A, B, C}{}}}.

 * A QUORUM read queries any 2 replicas from {{{}N_old{}}}.

Possible quorum sets:
 * {{{A, B}}}

 * {{{A, C}}}

 * {{{B, C}}}

By {*}Invariant W_quorum{*}, both A and B have the latest value. Therefore, 
every read quorum intersects {{{}{A,B}{}}}:
 * {{{A,B}}} → both have the write.

 * {{{A,C}}} → A has the write.

 * {{{B,C}}} → B has the write.

Hence:
{quote}Any successful QUORUM read *before* replacement completes will see the 
latest write.
{quote}
h5. 4.3 QUORUM Read After Replacement Completes

After the ring is updated:
 * Natural set becomes {{N_new = \{A, B, D}}} (C is removed).

 * A QUORUM read queries any 2 replicas from {{{}N_new{}}}.

Possible quorum sets:
 * {{{A, B}}}

 * {{{A, D}}}

 * {{{B, D}}}

Again, by {*}Invariant W_quorum{*}, both A and B have the latest write, 
regardless of D’s state. Every read quorum intersects {{{}{A,B}{}}}:
 * {{{A,B}}} → both have the write.

 * {{{A,D}}} → A has the write.

 * {{{B,D}}} → B has the write.

Thus:
{quote}Any successful QUORUM read *after* replacement completes will also see 
the latest write.
{quote}
h5. 4.4 Mixed Topology Views (Ring Convergence)

There may be a brief period where some coordinators still see {{N_old}} and 
others see {{{}N_new{}}}. For either view:
 * Quorums over {{N_old = \{A,B,C}}} intersect {{{A,B}}} (as shown above).

 * Quorums over {{N_new = \{A,B,D}}} also intersect {{{}{A,B}{}}}.

Therefore, regardless of which view a coordinator uses, a QUORUM read must 
intersect {{{A,B}}} and thus observe the latest value.
----
h4. 5. Summary
 * For all consistency levels, the *effective number of natural replicas* that 
must store a successful write ({{{}W_eff{}}}) is unchanged by this patch.

 * The *number of natural replicas* that must reply to a read ({{{}R_eff{}}}) 
is also unchanged.

 * Therefore, for any (write CL, read CL), the standard RYW condition {{W_eff + 
R_eff > RF}} holds (or not) exactly as before.

 * In particular, for {*}QUORUM write + QUORUM read{*}, we showed explicitly 
that all read quorums (before, during, and after replacement) intersect 
{{{}{A,B}{}}}, the set of nodes that definitely applied the write.

This change *improves availability* (fewer write timeouts during replacement) 
while preserving the existing read-after-write guarantees for all consistency 
level combinations.
 
 
 

> Unnecessary write timeouts during node replacement due to conservative 
> pending node accounting in consistency level calculations
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-20993
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20993
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Bootstrap and Decommission
>            Reporter: Runtian Liu
>            Priority: Normal
>
> h3. Background
> Cassandra includes pending nodes in the write quorum calculation to ensure 
> consistency level guarantees are maintained during topology changes (e.g., 
> bootstrap, node replacement). This is implemented in 
> ConsistencyLevel.blockForWrite() where pending replicas are added to the 
> required blockFor count.
> h3. Current Behavior
> When a node is bootstrapping or being added to the cluster, all pending 
> replicas are unconditionally added to the blockFor requirement, regardless of 
> whether they are replacement nodes or new capacity additions.Example with 
> RF=3, CL=LOCAL_QUORUM: * New node bootstrap (increasing capacity): blockFor = 
> 2 (quorum) → with pending = 3 
>  * Node replacement (old node being replaced): blockFor = 2 → with pending = 
> 3 
> h3. Problem Statement
> During a node replacement scenario where a pending node is joining to replace 
> an existing node:Current situation: * Natural replicas: A, B, C (C is being 
> replaced)
>  * Pending replica: D (replacement for C)
>  * Live replicas: A, B, D (3 live nodes after C goes down)
>  * Required ACKs: 3 (base quorum of 2 + 1 pending)
> The issue: # Although we have 3 live replicas capable of responding, the 
> blockFor requirement of 3 is artificially inflated
>  # The pending replica (D) is typically busy during bootstrap and may respond 
> slowly
>  # Even though we've received ACKs from A and B (satisfying quorum of natural 
> replicas), the write blocks waiting for an ACK from the slow pending replica D
>  # This unnecessary dependency on the replacement pending replica's 
> responsiveness causes write timeouts during node replacement operations
> h3. Root Cause
> As the original comment in the ticket 
> https://issues.apache.org/jira/browse/CASSANDRA-833 mentioned:
> {quote}we want to satisfy CL for both the pre- and post-bootstrap nodes (in 
> case bootstrap aborts). This requires treating the old/new range owner as a 
> unit: both D *and* C need to accept the write for it to count towards CL. So 
> rather than considering
> Unknown macro: \{A, B, C, D}
> we should consider
> Unknown macro: \{A, B, (C, D)}
> This is a lot of complexity to introduce.
> {quote}
> the current implementation is a "simplification" idea.
> The current implementation conflates natural and pending replicas into a 
> single blockFor calculation: * blockFor = quorum(all replicas including 
> pending)
> However, the two replica types serve different purposes:
>       * Natural replicas: The authoritative owners of the data (by 
> replication strategy)
>  * Pending replicas: Temporary nodes either adding capacity (new bootstrap) 
> or replacing an existing node (node replacement)
> The current approach treats all pending nodes identically, but they should be 
> handled differently based on their topology role.
> h3. Proposed Solution
> For quorum consistency level:
> Decouple blockFor calculation into separate requirements for natural and 
> pending replicas:For normal operations and new bootstrap (Keep current 
> behavior as the implementation might be complex and we can do later): * 
> blockFor = quorum(natural replicas) + all pending replicas
>  * This ensures: quorum of natural replicas respond, PLUS any pending nodes 
> respond
>  * Protects against topology change cancellation
> {color:#de350b}For node replacement:{color} * {color:#de350b}blockFor = 
> quorum(natural replicas) only{color}
>  * {color:#de350b}This ensures: only quorum of natural replicas need to 
> respond{color}
>  * {color:#de350b}Pending replacement node responds when available but is not 
> required{color}
>  * {color:#de350b}Eliminates unnecessary dependency on busy pending replica 
> during replacement{color}
> h3. Implementation
> For normal node bootstrap, we will block as what we are doing now as it is 
> too complicate to determine the unit of a (new owner and old owner).
> For node replacements, as the old node is down, the (new node + old node) 
> unit cannot response the write anyway. We will just block for the consistency 
> level(natural replicas).
> Only natural replicas' responses will be counted. Note, the response from the 
> old node should not be counted as the old node should be shut down. If 
> somehow it was able to ack writes, we should still ignore it.
> The implementation should not be complicated as the node replacement 
> relationship is clear.
>  
> As this is changing the behavior of the writes when there are pending nodes, 
> the change will be added with a new config and by default the feature will be 
> disabled.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (CASSANDRA-20993) Unnecessary write timeouts during node replacement due to conservative pending node accounting in consistency level calculations

Reply via email to