[jira] [Created] (CASSANDRA-20993) Unnecessary write timeouts during node replacement due to conservative pending node accounting in consistency level calculations

Runtian Liu (Jira) Thu, 30 Oct 2025 12:00:06 -0700

Runtian Liu created CASSANDRA-20993:
---------------------------------------


             Summary: Unnecessary write timeouts during node replacement due to 
conservative pending node accounting in consistency level calculations
                 Key: CASSANDRA-20993
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20993
             Project: Apache Cassandra
          Issue Type: Improvement
          Components: Consistency/Bootstrap and Decommission
            Reporter: Runtian Liu


h3. Background
Cassandra includes pending nodes in the write quorum calculation to ensure 
consistency level guarantees are maintained during topology changes (e.g., 
bootstrap, node replacement). This is implemented in 
ConsistencyLevel.blockForWrite() where pending replicas are added to the 
required blockFor count.
h3. Current Behavior
When a node is bootstrapping or being added to the cluster, all pending 
replicas are unconditionally added to the blockFor requirement, regardless of 
whether they are replacement nodes or new capacity additions.Example with RF=3, 
CL=LOCAL_QUORUM: * New node bootstrap (increasing capacity): blockFor = 2 
(quorum) → with pending = 3 

 * Node replacement (old node being replaced): blockFor = 2 → with pending = 3 

h3. Problem Statement
During a node replacement scenario where a pending node is joining to replace 
an existing node:Current situation: * Natural replicas: A, B, C (C is being 
replaced)

 * Pending replica: D (replacement for C)

 * Live replicas: A, B, D (3 live nodes after C goes down)

 * Required ACKs: 3 (base quorum of 2 + 1 pending)

The issue: # Although we have 3 live replicas capable of responding, the 
blockFor requirement of 3 is artificially inflated

 # The pending replica (D) is typically busy during bootstrap and may respond 
slowly

 # Even though we've received ACKs from A and B (satisfying quorum of natural 
replicas), the write blocks waiting for an ACK from the slow pending replica D

 # This unnecessary dependency on the replacement pending replica's 
responsiveness causes write timeouts during node replacement operations

h3. Root Cause

As the original comment in the ticket 
https://issues.apache.org/jira/browse/CASSANDRA-833 mentioned: ```

we want to satisfy CL for both the pre- and post-bootstrap nodes (in case 
bootstrap aborts). This requires treating the old/new range owner as a unit: 
both D *and* C need to accept the write for it to count towards CL. So rather 
than considering

{A, B, C, D}

we should consider

{A, B, (C, D)}

This is a lot of complexity to introduce.```,

the current implementation is a "simplification" idea.
The current implementation conflates natural and pending replicas into a single 
blockFor calculation: * blockFor = quorum(all replicas including pending)

However, the two replica types serve different purposes: * Natural replicas: 
The authoritative owners of the data (by replication strategy)

 * Pending replicas: Temporary nodes either adding capacity (new bootstrap) or 
replacing an existing node (node replacement)

The current approach treats all pending nodes identically, but they should be 
handled differently based on their topology role.
h3. Proposed Solution
Decouple blockFor calculation into separate requirements for natural and 
pending replicas:For normal operations and new bootstrap (Keep current behavior 
as the implementation might be complex and we can do later): * blockFor = 
quorum(natural replicas) + all pending replicas

 * This ensures: quorum of natural replicas respond, PLUS any pending nodes 
respond

 * Protects against topology change cancellation

{color:#de350b}For node replacement:{color} * {color:#de350b}blockFor = 
quorum(natural replicas) only{color}

 * {color:#de350b}This ensures: only quorum of natural replicas need to 
respond{color}

 * {color:#de350b}Pending replacement node responds when available but is not 
required{color}
 * {color:#de350b}Eliminates unnecessary dependency on busy pending replica 
during replacement{color}

h3. Implementation

For normal node bootstrap, we will block as what we are doing now as it is too 
complicate to determine the unit of a (new owner and old owner).

For node replacements, as the old node is down, the (new node + old node) unit 
cannot response the write anyway. We will just block for the consistency 
level(natural replicas).

The implementation should not be complicated as the node replacement 
relationship is clear.
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Created] (CASSANDRA-20993) Unnecessary write timeouts during node replacement due to conservative pending node accounting in consistency level calculations

Reply via email to