[ 
https://issues.apache.org/jira/browse/CASSANDRA-20993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18040724#comment-18040724
 ] 

Runtian Liu commented on CASSANDRA-20993:
-----------------------------------------

4.1 PR: https://github.com/apache/cassandra/pull/4494

> Unnecessary write timeouts during node replacement due to conservative 
> pending node accounting in consistency level calculations
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-20993
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20993
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Bootstrap and Decommission
>            Reporter: Runtian Liu
>            Priority: Normal
>
> h3. Background
> Cassandra includes pending nodes in the write quorum calculation to ensure 
> consistency level guarantees are maintained during topology changes (e.g., 
> bootstrap, node replacement). This is implemented in 
> ConsistencyLevel.blockForWrite() where pending replicas are added to the 
> required blockFor count.
> h3. Current Behavior
> When a node is bootstrapping or being added to the cluster, all pending 
> replicas are unconditionally added to the blockFor requirement, regardless of 
> whether they are replacement nodes or new capacity additions.Example with 
> RF=3, CL=LOCAL_QUORUM: * New node bootstrap (increasing capacity): blockFor = 
> 2 (quorum) → with pending = 3 
>  * Node replacement (old node being replaced): blockFor = 2 → with pending = 
> 3 
> h3. Problem Statement
> During a node replacement scenario where a pending node is joining to replace 
> an existing node:Current situation: * Natural replicas: A, B, C (C is being 
> replaced)
>  * Pending replica: D (replacement for C)
>  * Live replicas: A, B, D (3 live nodes after C goes down)
>  * Required ACKs: 3 (base quorum of 2 + 1 pending)
> The issue: # Although we have 3 live replicas capable of responding, the 
> blockFor requirement of 3 is artificially inflated
>  # The pending replica (D) is typically busy during bootstrap and may respond 
> slowly
>  # Even though we've received ACKs from A and B (satisfying quorum of natural 
> replicas), the write blocks waiting for an ACK from the slow pending replica D
>  # This unnecessary dependency on the replacement pending replica's 
> responsiveness causes write timeouts during node replacement operations
> h3. Root Cause
> As the original comment in the ticket 
> https://issues.apache.org/jira/browse/CASSANDRA-833 mentioned:
> {quote}we want to satisfy CL for both the pre- and post-bootstrap nodes (in 
> case bootstrap aborts). This requires treating the old/new range owner as a 
> unit: both D *and* C need to accept the write for it to count towards CL. So 
> rather than considering
> Unknown macro: \{A, B, C, D}
> we should consider
> Unknown macro: \{A, B, (C, D)}
> This is a lot of complexity to introduce.
> {quote}
> the current implementation is a "simplification" idea.
> The current implementation conflates natural and pending replicas into a 
> single blockFor calculation: * blockFor = quorum(all replicas including 
> pending)
> However, the two replica types serve different purposes:
>       * Natural replicas: The authoritative owners of the data (by 
> replication strategy)
>  * Pending replicas: Temporary nodes either adding capacity (new bootstrap) 
> or replacing an existing node (node replacement)
> The current approach treats all pending nodes identically, but they should be 
> handled differently based on their topology role.
> h3. Proposed Solution
> For quorum consistency level:
> Decouple blockFor calculation into separate requirements for natural and 
> pending replicas:For normal operations and new bootstrap (Keep current 
> behavior as the implementation might be complex and we can do later): * 
> blockFor = quorum(natural replicas) + all pending replicas
>  * This ensures: quorum of natural replicas respond, PLUS any pending nodes 
> respond
>  * Protects against topology change cancellation
> {color:#de350b}For node replacement:{color} * {color:#de350b}blockFor = 
> quorum(natural replicas) only{color}
>  * {color:#de350b}This ensures: only quorum of natural replicas need to 
> respond{color}
>  * {color:#de350b}Pending replacement node responds when available but is not 
> required{color}
>  * {color:#de350b}Eliminates unnecessary dependency on busy pending replica 
> during replacement{color}
> h3. Implementation
> For normal node bootstrap, we will block as what we are doing now as it is 
> too complicate to determine the unit of a (new owner and old owner).
> For node replacements, as the old node is down, the (new node + old node) 
> unit cannot response the write anyway. We will just block for the consistency 
> level(natural replicas).
> Only natural replicas' responses will be counted. Note, the response from the 
> old node should not be counted as the old node should be shut down. If 
> somehow it was able to ack writes, we should still ignore it.
> The implementation should not be complicated as the node replacement 
> relationship is clear.
>  
> As this is changing the behavior of the writes when there are pending nodes, 
> the change will be added with a new config and by default the feature will be 
> disabled.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to