[
https://issues.apache.org/jira/browse/CASSANDRA-20993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18040724#comment-18040724
]
Runtian Liu commented on CASSANDRA-20993:
-----------------------------------------
4.1 PR: https://github.com/apache/cassandra/pull/4494
> Unnecessary write timeouts during node replacement due to conservative
> pending node accounting in consistency level calculations
> --------------------------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-20993
> URL: https://issues.apache.org/jira/browse/CASSANDRA-20993
> Project: Apache Cassandra
> Issue Type: Improvement
> Components: Consistency/Bootstrap and Decommission
> Reporter: Runtian Liu
> Priority: Normal
>
> h3. Background
> Cassandra includes pending nodes in the write quorum calculation to ensure
> consistency level guarantees are maintained during topology changes (e.g.,
> bootstrap, node replacement). This is implemented in
> ConsistencyLevel.blockForWrite() where pending replicas are added to the
> required blockFor count.
> h3. Current Behavior
> When a node is bootstrapping or being added to the cluster, all pending
> replicas are unconditionally added to the blockFor requirement, regardless of
> whether they are replacement nodes or new capacity additions.Example with
> RF=3, CL=LOCAL_QUORUM: * New node bootstrap (increasing capacity): blockFor =
> 2 (quorum) → with pending = 3
> * Node replacement (old node being replaced): blockFor = 2 → with pending =
> 3
> h3. Problem Statement
> During a node replacement scenario where a pending node is joining to replace
> an existing node:Current situation: * Natural replicas: A, B, C (C is being
> replaced)
> * Pending replica: D (replacement for C)
> * Live replicas: A, B, D (3 live nodes after C goes down)
> * Required ACKs: 3 (base quorum of 2 + 1 pending)
> The issue: # Although we have 3 live replicas capable of responding, the
> blockFor requirement of 3 is artificially inflated
> # The pending replica (D) is typically busy during bootstrap and may respond
> slowly
> # Even though we've received ACKs from A and B (satisfying quorum of natural
> replicas), the write blocks waiting for an ACK from the slow pending replica D
> # This unnecessary dependency on the replacement pending replica's
> responsiveness causes write timeouts during node replacement operations
> h3. Root Cause
> As the original comment in the ticket
> https://issues.apache.org/jira/browse/CASSANDRA-833 mentioned:
> {quote}we want to satisfy CL for both the pre- and post-bootstrap nodes (in
> case bootstrap aborts). This requires treating the old/new range owner as a
> unit: both D *and* C need to accept the write for it to count towards CL. So
> rather than considering
> Unknown macro: \{A, B, C, D}
> we should consider
> Unknown macro: \{A, B, (C, D)}
> This is a lot of complexity to introduce.
> {quote}
> the current implementation is a "simplification" idea.
> The current implementation conflates natural and pending replicas into a
> single blockFor calculation: * blockFor = quorum(all replicas including
> pending)
> However, the two replica types serve different purposes:
> * Natural replicas: The authoritative owners of the data (by
> replication strategy)
> * Pending replicas: Temporary nodes either adding capacity (new bootstrap)
> or replacing an existing node (node replacement)
> The current approach treats all pending nodes identically, but they should be
> handled differently based on their topology role.
> h3. Proposed Solution
> For quorum consistency level:
> Decouple blockFor calculation into separate requirements for natural and
> pending replicas:For normal operations and new bootstrap (Keep current
> behavior as the implementation might be complex and we can do later): *
> blockFor = quorum(natural replicas) + all pending replicas
> * This ensures: quorum of natural replicas respond, PLUS any pending nodes
> respond
> * Protects against topology change cancellation
> {color:#de350b}For node replacement:{color} * {color:#de350b}blockFor =
> quorum(natural replicas) only{color}
> * {color:#de350b}This ensures: only quorum of natural replicas need to
> respond{color}
> * {color:#de350b}Pending replacement node responds when available but is not
> required{color}
> * {color:#de350b}Eliminates unnecessary dependency on busy pending replica
> during replacement{color}
> h3. Implementation
> For normal node bootstrap, we will block as what we are doing now as it is
> too complicate to determine the unit of a (new owner and old owner).
> For node replacements, as the old node is down, the (new node + old node)
> unit cannot response the write anyway. We will just block for the consistency
> level(natural replicas).
> Only natural replicas' responses will be counted. Note, the response from the
> old node should not be counted as the old node should be shut down. If
> somehow it was able to ack writes, we should still ignore it.
> The implementation should not be complicated as the node replacement
> relationship is clear.
>
> As this is changing the behavior of the writes when there are pending nodes,
> the change will be added with a new config and by default the feature will be
> disabled.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]