Runtian Liu created CASSANDRA-20993:
---------------------------------------
Summary: Unnecessary write timeouts during node replacement due to
conservative pending node accounting in consistency level calculations
Key: CASSANDRA-20993
URL: https://issues.apache.org/jira/browse/CASSANDRA-20993
Project: Apache Cassandra
Issue Type: Improvement
Components: Consistency/Bootstrap and Decommission
Reporter: Runtian Liu
h3. Background
Cassandra includes pending nodes in the write quorum calculation to ensure
consistency level guarantees are maintained during topology changes (e.g.,
bootstrap, node replacement). This is implemented in
ConsistencyLevel.blockForWrite() where pending replicas are added to the
required blockFor count.
h3. Current Behavior
When a node is bootstrapping or being added to the cluster, all pending
replicas are unconditionally added to the blockFor requirement, regardless of
whether they are replacement nodes or new capacity additions.
Example with RF=3, CL=LOCAL_QUORUM:
* New node bootstrap (increasing capacity): blockFor = 2 (quorum) → 3 with
the pending replica
* Node replacement (old node being replaced): blockFor = 2 (quorum) → 3 with
the pending replica
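
The arithmetic above can be sketched as follows. This is a simplified model of the accounting described in this ticket, not the actual ConsistencyLevel.blockForWrite() source; the class name and method signatures are hypothetical.

```java
// Simplified model of the current accounting; NOT the actual Cassandra source.
// Class and method names are hypothetical and exist only for illustration.
final class CurrentBlockFor {
    /** Quorum of the natural replicas: floor(rf / 2) + 1. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** Current behavior as described above: every pending replica adds one required ACK. */
    static int blockForWrite(int replicationFactor, int pendingReplicas) {
        return quorum(replicationFactor) + pendingReplicas;
    }

    public static void main(String[] args) {
        System.out.println(blockForWrite(3, 0)); // no topology change: prints 2
        System.out.println(blockForWrite(3, 1)); // bootstrap or replacement: prints 3
    }
}
```

In this model the caller cannot distinguish a bootstrap from a replacement, which mirrors the behavior the ticket describes.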
h3. Problem Statement
Consider a node replacement scenario where a pending node is joining to replace
an existing node.
Current situation:
* Natural replicas: A, B, C (C is being replaced)
* Pending replica: D (replacement for C)
* Live replicas: A, B, D (3 live nodes after C goes down)
* Required ACKs: 3 (base quorum of 2 + 1 pending)
The issue:
# Although we have 3 live replicas capable of responding, the blockFor
requirement of 3 is artificially inflated
# The pending replica (D) is typically busy during bootstrap and may respond
slowly
# Even though we've received ACKs from A and B (satisfying a quorum of natural
replicas), the write blocks waiting for an ACK from the slow pending replica D
# This unnecessary dependency on the replacement pending replica's
responsiveness causes write timeouts during node replacement operations
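
A minimal sketch of why the write times out, assuming made-up ACK latencies for A, B, and D and an illustrative 2000 ms timeout. The class and method are hypothetical and only model "did blockFor replicas answer in time"; they are not Cassandra code.

```java
import java.util.Arrays;

// Illustrative model only: latencies and the timeout are made-up numbers,
// and this class is hypothetical, not part of Cassandra.
final class ReplacementTimeoutSketch {
    /** Returns true if at least blockFor replicas acknowledge within timeoutMillis. */
    static boolean writeSucceeds(long[] ackLatenciesMillis, int blockFor, long timeoutMillis) {
        if (blockFor <= 0) return true;
        if (blockFor > ackLatenciesMillis.length) return false;
        long[] sorted = ackLatenciesMillis.clone();
        Arrays.sort(sorted);
        // The write completes when the blockFor-th fastest ACK arrives.
        return sorted[blockFor - 1] <= timeoutMillis;
    }

    public static void main(String[] args) {
        long[] latencies = {5, 8, 2500}; // A, B, D (D is busy bootstrapping)
        long timeout = 2000;
        System.out.println(writeSucceeds(latencies, 3, timeout)); // false: must wait for D
        System.out.println(writeSucceeds(latencies, 2, timeout)); // true: A and B suffice
    }
}
```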
h3. Root Cause
As the original comment in CASSANDRA-833
(https://issues.apache.org/jira/browse/CASSANDRA-833) put it:
```
we want to satisfy CL for both the pre- and post-bootstrap nodes (in case
bootstrap aborts). This requires treating the old/new range owner as a unit:
both D *and* C need to accept the write for it to count towards CL. So rather
than considering
{A, B, C, D}
we should consider
{A, B, (C, D)}
This is a lot of complexity to introduce.
```
The current implementation is a "simplification" of this idea.
The current implementation conflates natural and pending replicas into a single
blockFor calculation:
* blockFor = quorum(natural replicas) + all pending replicas
However, the two replica types serve different purposes:
* Natural replicas: The authoritative owners of the data (by replication
strategy)
* Pending replicas: Temporary nodes either adding capacity (new bootstrap) or
replacing an existing node (node replacement)
The current approach treats all pending nodes identically, but they should be
handled differently based on their topology role.
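
The unit grouping from the CASSANDRA-833 comment can be sketched as below. This is a hypothetical illustration of the {A, B, (C, D)} idea, not an existing Cassandra data structure: a unit counts towards CL only when every member of the unit has acknowledged the write.

```java
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the {A, B, (C, D)} grouping; not Cassandra code.
final class ReplicaUnitSketch {
    /** Counts the units whose members have ALL acknowledged the write. */
    static int satisfiedUnits(List<Set<String>> units, Set<String> acked) {
        int count = 0;
        for (Set<String> unit : units) {
            if (acked.containsAll(unit)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Set<String>> units = List.of(Set.of("A"), Set.of("B"), Set.of("C", "D"));
        // C is down during replacement, so the (C, D) unit can never be satisfied:
        System.out.println(satisfiedUnits(units, Set.of("A", "B", "D"))); // 2
        // If C were alive and both C and D acknowledged, the unit would count:
        System.out.println(satisfiedUnits(units, Set.of("A", "C", "D"))); // 2
    }
}
```

With C down, an ACK from D adds nothing to the unit count in this model, so only A and B can contribute towards CL.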
h3. Proposed Solution
Decouple the blockFor calculation into separate requirements for natural and
pending replicas.
For normal operations and new bootstrap (keep the current behavior, as the
implementation might be complex and can be done later):
* blockFor = quorum(natural replicas) + all pending replicas
* This ensures: a quorum of natural replicas respond, PLUS all pending nodes
respond
* Protects against topology change cancellation
{color:#de350b}For node replacement:{color}
* {color:#de350b}blockFor = quorum(natural replicas) only{color}
* {color:#de350b}This ensures: only a quorum of natural replicas need to
respond{color}
* {color:#de350b}The pending replacement node responds when available but is
not required{color}
* {color:#de350b}Eliminates the unnecessary dependency on a busy pending
replica during replacement{color}
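
One way the proposed rule could look, as a sketch only. The PendingRole enum and the method signature are assumptions made for this example; how Cassandra would actually identify a replacement pending replica is not specified in this ticket.

```java
import java.util.List;

// Sketch of the proposed accounting. PendingRole and this class are
// hypothetical; they are not existing Cassandra types.
final class ProposedBlockFor {
    enum PendingRole { NEW_BOOTSTRAP, REPLACEMENT }

    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** Pending replacement nodes do not raise blockFor; other pending nodes still do. */
    static int blockForWrite(int replicationFactor, List<PendingRole> pending) {
        int required = quorum(replicationFactor);
        for (PendingRole role : pending) {
            if (role != PendingRole.REPLACEMENT) {
                required++;
            }
        }
        return required;
    }

    public static void main(String[] args) {
        System.out.println(blockForWrite(3, List.of(PendingRole.NEW_BOOTSTRAP))); // 3
        System.out.println(blockForWrite(3, List.of(PendingRole.REPLACEMENT)));   // 2
    }
}
```

The write would still be sent to the replacement node in this model; only the number of ACKs the coordinator waits for changes.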
h3. Implementation
For a normal node bootstrap, we will keep blocking as we do now, because it is
too complicated to determine the (new owner, old owner) unit.
For node replacements, the old node is down, so the (new node + old node) unit
cannot acknowledge the write anyway. We will block only for the consistency
level over the natural replicas.
The implementation should not be complicated, as the node replacement
relationship is clear.