Runtian Liu created CASSANDRA-21200:
---------------------------------------

             Summary: Replica Slot Grouping: Prevent write availability 
degradation during topology changes
                 Key: CASSANDRA-21200
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21200
             Project: Apache Cassandra
          Issue Type: New Feature
          Components: Consistency/Coordination
            Reporter: Runtian Liu
            Assignee: Runtian Liu


Summary

During topology changes (bootstrap, decommission, move, replacement), Cassandra 
adds pending replicas to the write path and inflates {{blockFor}} by 
{{pending.size()}}. This means writes must wait for acknowledgments from 
both natural and pending replicas before succeeding. If the pending replica is 
slow or unresponsive (common during streaming-heavy transitions), writes time 
out even though the natural quorum is fully available.

This issue affects both regular writes and Lightweight Transactions / Paxos V2.

Problem

Consider a 3-node cluster with RF=3 and QUORUM writes ({{blockFor = 2}}). 
When a 4th node begins bootstrapping:
 * The pending replica is added to the write set, inflating {{blockFor}} to 3
 * Writes now require 3 acknowledgments instead of 2
 * If the bootstrapping node is busy streaming and responds slowly, writes time 
out
 * The cluster effectively loses write availability during what should be a 
routine operation
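
For illustration, the current inflation can be sketched in a few lines of 
Python (the function and parameter names here are illustrative, not 
Cassandra's actual API):

```python
def block_for_current(replication_factor, pending):
    """Sketch of current behavior: the base QUORUM requirement is
    inflated by the number of pending replicas."""
    base = replication_factor // 2 + 1   # QUORUM for RF=3 -> 2 acks
    return base + len(pending)           # each pending replica adds 1

# 3-node cluster, RF=3, one node bootstrapping:
required = block_for_current(3, pending=["node4"])
# required == 3: the write now needs the full natural quorum AND the
# pending node, so a slow bootstrapping node causes write timeouts.
```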

The same issue occurs during node replacement, decommission, and range 
movement. The fundamental problem is that {{blockFor}} inflation treats the 
pending replica as an independent additional requirement, when logically it is 
_transitioning into_ an existing replica's slot.

Proposed Solution: Replica Slot Grouping

Instead of inflating {{blockFor}}, group endpoints that are transitioning 
the same logical replica position into a single "slot":
 * A stable slot contains a single endpoint (no topology change in progress). 
Requires 1 ack.
 * A transitioning slot contains two endpoints (e.g., a leaving node and its 
replacement, or a bootstrapping node and the existing owner). Requires acks 
from all members (both old and new) for the slot to count as satisfied.

{{blockFor}} stays at the base quorum level (no inflation from pending 
replicas). The quorum is counted in terms of slots rather than individual 
endpoints.

Example: In the 3-node + 1 bootstrapping scenario above:
 * 3 slots total: 2 stable (nodes that are unaffected) + 1 transitioning 
(existing owner + bootstrapping node)
 * QUORUM still requires 2 slots
 * If the bootstrapping node is slow, the 2 stable slots still respond, which 
meets the 2-slot quorum
 * If the bootstrapping node responds, the transitioning slot is satisfied, 
also meeting quorum
 * In both cases, writes succeed without timeout
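
A minimal sketch of the slot-counting rule (Python, illustrative names; a 
slot counts toward quorum only when every one of its members has 
acknowledged):

```python
def satisfied_slots(slots, acks):
    """Count slots whose members have ALL acknowledged the write.
    A stable slot has one endpoint; a transitioning slot has two."""
    return sum(1 for slot in slots if all(ep in acks for ep in slot))

# 3 slots: two stable, one transitioning (existing owner n3 + joiner n4)
slots = [("n1",), ("n2",), ("n3", "n4")]
quorum = 2  # blockFor stays at the base quorum, counted in slots

# Bootstrapping node n4 is slow: the two stable slots still meet quorum
assert satisfied_slots(slots, {"n1", "n2"}) >= quorum

# n3 and n4 both ack: the transitioning slot is satisfied as well
assert satisfied_slots(slots, {"n1", "n3", "n4"}) >= quorum
```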

Correctness properties:
 * Durability is maintained: when the transitioning slot is satisfied, writes 
reach both old and new replicas
 * Consistency is maintained: quorum intersection properties are preserved 
because slot-based quorum >= natural quorum
 * Availability is improved: writes no longer block on slow/unresponsive 
pending replicas
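
The intersection property can be checked by brute force on the example's 
three slots: any two 2-slot quorums overlap in at least one slot, and a 
satisfied slot means every endpoint in it was written, so two quorums always 
share at least one written replica (a sketch, not a proof for all 
configurations):

```python
from itertools import combinations

# Slots from the example: two stable, one transitioning
slots = [frozenset({"n1"}), frozenset({"n2"}), frozenset({"n3", "n4"})]
quorum = 2

# Every pair of 2-slot quorums drawn from 3 slots shares a slot
# (pigeonhole: 2 + 2 > 3), preserving quorum intersection.
for q1 in combinations(slots, quorum):
    for q2 in combinations(slots, quorum):
        assert set(q1) & set(q2), "two quorums failed to intersect"
```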

Implementation across branches: TCM vs pre-TCM

In trunk (with TCM), the Transactional Cluster Metadata framework already 
maintains explicit transition state for each replica — the information about 
which 
endpoints are joining, leaving, or replacing is directly available in the 
cluster metadata. The slot grouping logic can consume this transition 
information directly without any additional calculation.

In pre-TCM branches (4.x, 5.x), this transition information does not exist as a 
first-class concept. The implementation needs to derive it by analyzing 
{{TokenMetadata}} — computing which pending endpoints map to which existing 
endpoints based on bootstrap tokens, leaving tokens, and the replication 
strategy. This calculation is performed on topology changes and cached per 
keyspace, so the hot write path only performs a map lookup.
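
A rough sketch of what the derived grouping might look like (hypothetical 
helper; real code would compute the pending-to-natural pairing from 
{{TokenMetadata}} and the replication strategy on each topology change, then 
cache the result per keyspace so the write path does only a lookup):

```python
def build_slots(natural, pending_to_natural):
    """Group replicas into slots for one range.
    natural: current replicas for the range.
    pending_to_natural: pending endpoint -> the natural replica whose
    slot it is entering (derived from TokenMetadata in real code)."""
    slots = []
    partnered = set(pending_to_natural.values())
    for ep in natural:
        if ep in partnered:
            # transitioning slot: existing owner plus its pending endpoint(s)
            joiners = [p for p, n in pending_to_natural.items() if n == ep]
            slots.append((ep, *joiners))
        else:
            slots.append((ep,))  # stable slot: single endpoint
    return slots

# n4 is bootstrapping into n3's position:
assert build_slots(["n1", "n2", "n3"], {"n4": "n3"}) == \
    [("n1",), ("n2",), ("n3", "n4")]
```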

Feature flag:
 * {{replica_slot_grouping_enabled}} (cassandra.yaml, runtime-toggleable via 
JMX)
 * Default: disabled (preserves existing behavior)
 * When disabled: zero code path changes, no performance impact
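
Assuming the flag name proposed above, the cassandra.yaml entry might look 
like:

```yaml
# Proposed in this ticket; disabled by default so the existing
# write-path behavior is completely unchanged.
replica_slot_grouping_enabled: false
```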

Compatibility:
 * Purely additive; no wire protocol or schema changes
 * Feature flag ensures safe rollout and rollback
 * Can be backported to 4.x and 5.x



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
