Benedict Elliott Smith created CASSANDRA-20878:
--------------------------------------------------
Summary: Improve Accord Observability
Key: CASSANDRA-20878
URL: https://issues.apache.org/jira/browse/CASSANDRA-20878
Project: Apache Cassandra
Issue Type: Improvement
Components: Accord
Reporter: Benedict Elliott Smith
Improve Observability:
- Track all active Coordinations
- Refactor Replica/Coordinator metrics and report Coordinator
exhausted/preempted/timeout
- DurabilityQueue metrics and visibility
Also Fix:
- WaitingState can cause a distributed stall when asked to wait for
CanApply if not yet PreCommitted; track a separate querying state and advance
this to the next achievable state rather than the desired final state
- Stalled coordinators should not prevent recovery
- Edge case with fetch unable to make progress when pre-bootstrap and all
peers have GC'd
- Dependency initialisation for sync points across certain ownership
changes
- SyncPoint propagation may not include all of the epochs required on the
receiving node for ranges it has lost but not closed, and the receiving node
does not validate them
- Stable tracker accounting with LocalExecute
- Do not prune non-durable APPLIED, as it must be reported in dependencies
until durably applied (so as not to break recovery)
- Ensure we cannot race with replies when initiating Coordination
- ProgressLog does not guarantee to clear home or waiting states when
erased or invalidated by compaction
- WaitingState on non-home shard cannot guarantee progress once home shard
is Erased
- WaitingOnSync handles retired ranges incorrectly
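
The WaitingState item above can be read as follows: instead of querying peers for the final desired state straight away, the waiter queries for the next state it can actually reach from where it is now. The sketch below is only an illustration of that reading. The class, enum, and method names are hypothetical and do not mirror the Accord code, and the linear list of states is a simplifying assumption.

```java
// Hypothetical, simplified model of the idea; not the Accord implementation.
final class WaitingSketch {
    // Assumed linear progression of local states (a simplification).
    enum Status { NOT_DEFINED, PRE_ACCEPTED, PRE_COMMITTED, STABLE, CAN_APPLY }

    /**
     * Returns the state to query peers for: one step beyond `current`,
     * and never beyond `desired`.
     */
    static Status nextQueryTarget(Status current, Status desired) {
        // Already at or past the goal: nothing further to advance to.
        if (current.compareTo(desired) >= 0) return desired;
        // Otherwise ask only for the immediately following state.
        return Status.values()[current.ordinal() + 1];
    }

    public static void main(String[] args) {
        Status target = nextQueryTarget(Status.PRE_ACCEPTED, Status.CAN_APPLY);
        System.out.println("query for: " + target);
    }
}
```

Under these assumptions, a waiter that is only PRE_ACCEPTED but wants CAN_APPLY first queries for PRE_COMMITTED, so it is never blocked asking for a state it cannot yet reach.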
Also Improve:
- Standardise failure accounting; use null to represent single-reply
timeouts
- BurnTest record/replay to/from file
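
As a rough illustration of the failure accounting item, the sketch below treats a null cause as "this one reply timed out" and any non-null cause as a real failure. FailureTally and its methods are hypothetical names chosen for this example; they are an assumption about the shape of the idea, not the Accord API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tally; illustrates the null-means-timeout convention only.
final class FailureTally {
    private int timeouts;
    private final List<Throwable> failures = new ArrayList<>();

    // A null cause is interpreted as a single reply timing out.
    void onFailure(Throwable cause) {
        if (cause == null) timeouts++;
        else failures.add(cause);
    }

    int timeouts() { return timeouts; }
    int failures() { return failures.size(); }
    int total()    { return timeouts + failures.size(); }

    public static void main(String[] args) {
        FailureTally tally = new FailureTally();
        tally.onFailure(null);                          // reply timed out
        tally.onFailure(new RuntimeException("boom"));  // reply failed
        System.out.println(tally.timeouts() + " timeout(s), " + tally.failures() + " failure(s)");
    }
}
```

With this convention, a single callback handles both outcomes without a dedicated timeout exception type.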