Runtian opened a new pull request, #4711:
URL: https://github.com/apache/cassandra/pull/4711
…ting node to be marked down and the liveness check to be ineffective
assassinateEndpoint (since CASSANDRA-15059) ran entirely inside
runInGossipStageBlocking, including a 30-second RING_DELAY sleep. This blocked
the single-threaded GOSSIP stage, causing two issues:
1. Liveness check is ineffective — the target's heartbeat cannot be
updated while the GOSSIP stage is sleeping, so the check always passes, even
for live nodes.
2. Executing node marked DOWN — peers' failure detectors convict the
executor because its GOSSIP stage is unresponsive for ~34s.
Fix: Move the heartbeat snapshot and the sleep onto the caller (JMX) thread,
keeping the GOSSIP stage free. The GOSSIP stage is entered only briefly, to
verify that the heartbeat has not changed since the snapshot and to perform
the assassination. The post-assassination propagation wait is also moved to
the caller thread.
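
The threading pattern described in the fix can be sketched as follows. This is a minimal illustration, not the code in this pull request: `AssassinateSketch`, `gossipStage`, `targetHeartbeat`, `simulateHeartbeat` and the method signature are all hypothetical stand-ins, and a plain single-threaded executor plays the role of the GOSSIP stage.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; names do not correspond to Cassandra's Gossiper API.
class AssassinateSketch {
    // Stand-in for the single-threaded GOSSIP stage (daemon thread so the JVM can exit).
    static final ExecutorService gossipStage = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "GossipStageSketch");
        t.setDaemon(true);
        return t;
    });

    // Stand-in for the target endpoint's heartbeat version.
    static final AtomicInteger targetHeartbeat = new AtomicInteger(0);

    // Simulates a gossip message from the target being processed on the stage.
    static void simulateHeartbeat() {
        gossipStage.execute(() -> targetHeartbeat.incrementAndGet());
    }

    // Returns true if the target stayed silent and would be assassinated,
    // false if its heartbeat advanced during the wait.
    static boolean assassinate(long ringDelayMillis, long propagationMillis) {
        // 1. Snapshot the heartbeat on the caller thread.
        final int snapshot = targetHeartbeat.get();
        try {
            // 2. Sleep on the caller thread; the stage stays free to process heartbeats.
            Thread.sleep(ringDelayMillis);

            // 3. Enter the stage only briefly to verify and act.
            boolean assassinated = gossipStage.submit(() -> {
                if (targetHeartbeat.get() != snapshot) {
                    return false; // target is alive, abort
                }
                // The actual state change would happen here.
                return true;
            }).get();

            // 4. Wait for propagation on the caller thread as well.
            if (assassinated) {
                Thread.sleep(propagationMillis);
            }
            return assassinated;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }
}
```

If the sleep in step 2 ran inside a task submitted to `gossipStage`, the task queued by `simulateHeartbeat` could not run until the sleep finished, so the heartbeat comparison would always see the snapshot value. That corresponds to the first issue listed above.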
The [Cassandra
Jira](https://issues.apache.org/jira/projects/CASSANDRA/issues/CASSANDRA-21249)