[
https://issues.apache.org/jira/browse/CASSANDRA-18120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17710169#comment-17710169
]
Maxim Chanturiay commented on CASSANDRA-18120:
----------------------------------------------
[~dsarisky], hi!
What do you think about the following solution?
Introduce a new method to the FailureDetector
(org.apache.cassandra.gms.FailureDetector):
{code:java}
public List<InetAddressAndPort> orderEndpointsByArrivalTimeDiff(List<InetAddressAndPort> endpoints)
{code}
The method will order the given endpoints in ascending order by their PHI
value (a minimal sketch follows the list):
* The first endpoints are those with the lowest PHI values
* Followed by the ones with greater PHI values
* And lastly the ones with no recorded PHI value at all.
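To make the idea concrete, here is a rough sketch of what the method could
look like inside FailureDetector. This is not tested code: the getPhi()
helper is hypothetical and assumes access to the detector's internal
arrivalSamples map of ArrivalWindow objects, and System.nanoTime() stands in
for whatever clock the detector actually uses:
{code:java}
// Sketch only. Orders endpoints ascending by PHI: lowest PHI
// (fastest-arriving heartbeats) first, endpoints with no PHI last.
public List<InetAddressAndPort> orderEndpointsByArrivalTimeDiff(List<InetAddressAndPort> endpoints)
{
    List<InetAddressAndPort> sorted = new ArrayList<>(endpoints);
    Comparator<InetAddressAndPort> byPhi =
        Comparator.comparing(this::getPhi, Comparator.nullsLast(Comparator.<Double>naturalOrder()));
    sorted.sort(byPhi);
    return sorted;
}

// Hypothetical helper: the endpoint's current PHI, or null when the
// detector has no arrival window for it yet.
private Double getPhi(InetAddressAndPort endpoint)
{
    ArrivalWindow window = arrivalSamples.get(endpoint);
    return window == null ? null : window.phi(System.nanoTime());
}
{code}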
A couple of assumptions here:
* The FailureDetector instance is the single place where PHI values are
calculated. I've validated this by finding no other class that implements the
IFailureDetector interface (org.apache.cassandra.gms.IFailureDetector).
* The lower the PHI value, the faster a node's heartbeats have been arriving.
This follows from the CASSANDRA-2597 explanation of ArrivalWindow and the
[official PHI
paper|https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.80.7427&rep=rep1&type=pdf].
With the new method, we can replace the Collections.shuffle() call in
ReplicaPlans -> filterBatchlogEndpoints() with code that takes the endpoints
sorted ascending by their PHI value and picks the first ones, i.e. the
currently fastest-responding nodes, along the lines of the sketch below.
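On the ReplicaPlans side, the replacement could look roughly like this.
Here "candidates" is a stand-in for the live endpoints that
filterBatchlogEndpoints() currently shuffles, 2 is the number of batchlog
replicas picked today, and it assumes the new method is exposed via
FailureDetector.instance:
{code:java}
// Sketch only. "candidates" stands in for the live batchlog candidates
// that filterBatchlogEndpoints() currently shuffles.
List<InetAddressAndPort> ordered =
    FailureDetector.instance.orderEndpointsByArrivalTimeDiff(candidates);
// Take the endpoints with the lowest PHI (fastest recent arrivals)
// instead of two random ones.
List<InetAddressAndPort> chosen = ordered.subList(0, Math.min(2, ordered.size()));
{code}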
> Single slow node dramatically reduces cluster write throughput regardless of
> CL
> -------------------------------------------------------------------------------
>
> Key: CASSANDRA-18120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18120
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Dan Sarisky
> Assignee: Maxim Chanturiay
> Priority: Normal
>
> We issue writes to Cassandra as logged batches (RF=3, consistency levels TWO,
> QUORUM, or LOCAL_QUORUM).
>
> On clusters of any size, a single extremely slow node causes a ~90% loss of
> cluster-wide throughput for batched writes. We can replicate this in the
> lab via CPU or disk throttling. I observe this in 3.11, 4.0, and 4.1.
>
> It appears the mechanism in play is:
> Those logged batches are immediately written to two replica nodes and the
> actual mutations aren't processed until those two nodes acknowledge the batch
> statements. Those replica nodes are selected randomly from all nodes in the
> local data center currently up in gossip. If a single node is slow, but
> still thought to be up in gossip, this eventually causes every other node to
> have all of its MutationStage threads waiting while the slow replica accepts
> batch writes.
>
> The code in play appears to be:
> See
> [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L245].
> In the method filterBatchlogEndpoints() there is a
> Collections.shuffle() to order the endpoints and a
> FailureDetector.isEndpointAlive() to test if the endpoint is acceptable.
>
> This behavior causes Cassandra to move from a multi-node fault-tolerant
> system to a collection of single points of failure.
>
> We try to take administrator actions to kill off the extremely slow nodes,
> but it would be great to have some notion of "what node is a bad choice" when
> writing log batches to replica nodes.
>