We've been experiencing cluster-wide performance issues whenever any single
node in the cluster is performing poorly. For example, this occurs when
compaction is running on any node in the cluster, or when a new node is being
bootstrapped.

We believe the root cause is a performance optimization in Cassandra's read
path: for a given request, the "full" data is requested from only a single
node, while the other replicas involved (depending on the consistency level
of the read) return only MD5 checksums of the same data. The net effect of
this optimization is that the read blocks until the full data arrives from
that one node, even if every other node has already responded. So, for some
fraction of all requests, the entire cluster is only as fast as its slowest
node. The fraction of requests that block on the slow node depends on the
total cluster size, the replication factor, and the consistency level. For
smallish clusters (e.g. 10 or fewer servers) this performance degradation can
be quite pronounced.
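
To illustrate what I mean, here's a toy sketch (mine, not Cassandra's actual
read path code; the node names and latencies are made up) of a read that asks
one replica for the full row and the others only for digests. No matter how
fast the digests come back, the response time is bounded by the full-data
replica. As a rough back-of-the-envelope: with RF=3 on a 10-node cluster, and
the full-data replica chosen more or less uniformly among the three replicas,
something like 1 in 10 reads would block on any given slow node.

    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.TimeUnit;

    public class DigestReadSketch {

        // Simulate a replica that replies with some payload after a fixed delay.
        static CompletableFuture<String> replyAfter(String replica, long millis, String payload) {
            return CompletableFuture.supplyAsync(() -> {
                try {
                    TimeUnit.MILLISECONDS.sleep(millis);
                } catch (InterruptedException e) {
                    throw new RuntimeException(e);
                }
                return replica + ":" + payload;
            });
        }

        public static void main(String[] args) {
            long start = System.nanoTime();

            // The slow node (say, one that is compacting) happens to be the one
            // asked for the full row; the other replicas only return digests.
            CompletableFuture<String> fullData = replyAfter("node1", 500, "full-row");
            List<CompletableFuture<String>> digests = List.of(
                    replyAfter("node2", 5, "md5-digest"),
                    replyAfter("node3", 5, "md5-digest"));

            // The digests arrive in ~5 ms, but the read still stalls here for
            // ~500 ms, because node1 is the only replica asked for the actual data.
            String row = fullData.join();
            digests.forEach(CompletableFuture::join);

            System.out.printf("read completed in %d ms with %s%n",
                    TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start), row);
        }
    }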

CASSANDRA-981 (https://issues.apache.org/jira/browse/CASSANDRA-981)
discusses this issue and proposes dynamically identifying slow nodes and
automatically treating them as if they were on a remote network, thus
preventing certain performance-critical operations (such as full data
requests) from being routed to them. This seems like a fine solution.

However, a design that requires a read operation to wait on the reply from
one specific node seems counter to the fundamental design goal of avoiding
single points of failure. In this case, a single node with degraded
performance (but still online) can dramatically reduce the overall
performance of the cluster. The proposed solution would dynamically detect
this condition and take evasive action, but some number of requests would
have to perform poorly before a slow node is identified. It also smells like
a complex solution that could have unexpected side effects and edge cases.

I wonder whether a simpler solution would be more effective here. In the same
way that hinted handoff can now be disabled via configuration, would it be
feasible to optionally turn off this optimization? That way I could make the
trade-off between the incremental performance improvement from this
optimization and more reliable cluster-wide performance. Ideally, I would be
able to configure how many nodes should reply with "full data" on each
request; increasing this from 1 to 2 would avoid cluster-wide performance
degradation when any single node is performing poorly. Being able to turn the
setting off or tune it would also let me do some A/B testing to measure how
much performance benefit the optimization actually provides.

I'm curious whether anyone else has run into this issue, and whether anyone
else wishes they could turn off or tune this "full data"/MD5 performance
optimization.

thanks,
Mason
