Having a few requests time out while the service detects badness is typical in this kind of system. I don't think writing a completely separate StorageProxy + supporting classes to avoid this, at the cost of RF times the network bandwidth, is a good idea.
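For anyone following along, the read path in question behaves roughly like the toy simulation below (made-up types, not the actual StorageProxy code): the coordinator asks one replica for the full row and the others for MD5 digests, so the read can finish no sooner than that single data response.

import java.security.MessageDigest;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class DigestReadDemo {

    // A fake replica that answers any request after a fixed latency.
    record Replica(String name, long latencyMs) {
        CompletableFuture<byte[]> fetch(String key, ExecutorService pool) {
            return CompletableFuture.supplyAsync(() -> {
                try {
                    TimeUnit.MILLISECONDS.sleep(latencyMs);
                } catch (InterruptedException e) {
                    throw new RuntimeException(e);
                }
                return ("value-for-" + key).getBytes();
            }, pool);
        }
    }

    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // One slow replica (say, mid-compaction) and two fast ones; RF = 3.
        List<Replica> replicas = List.of(
                new Replica("slow-node", 500),
                new Replica("fast-node-1", 5),
                new Replica("fast-node-2", 5));
        ExecutorService pool = Executors.newCachedThreadPool();

        long start = System.nanoTime();
        // Full data from the first ("closest") replica, digests from the rest.
        CompletableFuture<byte[]> data = replicas.get(0).fetch("k", pool);
        List<CompletableFuture<byte[]>> digests =
                replicas.subList(1, replicas.size()).stream()
                        .map(r -> r.fetch("k", pool).thenApply(DigestReadDemo::md5))
                        .toList();

        digests.forEach(CompletableFuture::join);   // done after ~5 ms
        byte[] row = data.join();                   // gates the read: ~500 ms
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("read of " + row.length + " bytes took ~" + elapsedMs
                + " ms, dictated by the data replica");
        pool.shutdown();
    }
}

Both digests arrive in about 5 ms, but the read still takes about 500 ms because the full-data response gates it. Sending full-data requests to two replicas instead of one (as Mason proposes below) would reduce the exposure to one slow node, at the cost of shipping the row roughly twice, and generalizing that to all replicas costs RF times the bandwidth, as noted above.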
On Wed, Jul 7, 2010 at 11:23 AM, Mason Hale <ma...@onespot.com> wrote:
> We've been experiencing some cluster-wide performance issues if any single
> node in the cluster is performing poorly. For example, this occurs if
> compaction is running on any node in the cluster, or if a new node is being
> bootstrapped.
>
> We believe the root cause of this issue is a performance optimization in
> Cassandra that requests the "full" data from only a single node in the
> cluster, and MD5 checksums of the same data from all other nodes (depending
> on the consistency level of the read) for a given request. The net effect of
> this optimization is that the read will block until the data is received
> from the node that is replying with the full data, even if all other nodes
> are responding much more quickly. Thus the entire cluster is only as fast as
> the slowest node for some fraction of all requests. The portion of requests
> sent to the slow node is a function of the total cluster size, replication
> factor, and consistency level. For smallish clusters (e.g. 10 or fewer
> servers) this performance degradation can be quite pronounced.
>
> CASSANDRA-981 (https://issues.apache.org/jira/browse/CASSANDRA-981)
> discusses this issue and proposes the solution of dynamically identifying
> slow nodes and automatically treating them as if they were on a remote
> network, thus preventing certain performance-critical operations (such as
> full data requests) from being performed on that node. This seems like a
> fine solution.
>
> However, a design that requires any read operation to wait on the reply
> from a specific single node seems counter to the fundamental design goal of
> avoiding any single point of failure. In this case, a single node with
> degraded performance (but still online) can dramatically reduce the overall
> performance of the cluster. The proposed solution would dynamically detect
> this condition and take evasive action when the condition is detected, but
> it would require some number of requests to perform poorly before a slow
> node is detected. It also smells like a complex solution that could have
> some unexpected side effects and edge cases.
>
> I wonder if a simpler solution would be more effective here? In the same
> way that hinted handoff can now be disabled via configuration, would it be
> feasible to optionally turn off this optimization? This way I can make the
> trade-off decision between the incremental performance improvement from
> this optimization and more reliable cluster-wide performance. Ideally, I
> would be able to configure how many nodes should reply with "full data" for
> each request. Thus I could increase this from 1 to 2 to avoid cluster-wide
> performance degradation if any single node is performing poorly. By being
> able to turn off or tune this setting I would also be able to do some A/B
> testing to evaluate what performance benefit is being gained by this
> optimization.
>
> I'm curious to know if anyone else has run into this issue, and if anyone
> else wishes they could turn off or tune this "full data"/MD5 performance
> optimization.
>
> thanks,
> Mason

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com