Use local quorum, don't talk to remote dcs. On Fri, Jan 8, 2016 at 1:41 PM Dominic Chevalier <dccheval...@gmail.com> wrote:
> Hello, > > tldr; > > It looks like StorageProxy.fetchRows blocking for read responses can get > pretty bad during quorum reads involving many geographically distant data > centers. If this is true, why doesn't the coordinator handle replies > asynchronously to keep over all throughput up? > > Long; > > I'm running apache cassandra 2.0.16 with ~400 nodes total, spread > throughout 5 AWS regions globally. > > I tried running many (hundreds) simultaneous paged range scans over large > token ranges, 'select * from table where token(partition_key) >= ? and > token(partition_key) < ?' at consistency level QUORUM. Replication factor > 3. Row size is small, a few hundred bytes max. > > This caused the cassandra nodes in the local data center hosting the > application to become quite sluggish to other queries. Upon investigation > of the code, it looks like, and comments say the same, that > StorageProxy.fetchRows blocks for reads, even if the read comes from a > remote node. > > Based on the behavior I observed, and the impact on other queries, I > suspected the quorum reads were blocking the read stage executor pool of > the coordinator nodes. > > If I've drawn the correct conclusions, why does the read stage block for > reads from other nodes, especially nodes in remote datacenter where latency > is not small, rather than asynchronously processing read replies and > freeing up the read stage threads? > > I came across https://issues.apache.org/jira/browse/CASSANDRA-10989 which > seems to target performance improvements in the threading model, which made > me more curious about the above question. > > Thoughts and info are greatly appreciated. > > Kindly, > Dominic >