Yes, LOCAL_QUORUM is usually sufficient, and I use this consistency level often, but at times a stronger consistency level is desired. Even within a single DC, blocking a thread from a fixed-size pool on multiple network calls is less than ideal.
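To make the concern concrete, here is a minimal sketch in Java of the two models I have in mind. It is not Cassandra code: `ReadStageSketch`, `remoteRead`, `runBlocking`, and `runAsync` are hypothetical names, the replica is simulated with a timer, and the pool size and latency are made-up numbers. It only illustrates why a fixed-size pool whose threads wait on remote replies drains more slowly than one whose threads merely dispatch the request and let a callback complete the read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative only: a toy coordinator, not Cassandra's StorageProxy.
class ReadStageSketch {

    // Hypothetical stand-in for a replica in a distant DC: replies after latencyMs.
    static CompletableFuture<String> remoteRead(ScheduledExecutorService timer, long latencyMs) {
        CompletableFuture<String> reply = new CompletableFuture<>();
        timer.schedule(() -> reply.complete("row"), latencyMs, TimeUnit.MILLISECONDS);
        return reply;
    }

    // Runs `reads` simulated reads through a fixed pool and returns elapsed milliseconds.
    static long run(boolean blockWorker, int poolSize, int reads, long latencyMs) {
        ExecutorService readStage = Executors.newFixedThreadPool(poolSize);
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        List<CompletableFuture<String>> results = new ArrayList<>();
        long start = System.nanoTime();
        for (int i = 0; i < reads; i++) {
            if (blockWorker) {
                // Blocking model: the pool thread is held until the replica replies.
                results.add(CompletableFuture.supplyAsync(
                        () -> remoteRead(timer, latencyMs).join(), readStage));
            } else {
                // Asynchronous model: the pool thread only dispatches the request;
                // the reply completes the result later without holding a pool thread.
                results.add(CompletableFuture.supplyAsync(
                        () -> remoteRead(timer, latencyMs), readStage)
                        .thenCompose(reply -> reply));
            }
        }
        CompletableFuture.allOf(results.toArray(new CompletableFuture[0])).join();
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        readStage.shutdown();
        timer.shutdown();
        return elapsedMs;
    }

    static long runBlocking(int poolSize, int reads, long latencyMs) {
        return run(true, poolSize, reads, latencyMs);
    }

    static long runAsync(int poolSize, int reads, long latencyMs) {
        return run(false, poolSize, reads, latencyMs);
    }

    public static void main(String[] args) {
        // Assumed numbers: 2 pool threads, 8 reads, 50 ms simulated replica latency.
        System.out.println("blocking ms: " + runBlocking(2, 8, 50));
        System.out.println("async ms:    " + runAsync(2, 8, 50));
    }
}
```

With two pool threads and eight reads, the blocking variant has to work through the reads in roughly four rounds of replica latency, while the asynchronous variant has all eight requests in flight at once. A real coordinator does much more than this (digest checks, timeouts, read repair), so treat it as a picture of the threading question only.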
If lightweight transactions are being adopted warmly by the user base, then the need for strongly consistent reads increases. Operationally, we like running background jobs in low-traffic regions, or where most of our users are in bed for the night, so being sure we're getting the most accurate data globally is of very high importance. Some of these jobs might slip in before other mechanisms like read repair and nodetool repair have been able to clean things up.

So my question to the community is: if my assessment of the current threading during the read stage is correct, are there plans or thoughts on moving to a more asynchronous model for gathering read responses?

Also, this seems like a blind spot in current Cassandra metrics (at least in my 2.0.x world): we do not have metrics on how long queries are queued before being processed. The recent read latency metric continues to look normal even during high overall system latency.

On Fri, Jan 8, 2016 at 7:03 PM, Russell Bradberry <rbradbe...@gmail.com> wrote:

> While using LOCAL_QUORUM may be a solution in a lot of use cases, there
> are definitely use cases where reading at full QUORUM is required, think
> finance, medical, military. I think for these types of use cases using non
> blocking behavior will be an incredible improvement in performance. Even
> for LOCAL_QUORUM it would be a great improvement.
> Simply stating to use LQ makes it seem like this use case is meritless,
> when it surely is not.
> Plus anything that improves performance is, in my opinion, a good idea.
> Whether or not it is worth the development investment is not something I
> can speak on.
>
> On Fri, Jan 8, 2016 at 4:48 PM -0800, "Jonathan Haddad" <j...@jonhaddad.com>
> wrote:
>
> Use local quorum, don't talk to remote dcs.
>
> On Fri, Jan 8, 2016 at 1:41 PM Dominic Chevalier wrote:
>
> > Hello,
> >
> > tldr;
> >
> > It looks like StorageProxy.fetchRows blocking for read responses can get
> > pretty bad during quorum reads involving many geographically distant data
> > centers. If this is true, why doesn't the coordinator handle replies
> > asynchronously to keep over all throughput up?
> >
> > Long;
> >
> > I'm running apache cassandra 2.0.16 with ~400 nodes total, spread
> > throughout 5 AWS regions globally.
> >
> > I tried running many (hundreds) simultaneous paged range scans over large
> > token ranges, 'select * from table where token(partition_key) >= ? and
> > token(partition_key) < ?' at consistency level QUORUM. Replication factor
> > 3. Row size is small, a few hundred bytes max.
> >
> > This caused the cassandra nodes in the local data center hosting the
> > application to become quite sluggish to other queries. Upon investigation
> > of the code, it looks like, and comments say the same, that
> > StorageProxy.fetchRows blocks for reads, even if the read comes from a
> > remote node.
> >
> > Based on the behavior I observed, and the impact on other queries, I
> > suspected the quorum reads were blocking the read stage executor pool of
> > the coordinator nodes.
> >
> > If I've drawn the correct conclusions, why does the read stage block for
> > reads from other nodes, especially nodes in remote datacenter where
> > latency is not small, rather than asynchronously processing read replies
> > and freeing up the read stage threads?
> >
> > I came across https://issues.apache.org/jira/browse/CASSANDRA-10989 which
> > seems to target performance improvements in the threading model, which
> > made me more curious about the above question.
> >
> > Thoughts and info are greatly appreciated.
> >
> > Kindly,
> > Dominic