Emils,

We believe we've tracked it down to the following issue:
https://issues.apache.org/jira/browse/CASSANDRA-11302, introduced in 2.1.5.

We are running a build of 2.2.5 with that patch and so far have not seen
any more timeouts.

Mike

On Fri, Mar 4, 2016 at 3:14 AM, Emīls Šolmanis <emils.solma...@gmail.com>
wrote:

> Mike,
>
> Is that where you've bisected it to having been introduced?
>
> I'll see what I can do, but doubt it, since we've long since upgraded prod
> to 2.2.4 (and stage before that) and the tests I'm running were for a new
> feature.
>
>
> On Fri, 4 Mar 2016 03:54 Mike Heffner, <m...@librato.com> wrote:
>
>> Emils,
>>
>> I realize this may be a big downgrade, but are your timeouts reproducible
>> under Cassandra 2.1.4?
>>
>> Mike
>>
>> On Thu, Feb 25, 2016 at 10:34 AM, Emīls Šolmanis <
>> emils.solma...@gmail.com> wrote:
>>
>>> Having had a read through the archives, I missed this at first, but this
>>> seems to be *exactly* like what we're experiencing.
>>>
>>> http://www.mail-archive.com/user@cassandra.apache.org/msg46064.html
>>>
>>> Only difference is we're getting this for reads and using CQL, but the
>>> behaviour is identical.
>>>
>>> On Thu, 25 Feb 2016 at 14:55 Emīls Šolmanis <emils.solma...@gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> We're having a problem with concurrent requests. It seems that whenever
>>>> we try resolving more
>>>> than ~ 15 queries at the same time, one or two get a read timeout and
>>>> then succeed on a retry.
>>>>
>>>> We're running Cassandra 2.2.4 accessed via the 2.1.9 Datastax driver on
>>>> AWS.
>>>>
>>>> What we've found while investigating:
>>>>
>>>> * this is not db-wide. Trying the same pattern against another table
>>>> everything works fine.
>>>> * it fails 1 or 2 requests regardless of how many are executed in
>>>> parallel, i.e., it's still 1 or 2 when we ramp it up to ~ 120 concurrent
>>>> requests and doesn't seem to scale up.
>>>> * the problem is consistently reproducible. It happens both under
>>>> heavier load and when just firing off a single batch of requests for
>>>> testing.
>>>> * tracing the faulty requests says everything is great. An example
>>>> trace: https://gist.github.com/emilssolmanis/41e1e2ecdfd9a0569b1a
>>>> * the only peculiar thing in the logs is there's no acknowledgement of
>>>> the request being accepted by the server, as seen in
>>>> https://gist.github.com/emilssolmanis/242d9d02a6d8fb91da8a
>>>> * there's nothing funny in the timed out Cassandra node's logs around
>>>> that time as far as I can tell, not even in the debug logs.
>>>>
>>>> Any ideas about what might be causing this, pointers to server config
>>>> options, or how else we might debug this would be much appreciated.
>>>>
>>>> Kind regards,
>>>> Emils
>>>>
>>>>
>>
>>
>> --
>>
>> Mike Heffner <m...@librato.com>
>> Librato, Inc.
>>
>>

--
Mike Heffner <m...@librato.com>
Librato, Inc.
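
For anyone wanting to check a patched build the same way, the test described in the original report (fire a single batch of concurrent reads, count how many time out on the first attempt, and see whether they succeed on a retry) could look roughly like the sketch below. It is only an illustration under stated assumptions: ReadTimeout, run_batch and make_flaky_query are hypothetical names, not part of any driver, and the query is simulated so the script runs without a cluster. In a real test the query function would wrap the driver call and translate the driver's read-timeout error.

```python
import concurrent.futures
import threading


class ReadTimeout(Exception):
    """Hypothetical stand-in for the driver's read-timeout error."""


def run_batch(query_fn, n_requests, max_retries=1):
    """Run n_requests calls to query_fn concurrently.

    Returns (requests that timed out at least once,
             requests that still failed after all retries).
    """
    def attempt(i):
        timed_out = False
        for _ in range(max_retries + 1):
            try:
                query_fn(i)
                return timed_out, False
            except ReadTimeout:
                timed_out = True
        return timed_out, True

    # One worker per request, so the whole batch is in flight at once.
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_requests) as pool:
        results = list(pool.map(attempt, range(n_requests)))

    timed_out_once = sum(1 for timed_out, _ in results if timed_out)
    hard_failures = sum(1 for _, failed in results if failed)
    return timed_out_once, hard_failures


def make_flaky_query(bad_ids):
    """Simulated query (assumption): ids in bad_ids time out on first call only."""
    seen = set()
    lock = threading.Lock()

    def query(i):
        with lock:
            first_call = i not in seen
            seen.add(i)
        if first_call and i in bad_ids:
            raise ReadTimeout(f"request {i} timed out")
        return i

    return query


if __name__ == "__main__":
    # Batch sizes taken from the report: ~15 and ~120 concurrent requests.
    for n in (15, 120):
        print(n, run_batch(make_flaky_query({3, 7}), n))
```

Running the same batch at both sizes and comparing the first count is a quick way to see whether the number of timeouts stays flat, as in the report, or drops to zero on a patched build.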