Emils,

We believe we've tracked it down to the following issue:
https://issues.apache.org/jira/browse/CASSANDRA-11302, introduced in 2.1.5.

We are running a build of 2.2.5 with that patch and so far have not seen
any more timeouts.

Mike

On Fri, Mar 4, 2016 at 3:14 AM, Emīls Šolmanis <emils.solma...@gmail.com>
wrote:

> Mike,
>
> Is that where you've bisected it to having been introduced?
>
> I'll see what I can do, but doubt it, since we've long since upgraded prod
> to 2.2.4 (and stage before that) and the tests I'm running were for a new
> feature.
>
>
> On Fri, 4 Mar 2016 03:54 Mike Heffner <m...@librato.com> wrote:
>
>> Emils,
>>
>> I realize this may be a big downgrade, but are your timeouts reproducible
>> under Cassandra 2.1.4?
>>
>> Mike
>>
>> On Thu, Feb 25, 2016 at 10:34 AM, Emīls Šolmanis <
>> emils.solma...@gmail.com> wrote:
>>
>>> Having had a read through the archives, I missed this at first, but this
>>> seems to be *exactly* like what we're experiencing.
>>>
>>> http://www.mail-archive.com/user@cassandra.apache.org/msg46064.html
>>>
>>> Only difference is we're getting this for reads and using CQL, but the
>>> behaviour is identical.
>>>
>>> On Thu, 25 Feb 2016 at 14:55 Emīls Šolmanis <emils.solma...@gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> We're having a problem with concurrent requests. It seems that whenever
>>>> we try resolving more than ~15 queries at the same time, one or two get
>>>> a read timeout and then succeed on a retry.
>>>>
>>>> We're running Cassandra 2.2.4 accessed via the 2.1.9 Datastax driver on
>>>> AWS.
>>>>
>>>> What we've found while investigating:
>>>>
>>>>  * this is not db-wide. When we try the same pattern against another
>>>> table, everything works fine.
>>>>  * it fails 1 or 2 requests regardless of how many are executed in
>>>> parallel, i.e., it's still 1 or 2 when we ramp it up to ~ 120 concurrent
>>>> requests and doesn't seem to scale up.
>>>>  * the problem is consistently reproducible. It happens both under
>>>> heavier load and when just firing off a single batch of requests for
>>>> testing.
>>>>  * tracing the faulty requests says everything is great. An example
>>>> trace: https://gist.github.com/emilssolmanis/41e1e2ecdfd9a0569b1a
>>>>  * the only peculiar thing in the logs is there's no acknowledgement of
>>>> the request being accepted by the server, as seen in
>>>> https://gist.github.com/emilssolmanis/242d9d02a6d8fb91da8a
>>>>  * there's nothing funny in the timed out Cassandra node's logs around
>>>> that time as far as I can tell, not even in the debug logs.
>>>>
>>>> Any ideas about what might be causing this, pointers to server config
>>>> options, or how else we might debug this would be much appreciated.
>>>>
>>>> Kind regards,
>>>> Emils
>>>>
>>>>
>>
>>
>> --
>>
>>   Mike Heffner <m...@librato.com>
>>   Librato, Inc.
>>
>>


-- 

  Mike Heffner <m...@librato.com>
  Librato, Inc.
