> Is there an order in which the events you described happened, or is the
> order in which you presented them the order in which you noticed things
> going wrong?

At first, the Thrift thread count starts increasing.
After 2 or 3 minutes, these threads consume all CPU cores.
After that, simultaneously: messages start being dropped, read latency
increases, and active read tasks show up.
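
For reference, here is roughly how the timing can be sampled on each node (a
minimal sketch; PID discovery, the thread-name pattern, and the log path are
my own assumptions):

    # Sample the Thrift thread count and load average every 10 seconds.
    CASS_PID=$(pgrep -f CassandraDaemon | head -1)
    while true; do
        printf '%s thrift_threads=%s load1=%s\n' \
            "$(date -Is)" \
            "$(jstack "$CASS_PID" | grep -c 'Thrift')" \
            "$(cut -d' ' -f1 /proc/loadavg)"
        sleep 10
    done >> /tmp/cassandra-thread-samples.log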

On Fri, 28 Jun 2019 at 01:40, Avinash Mandava <avin...@vorstella.com> wrote:

> Yeah, I skimmed too fast. Don't add more work if the CPU is pegged, and if
> you're using the Thrift protocol, NTR would not have values.
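>
> For the Thrift side, the generic pool view still works; as a sketch (exact
> pool names vary by Cassandra version, and on Thrift-only clusters the
> Native-Transport-Requests row may be absent):
>
>     watch -n 10 nodetool tpstats
>
> shows active/pending/blocked counts for the pools that do exist
> (ReadStage, etc.).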
>
> Is there an order in which the events you described happened, or is the
> order in which you presented them the order in which you noticed things
> going wrong?
>
> On Thu, Jun 27, 2019 at 1:29 PM Dmitry Simonov <dimmobor...@gmail.com>
> wrote:
>
>> Thanks for your reply!
>>
>> > Have you tried increasing concurrent reads until you see more activity
>> > on disk?
>> When the problem occurs, the 1.2k - 2k freshly created Thrift threads
>> consume all CPU on all cores.
>> Would increasing concurrent reads help in this situation?
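>>
>> For context, concurrent_reads is a cassandra.yaml setting (the path below
>> is an assumption, and changing it needs a node restart on our version):
>>
>>     # Show the current setting on this node
>>     grep -E '^concurrent_reads' /etc/cassandra/cassandra.yaml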
>>
>> > org.apache.cassandra.metrics.type=ThreadPools.path=transport.scope=Native-Transport-Requests.name=TotalBlockedTasks.Count
>> This metric is 0 on all cluster nodes.
>>
>> On Fri, 28 Jun 2019 at 00:34, Avinash Mandava <avin...@vorstella.com> wrote:
>>
>>> Have you tried increasing concurrent reads until you see more activity
>>> on disk? If you've always got 32 active reads and a high pending-read
>>> count, it could just be dropping reads because the queues are saturated.
>>> It could be artificially bottlenecking at the C* process level.
>>>
>>> Also, what does this metric show over time:
>>>
>>>
>>> org.apache.cassandra.metrics.type=ThreadPools.path=transport.scope=Native-Transport-Requests.name=TotalBlockedTasks.Count
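>>>
>>> If you don't already graph JMX, a rough way to capture it over time (a
>>> sketch; the grep pattern and log path are my own choices):
>>>
>>>     while true; do
>>>         date -Is
>>>         nodetool tpstats | grep 'Native-Transport-Requests'
>>>         sleep 30
>>>     done >> /tmp/ntr-tpstats.log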
>>>
>>>
>>>
>>> On Thu, Jun 27, 2019 at 1:52 AM Dmitry Simonov <dimmobor...@gmail.com>
>>> wrote:
>>>
>>>> Hello!
>>>>
>>>> We have run into the following problem several times.
>>>>
>>>> A Cassandra cluster (5 nodes) becomes unresponsive for ~30 minutes:
>>>> - all CPUs are at 100% load (normally we have a load average of 5 on a
>>>> 16-core machine)
>>>> - Cassandra's thread count rises from 300 to 1300 - 2000; most of them
>>>> are Thrift threads in java.net.SocketInputStream.socketRead0(Native
>>>> Method), while the count of other threads doesn't increase (see the
>>>> jstack sketch below)
>>>> - some Read messages are dropped
>>>> - read latency (p99.9) increases to 20-30 seconds
>>>> - there are up to 32 active Read Tasks and 3k - 6k pending Read Tasks
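>>>>
>>>> For reference, the thread counts above come from jstack dumps, roughly
>>>> like this (PID discovery is my own assumption; jstack ships with the JDK):
>>>>
>>>>     jstack "$(pgrep -f CassandraDaemon | head -1)" > /tmp/cassandra.jstack
>>>>     # Count threads parked in the native socket read
>>>>     grep -c 'java.net.SocketInputStream.socketRead0' /tmp/cassandra.jstack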
>>>>
>>>> The problem starts simultaneously on all nodes of the cluster.
>>>> I cannot tie it to increased load from the clients (the "read rate"
>>>> doesn't increase during the problem).
>>>> The disks also look fine (I/O latencies are OK).
>>>>
>>>> Could anybody please give some advice on further troubleshooting?
>>>>
>>>> --
>>>> Best Regards,
>>>> Dmitry Simonov
>>>>
>>>
>>>
>>> --
>>> www.vorstella.com
>>> 408 691 8402
>>>
>>
>>
>> --
>> Best Regards,
>> Dmitry Simonov
>>
>
>
> --
> www.vorstella.com
> 408 691 8402
>


-- 
Best Regards,
Dmitry Simonov
