Thank you for the recommendation!

Most of the pending compactions are for another (~100 times larger) keyspace.
They are always running in the background.
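
Which keyspace the running compactions belong to can be checked with something
like the following (it prints the total pending count plus the keyspace/table
of each compaction currently in progress):

  nodetool compactionstats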

2018-03-16 13:28 GMT+05:00 Nicolas Guyomar <nicolas.guyo...@gmail.com>:

> Hi,
>
> You also have 62 pending compactions at the same time, which is odd for
> such a small dataset IMHO. Are you triggering 'nodetool compact' with some
> kind of cron job you may have forgotten after a test, or something else?
> Do you have any monitoring in place? If not, you could leave 'dstat
> -tnrvl 10' running for a while and look for anomalies (huge I/O wait at
> some point, blocked processes, etc.).
>
>
>
>
> On 16 March 2018 at 07:33, Dmitry Simonov <dimmobor...@gmail.com> wrote:
>
>> Hello!
>>
>> We are experiencing problems with Cassandra 2.2.8.
>> There is a cluster with 3 nodes.
>> The problematic keyspace has RF=3 and contains 3 tables (current table
>> sizes: 1 GB, 700 MB, 12 KB).
>>
>> Several times per day there are bursts of "READ messages were dropped ...
>> for internal timeout" messages in the logs (on every Cassandra node). Each
>> burst lasts 5-15 minutes.
>>
>> During the drop periods there is always a queue of pending ReadStage
>> tasks:
>>
>> Pool Name               Active   Pending     Completed   Blocked  All time blocked
>> ReadStage                   32        67    2976548410         0                 0
>> CompactionExecutor           2        62        802136         0                 0
>>
>> All other Active and Pending counters in tpstats are 0.
>>
>> During the drops, iostat shows no read requests to the disks, probably
>> because all data fits in the OS disk cache:
>>
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>           56,53    0,94   39,84    0,01    0,00    2,68
>>
>> Device:   rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> sda         0,00    11,00    0,00   26,00     0,00     9,09   715,92     0,78   30,31    0,00   30,31   2,46   6,40
>> sdb         0,00    11,00    0,00   33,00     0,00    10,57   655,70     0,83   26,00    0,00   26,00   2,00   6,60
>> sdc         0,00     1,00    0,00   30,50     0,00    10,98   737,07     0,91   30,49    0,00   30,49   2,10   6,40
>> sdd         0,00    31,50    0,00   35,00     0,00    11,17   653,50     0,98   28,17    0,00   28,17   1,83   6,40
>> sde         0,00    31,50    0,00   34,50     0,00    10,82   642,10     0,67   19,54    0,00   19,54   1,39   4,80
>> sdf         0,00     1,00    0,00   24,50     0,00     9,71   811,78     0,60   24,33    0,00   24,33   1,88   4,60
>> sdg         0,00     1,00    0,00   23,00     0,00     8,93   795,15     0,51   22,26    0,00   22,26   1,91   4,40
>> sdh         0,00     1,00    0,00   21,50     0,00     8,37   797,05     0,45   21,02    0,00   21,02   1,86   4,00
>>
>> Disks are SSDs.
>>
>> Just before the drops, the "Local write count" for the problematic table
>> increases very fast (10k-30k/sec, while the ordinary write rate is
>> 10-30/sec) for about 1 minute. After that, the drops start.
>>
>> We tried using probabilistic tracing to determine which requests cause the
>> "write count" to increase, but we see no "batch_mutate" queries at all,
>> only reads!
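>>
>> (For clarity, "probabilistic tracing" here refers to nodetool's trace
>> sampling; enabling it and reading back the sampled requests looks roughly
>> like this, the probability being illustrative:
>>
>>   nodetool settraceprobability 0.001
>>   cqlsh -e "SELECT session_id, duration, request FROM system_traces.sessions LIMIT 20;"
>> )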
>>
>> There are no GC warnings about long pauses.
>>
>> Could you please help troubleshoot the issue?
>>
>> --
>> Best Regards,
>> Dmitry Simonov
>>
>
>


-- 
Best Regards,
Dmitry Simonov
