Hi,

You also have 62 pending compactions at the same time, which is odd for such a small dataset IMHO. Are you triggering 'nodetool compact' via some kind of cron job you may have forgotten about after a test, or is something else scheduling it? Do you have any monitoring in place? If not, you could leave 'dstat -tnrvl 10' running for a while and look for inconsistencies (huge I/O wait at some point, blocked processes, etc.).
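For example, to rule out a forgotten cron entry and keep a record of the dstat output, something along these lines should do (paths and filenames are only examples):

    # check every user's crontab plus the system cron directories for nodetool
    # (needs root to read other users' crontabs)
    for u in $(cut -d: -f1 /etc/passwd); do crontab -l -u "$u" 2>/dev/null | grep -n nodetool; done
    grep -rn nodetool /etc/cron* 2>/dev/null

    # leave dstat running in the background, logging a CSV copy for later correlation
    nohup dstat -tnrvl 10 --output /tmp/dstat-$(hostname).csv >/dev/null 2>&1 &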
On 16 March 2018 at 07:33, Dmitry Simonov <dimmobor...@gmail.com> wrote:
> Hello!
>
> We are experiencing problems with Cassandra 2.2.8.
> There is a cluster with 3 nodes.
> The problematic keyspace has RF=3 and contains 3 tables (current table sizes:
> 1Gb, 700Mb, 12Kb).
>
> Several times per day there are bursts of "READ messages were dropped ...
> for internal timeout" messages in the logs (on every Cassandra node).
> Duration: 5 - 15 minutes.
>
> During periods of drops there is always a queue of pending ReadStage tasks:
>
> Pool Name            Active  Pending   Completed  Blocked  All time blocked
> ReadStage                32       67  2976548410        0                 0
> CompactionExecutor        2       62      802136        0                 0
>
> The other Active and Pending counters in tpstats are 0.
>
> During the drops iostat shows no read requests to the disks, probably
> because all data fits in the disk cache:
>
> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>           56,53   0,94    39,84     0,01    0,00   2,68
>
> Device:  rrqm/s  wrqm/s   r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
> sda        0,00   11,00  0,00  26,00   0,00   9,09    715,92      0,78  30,31     0,00    30,31   2,46   6,40
> sdb        0,00   11,00  0,00  33,00   0,00  10,57    655,70      0,83  26,00     0,00    26,00   2,00   6,60
> sdc        0,00    1,00  0,00  30,50   0,00  10,98    737,07      0,91  30,49     0,00    30,49   2,10   6,40
> sdd        0,00   31,50  0,00  35,00   0,00  11,17    653,50      0,98  28,17     0,00    28,17   1,83   6,40
> sde        0,00   31,50  0,00  34,50   0,00  10,82    642,10      0,67  19,54     0,00    19,54   1,39   4,80
> sdf        0,00    1,00  0,00  24,50   0,00   9,71    811,78      0,60  24,33     0,00    24,33   1,88   4,60
> sdg        0,00    1,00  0,00  23,00   0,00   8,93    795,15      0,51  22,26     0,00    22,26   1,91   4,40
> sdh        0,00    1,00  0,00  21,50   0,00   8,37    797,05      0,45  21,02     0,00    21,02   1,86   4,00
>
> Disks are SSDs.
>
> Before these drops, the "Local write count" for the problematic table increases
> very fast (10k-30k/sec, while the ordinary write rate is 10-30/sec) for about
> 1 minute. After that the drops start.
>
> I tried using probabilistic tracing to determine which requests cause the
> "write count" to increase, but I see no "batch_mutate" queries at all, only
> reads!
>
> There are no GC warnings about long pauses.
>
> Could you please help troubleshoot the issue?
>
> --
> Best Regards,
> Dmitry Simonov
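By the way, regarding the probabilistic tracing: if batch_mutate never shows up, it may be worth re-enabling tracing for a short window around one of the write bursts and then reading the trace tables directly. A rough sketch, assuming Cassandra 2.2's nodetool and cqlsh (the 1% probability and the LIMIT are just examples):

    # on each node, trace ~1% of requests while a burst is happening
    nodetool settraceprobability 0.01

    # then inspect what was captured, e.g. from cqlsh:
    #   SELECT session_id, request, started_at FROM system_traces.sessions LIMIT 50;
    #   SELECT activity, source, source_elapsed FROM system_traces.events
    #     WHERE session_id = <session_id from the query above>;

    # switch it back off afterwards, since the traces themselves generate writes
    nodetool settraceprobability 0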