Hi,

You also have 62 pending compactions at the same time, which is odd for such a small dataset IMHO. Are you triggering 'nodetool compact' via some kind of cron job you may have forgotten about after a test, or is something else scheduling it? Do you have any monitoring in place? If not, you could leave 'dstat -tnrvl 10' running for a while and look for inconsistencies (huge I/O wait at some point, blocked processes, etc.).
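For example, to rule out a forgotten cron entry and keep a record of the dstat output, something along these lines should do (paths and filenames are only examples):

    # check every user's crontab plus the system cron directories for nodetool
    # (needs root to read other users' crontabs)
    for u in $(cut -d: -f1 /etc/passwd); do crontab -l -u "$u" 2>/dev/null | grep -n nodetool; done
    grep -rn nodetool /etc/cron* 2>/dev/null

    # leave dstat running in the background, logging a CSV copy for later correlation
    nohup dstat -tnrvl 10 --output /tmp/dstat-$(hostname).csv >/dev/null 2>&1 &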
On 16 March 2018 at 07:33, Dmitry Simonov <dimmobor...@gmail.com> wrote:
> Hello!
>
> We are experiencing problems with Cassandra 2.2.8.
> There is a cluster with 3 nodes.
> The problematic keyspace has RF=3 and contains 3 tables (current table sizes:
> 1Gb, 700Mb, 12Kb).
>
> Several times per day there are bursts of "READ messages were dropped ...
> for internal timeout" messages in the logs (on every Cassandra node).
> Duration: 5 - 15 minutes.
>
> During periods of drops there is always a queue of pending ReadStage tasks:
>
> Pool Name            Active  Pending   Completed  Blocked  All time blocked
> ReadStage                32       67  2976548410        0                 0
> CompactionExecutor        2       62      802136        0                 0
>
> The other Active and Pending counters in tpstats are 0.
>
> During the drops iostat shows no read requests to the disks, probably
> because all data fits in the disk cache:
>
> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>           56,53   0,94    39,84     0,01    0,00   2,68
>
> Device:  rrqm/s  wrqm/s   r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
> sda        0,00   11,00  0,00  26,00   0,00   9,09    715,92      0,78  30,31     0,00    30,31   2,46   6,40
> sdb        0,00   11,00  0,00  33,00   0,00  10,57    655,70      0,83  26,00     0,00    26,00   2,00   6,60
> sdc        0,00    1,00  0,00  30,50   0,00  10,98    737,07      0,91  30,49     0,00    30,49   2,10   6,40
> sdd        0,00   31,50  0,00  35,00   0,00  11,17    653,50      0,98  28,17     0,00    28,17   1,83   6,40
> sde        0,00   31,50  0,00  34,50   0,00  10,82    642,10      0,67  19,54     0,00    19,54   1,39   4,80
> sdf        0,00    1,00  0,00  24,50   0,00   9,71    811,78      0,60  24,33     0,00    24,33   1,88   4,60
> sdg        0,00    1,00  0,00  23,00   0,00   8,93    795,15      0,51  22,26     0,00    22,26   1,91   4,40
> sdh        0,00    1,00  0,00  21,50   0,00   8,37    797,05      0,45  21,02     0,00    21,02   1,86   4,00
>
> Disks are SSDs.
>
> Before these drops, the "Local write count" for the problematic table increases
> very fast (10k-30k/sec, while the ordinary write rate is 10-30/sec) for about
> 1 minute. After that the drops start.
>
> I tried using probabilistic tracing to determine which requests cause the
> "write count" to increase, but I see no "batch_mutate" queries at all, only
> reads!
>
> There are no GC warnings about long pauses.
>
> Could you please help troubleshoot the issue?
>
> --
> Best Regards,
> Dmitry Simonov
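By the way, regarding the probabilistic tracing: if batch_mutate never shows up, it may be worth re-enabling tracing for a short window around one of the write bursts and then reading the trace tables directly. A rough sketch, assuming Cassandra 2.2's nodetool and cqlsh (the 1% probability and the LIMIT are just examples):

    # on each node, trace ~1% of requests while a burst is happening
    nodetool settraceprobability 0.01

    # then inspect what was captured, e.g. from cqlsh:
    #   SELECT session_id, request, started_at FROM system_traces.sessions LIMIT 50;
    #   SELECT activity, source, source_elapsed FROM system_traces.events
    #     WHERE session_id = <session_id from the query above>;

    # switch it back off afterwards, since the traces themselves generate writes
    nodetool settraceprobability 0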