Hello! We are experiencing problems with Cassandra 2.2.8. There is a cluster with 3 nodes. The problematic keyspace has RF=3 and contains 3 tables (current table sizes: 1 GB, 700 MB, 12 KB).
Several times per day there are bursts of "READ messages were dropped ... for internal timeout" messages in the logs (on every Cassandra node). Each burst lasts 5 - 15 minutes.

During the periods of drops there is always a queue of pending ReadStage tasks:

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                        32        67     2976548410         0                 0
CompactionExecutor                2        62         802136         0                 0

All other Active and Pending counters in tpstats are 0.

During the drops iostat shows no read requests to the disks, probably because all the data fits in the disk cache:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          56,53    0,94   39,84    0,01    0,00    2,68

Device:   rrqm/s  wrqm/s     r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda         0,00   11,00    0,00   26,00    0,00    9,09   715,92     0,78   30,31    0,00   30,31   2,46   6,40
sdb         0,00   11,00    0,00   33,00    0,00   10,57   655,70     0,83   26,00    0,00   26,00   2,00   6,60
sdc         0,00    1,00    0,00   30,50    0,00   10,98   737,07     0,91   30,49    0,00   30,49   2,10   6,40
sdd         0,00   31,50    0,00   35,00    0,00   11,17   653,50     0,98   28,17    0,00   28,17   1,83   6,40
sde         0,00   31,50    0,00   34,50    0,00   10,82   642,10     0,67   19,54    0,00   19,54   1,39   4,80
sdf         0,00    1,00    0,00   24,50    0,00    9,71   811,78     0,60   24,33    0,00   24,33   1,88   4,60
sdg         0,00    1,00    0,00   23,00    0,00    8,93   795,15     0,51   22,26    0,00   22,26   1,91   4,40
sdh         0,00    1,00    0,00   21,50    0,00    8,37   797,05     0,45   21,02    0,00   21,02   1,86   4,00

The disks are SSDs.

Before the drops, "Local write count" for the problematic table increases very fast (10k-30k/sec, while the ordinary write rate is 10-30/sec) for 1 minute. After that the drops start.

We tried using probabilistic tracing to determine which requests cause the write count to increase, but saw no "batch_mutate" queries at all, only reads!

There are no GC warnings about long pauses.

Could you please help with troubleshooting the issue?

--
Best Regards,
Dmitry Simonov