Hello! We have run into the following problem several times.
Our Cassandra cluster (5 nodes) becomes unresponsive for ~30 minutes:

- all CPUs are at 100% load (normally the load average is 5 on a 16-core machine)
- Cassandra's thread count rises from 300 to 1300 - 2000; most of them are Thrift threads in the java.net.SocketInputStream.socketRead0(Native Method) method, and the count of other threads doesn't increase
- some Read messages are dropped
- read latency (p99.9) increases to 20-30 seconds
- there are up to 32 active Read Tasks and up to 3k - 6k pending Read Tasks

The problem starts simultaneously on all nodes of the cluster.

I cannot tie this problem to increased load from clients (the "read rate" doesn't increase during the problem). It also looks like there is no problem with the disks (I/O latencies are OK).

Could anybody please give some advice on further troubleshooting?

-- 
Best Regards,
Dmitry Simonov