I have a small, two-node cluster running Cassandra 2.2.1. I am seeing a lot of these messages in both logs:
WARN 07:23:16 Not marking nodes down due to local pause of 7219277694 > 5000000000

I am fairly certain that they are not due to GC. I am not seeing a whole lot of GC being logged, and nothing over 500 ms. I do think it is I/O related. I am seeing lots of read timeouts for queries to a table that has a large and growing number of SSTables. At last count there were over 1800 SSTables on one node. The count is lower on the other node, and I suspect that this is due to data distribution. Slowly but surely the number of SSTables keeps going up, and not surprisingly nodetool tablehistograms reports high latencies. The table is using STCS. I am seeing some, but not a whole lot of, dropped mutations. nodetool tpstats looks ok.

The growing number of SSTables really makes me think this is an I/O issue. Cassandra is running in a Kubernetes cluster using a SAN, which is another reason I suspect I/O.

What are some things I can look at or test to determine what is causing all of the local pauses?

- John
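
P.S. If I am reading the warning correctly, both numbers are nanoseconds, which would make the pause above roughly 7.2 seconds against a 5 second threshold. Below is a minimal sketch, assuming that unit and the exact message format quoted above, of how the pause durations could be pulled out of a log so they can be lined up against other events. The function name local_pauses_seconds is only illustrative.

```python
import re

# Matches the warning quoted above. Both captured numbers are ASSUMED to be nanoseconds.
PAUSE_RE = re.compile(r"Not marking nodes down due to local pause of (\d+) > (\d+)")

def local_pauses_seconds(lines):
    """Return a list of (pause_seconds, threshold_seconds), one per matching line."""
    pauses = []
    for line in lines:
        match = PAUSE_RE.search(line)
        if match:
            pause_ns, threshold_ns = int(match.group(1)), int(match.group(2))
            pauses.append((pause_ns / 1e9, threshold_ns / 1e9))
    return pauses

# The single warning from this message, used as sample input.
sample = [
    "WARN 07:23:16 Not marking nodes down due to local pause of 7219277694 > 5000000000",
]

for pause_s, threshold_s in local_pauses_seconds(sample):
    print(f"pause {pause_s:.2f}s (threshold {threshold_s:.2f}s)")
```

Feeding it the lines of a real log file instead of the sample list should work the same way, as long as the message text matches.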