Hi there,

I have noticed an issue where I consistently see high p999 read latency on a node for a few hours after replacing it. Before the replacement, the p999 read latency is ~30ms, but afterwards it increases to 1-5s. I am running C* 3.11.2 in EC2.
I am testing out using EBS snapshots of the /data disk as a backup, so that I can replace nodes without having to fully bootstrap the replacement. This seems to work ok, except for the latency issue.

Some things I have noticed:

- `nodetool netstats` doesn't show any 'Completed' Large Messages, only 'Dropped', while this is going on. There are only a few of these.
- the logs show warnings like this:

  WARN [PERIODIC-COMMIT-LOG-SYNCER] 2018-03-27 18:57:15,655 NoSpamLogger.java:94 - Out of 84 commit log syncs over the past 297.28s with average duration of 235.88ms, 86 have exceeded the configured commit interval by an average of 113.66ms

  and I can see some slow queries in debug.log, but I can't figure out what is causing it
- gc seems normal

Could this have something to do with starting the node with the EBS snapshot of the /data directory? My first thought was that this is related to the EBS volumes, but it seems too consistent to be actually caused by that. The problem is consistent across multiple replacements, and multiple EC2 regions.

I appreciate any suggestions!

- Mike