Have you ruled out EBS snapshot initialization issues (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-initialize.html)?
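
For context: a volume created from a snapshot pulls each block down lazily the first time it is read, so first-touch reads can be much slower until the whole volume has been touched. The linked page suggests reading every block once (it uses dd or fio) before putting the volume into service. Below is a rough Python sketch of the same idea, not a tested procedure. The function name `initialize_volume` and the device path in the usage note are made up for illustration, and this simple version reads sequentially in a single thread.

```python
def initialize_volume(device_path, block_size=1024 * 1024):
    """Read device_path from start to end once; return the number of bytes read.

    Hypothetical helper: it only reads, never writes, so it should be safe
    to run against a volume that already holds data.
    """
    total = 0
    # buffering=0 opens the file unbuffered, so each read() maps to one
    # read request of at most block_size bytes against the device.
    with open(device_path, "rb", buffering=0) as dev:
        while True:
            chunk = dev.read(block_size)
            if not chunk:  # empty bytes object means end of device/file
                break
            total += len(chunk)
    return total
```

Usage would be something like `initialize_volume("/dev/xvdf")` run as root, where `/dev/xvdf` is only a placeholder for whatever device backs /data on your node. If p999 read latency on the replacement node is back to normal once the full read has finished, that points at snapshot initialization rather than Cassandra itself.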
On Tue, Mar 27, 2018 at 2:24 PM, Mike Torra <mto...@salesforce.com> wrote:
> Hi There -
>
> I have noticed an issue where I consistently see high p999 read latency on
> a node for a few hours after replacing the node. Before replacing the node,
> the p999 read latency is ~30ms, but after it increases to 1-5s. I am
> running C* 3.11.2 in EC2.
>
> I am testing out using EBS snapshots of the /data disk as a backup, so
> that I can replace nodes without having to fully bootstrap the replacement.
> This seems to work ok, except for the latency issue. Some things I have
> noticed:
>
> - `nodetool netstats` doesn't show any 'Completed' Large Messages, only
> 'Dropped', while this is going on. There are only a few of these.
> - the logs show warnings like this:
>
> WARN [PERIODIC-COMMIT-LOG-SYNCER] 2018-03-27 18:57:15,655
> NoSpamLogger.java:94 - Out of 84 commit log syncs over the past 297.28s
> with average duration of 235.88ms, 86 have exceeded the configured commit
> interval by an average of 113.66ms
> and I can see some slow queries in debug.log, but I can't figure out
> what is causing it
> - gc seems normal
>
> Could this have something to do with starting the node with the EBS
> snapshot of the /data directory? My first thought was that this is related
> to the EBS volumes, but it seems too consistent to be actually caused by
> that. The problem is consistent across multiple replacements, and multiple
> EC2 regions.
>
> I appreciate any suggestions!
>
> - Mike