Hi There -

I have noticed that a node consistently shows high p999 read latency for a
few hours after I replace it. Before the replacement, the p999 read latency
is ~30ms, but afterwards it increases to 1-5s. I am running C* 3.11.2 in
EC2.

I am testing out using EBS snapshots of the /data disk as a backup, so that
I can replace nodes without having to fully bootstrap the replacement. This
seems to work ok, except for the latency issue. Some things I have noticed:

- `nodetool netstats` doesn't show any 'Completed' Large Messages, only
'Dropped', while this is going on. There are only a few dropped messages.
- the logs show warnings like this:

WARN  [PERIODIC-COMMIT-LOG-SYNCER] 2018-03-27 18:57:15,655
NoSpamLogger.java:94 - Out of 84 commit log syncs over the past 297.28s
with average duration of 235.88ms, 86 have exceeded the configured commit
interval by an average of 113.66ms
  and I can see some slow queries in debug.log, but I can't figure out what
is causing them
- gc seems normal
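
To track these commit log warnings over time, the numbers can be pulled out
of each log line with a small script. This is only a rough sketch: it
assumes the wording matches the sample warning above (other versions may
phrase it differently), and `parse_sync_warning` is a made-up helper name,
not anything that ships with Cassandra.

```python
import re

# Pattern follows the wording of the sample warning above (assumption:
# other Cassandra versions may word this message differently).
SYNC_WARNING = re.compile(
    r"Out of (?P<syncs>\d+) commit log syncs over the past (?P<window_s>[\d.]+)s "
    r"with average duration of (?P<avg_ms>[\d.]+)ms, "
    r"(?P<exceeded>\d+) have exceeded the configured commit interval "
    r"by an average of (?P<over_ms>[\d.]+)ms"
)


def parse_sync_warning(line):
    """Return the numeric fields of a commit log sync warning, or None."""
    # Collapse runs of whitespace so a hard-wrapped copy still matches.
    match = SYNC_WARNING.search(" ".join(line.split()))
    if match is None:
        return None
    return {
        "syncs": int(match.group("syncs")),
        "window_s": float(match.group("window_s")),
        "avg_ms": float(match.group("avg_ms")),
        "exceeded": int(match.group("exceeded")),
        "over_ms": float(match.group("over_ms")),
    }
```

Feeding it every line of system.log and keeping the non-None results gives
a series of average sync durations that can be lined up against the latency
graphs.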

Could this have something to do with starting the node with the EBS
snapshot of the /data directory? My first thought was that this is related
to the EBS volumes, but it seems too consistent to be actually caused by
that. The problem is consistent across multiple replacements, and multiple
EC2 regions.
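
One way to check the EBS theory would be to time raw reads of files under
/data on a freshly replaced node and compare them with the same reads on a
node that has been up for a while. The sketch below is a hypothetical
helper, not a Cassandra or AWS tool: `chunk_read_latencies` and `percentile`
are names invented here, and the 1 MiB chunk size is an arbitrary choice.
Reads that are already in the OS page cache will look fast regardless of the
disk, so only the first pass over a file says anything about the volume.

```python
import math
import time


def chunk_read_latencies(path, chunk_size=1024 * 1024):
    """Read `path` sequentially; return the time of each chunk read in seconds."""
    latencies = []
    # buffering=0 disables Python-level buffering; the OS page cache still applies.
    with open(path, "rb", buffering=0) as f:
        while True:
            start = time.perf_counter()
            chunk = f.read(chunk_size)
            elapsed = time.perf_counter() - start
            if not chunk:
                break
            latencies.append(elapsed)
    return latencies


def percentile(values, fraction):
    """Nearest-rank percentile, e.g. fraction=0.999 for p999."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    index = max(0, math.ceil(fraction * len(ordered)) - 1)
    return ordered[index]
```

For example, `percentile(chunk_read_latencies(some_sstable_path), 0.999)`
on both nodes, where `some_sstable_path` stands for whichever data file is
being compared.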

I appreciate any suggestions!

- Mike
