Hi there -

Cluster info:
C* 3.9, replicated across 4 EC2 regions (us-east-1, us-west-2, eu-west-1,
ap-southeast-1), c4.4xlarge

Around the same time every day (~7-8am EST), two DCs (eu-west-1 and
ap-southeast-1) in our cluster start experiencing a high number of timeouts
(the Connection.TotalTimeouts metric). The issue seems to affect all nodes
in the impacted DCs equally. I'm trying to track down exactly what is
timing out, and what is causing it.
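
For reference, here is roughly how I'm spot-checking the nodes while the
timeouts are happening (a sketch; the IPs below are placeholders for our
node addresses):

    # Check each node in an affected DC while the timeouts are spiking.
    for host in 10.0.0.1 10.0.0.2; do    # placeholder IPs
        echo "== $host =="
        # Thread pools: GossipStage pending tasks, plus the dropped
        # message counts at the bottom of the output.
        nodetool -h "$host" tpstats
        # Message pools: pending/dropped Gossip, Small and Large messages.
        nodetool -h "$host" netstats
    done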

With debug logging enabled, I can see many messages like this:

DEBUG [GossipTasks:1] 2017-02-16 15:39:42,274 Gossiper.java:337 -
Convicting /xx.xx.xx.xx with status NORMAL - alive false

DEBUG [GossipTasks:1] 2017-02-16 15:39:42,274 Gossiper.java:337 -
Convicting /xx.xx.xx.xx with status removed - alive false

DEBUG [GossipTasks:1] 2017-02-16 15:39:42,274 Gossiper.java:337 -
Convicting /xx.xx.xx.xx with status shutdown - alive false

The node in the 'status removed' line is one I previously removed from the
cluster with `nodetool removenode`, so I'm not sure why it still appears in
gossip. The node mentioned in the 'status NORMAL' line logs constant
warnings like this:

WARN  [GossipTasks:1] 2017-02-16 15:40:02,845 Gossiper.java:771 - Gossip
stage has 453589 pending tasks; skipping status check (no nodes will be
marked down)

These lines go away after restarting that node, and on the original node
the 'Convicting' lines go away as well. However, the timeout counts do not
change. Why would restarting the node fix gossip falling behind?
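
To dig into the gossip side, I've also been dumping gossip state directly
(a sketch; xx.xx.xx.xx is a placeholder for the convicted node's address).
My understanding is that gossip retains state for removed endpoints for a
few days after a removenode, which might explain the 'status removed' line:

    # This node's view of gossip state; the removed endpoint should
    # still appear here until gossip expires it.
    nodetool gossipinfo | grep -A 8 'xx.xx.xx.xx'

    # Watch whether GossipStage pending tasks keep growing, or drain
    # away after the restart.
    watch -n 10 'nodetool tpstats | grep -i gossip'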


There are also a lot of debug log messages like this:

DEBUG [GossipStage:1] 2017-02-16 15:45:04,849 FailureDetector.java:456 -
Ignoring interval time of 2355580769 for /xx.xx.xx.xx

Could these be related to the high number of timeouts I see? If I'm reading
the log right, that interval is in nanoseconds, so 2355580769 is about 2.36
seconds, which I believe is just over the failure detector's default
2-second maximum interval; in other words, gossip heartbeats are arriving
late. I've also tried increasing phi_convict_threshold to 12, as suggested
here:
https://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archDataDistributeFailDetect.html
This has not made any noticeable difference on the nodes where I changed it.
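
For completeness, here is the change as I applied it, plus one more knob
I'm considering (a sketch; the file locations and the service command are
from our setup, and my reading of cassandra.fd_max_interval_ms comes from
FailureDetector.java, so please correct me if that's wrong):

    # cassandra.yaml: raise the phi threshold from its default of 8
    phi_convict_threshold: 12

    # cassandra-env.sh: raise the failure detector's maximum accepted
    # heartbeat interval (2s default, I believe) so ~2.36s intervals
    # aren't ignored:
    JVM_OPTS="$JVM_OPTS -Dcassandra.fd_max_interval_ms=3000"

    # Both require a restart to take effect:
    nodetool drain && sudo service cassandra restart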

I'd appreciate any suggestions on what else to try to track down these
timeouts.

- Mike
