Hi there,

Cluster info: C* 3.9, replicated across 4 EC2 regions (us-east-1, us-west-2, eu-west-1, ap-southeast-1), c4.4xlarge instances.

Around the same time every day (~7-8am EST), two DCs in our cluster (eu-west-1 and ap-southeast-1) start experiencing a high number of timeouts (Connection.TotalTimeouts metric). The issue seems to occur equally on all nodes in the impacted DCs. I'm trying to track down exactly what is timing out and what is causing it.

With debug logs, I can see many messages like this:

DEBUG [GossipTasks:1] 2017-02-16 15:39:42,274 Gossiper.java:337 - Convicting /xx.xx.xx.xx with status NORMAL - alive false
DEBUG [GossipTasks:1] 2017-02-16 15:39:42,274 Gossiper.java:337 - Convicting /xx.xx.xx.xx with status removed - alive false
DEBUG [GossipTasks:1] 2017-02-16 15:39:42,274 Gossiper.java:337 - Convicting /xx.xx.xx.xx with status shutdown - alive false

The 'status removed' node is one I `nodetool remove`'d from the cluster, so I'm not sure why it still appears.

The node mentioned in the 'status NORMAL' line has constant warnings like this:

WARN [GossipTasks:1] 2017-02-16 15:40:02,845 Gossiper.java:771 - Gossip stage has 453589 pending tasks; skipping status check (no nodes will be marked down)

These warnings go away after restarting that node, and the 'Convicting' lines on the original node go away as well. However, the timeout counts do not seem to change. Why does restarting the node seem to fix gossip falling behind?

There are also a lot of debug log messages like this:

DEBUG [GossipStage:1] 2017-02-16 15:45:04,849 FailureDetector.java:456 - Ignoring interval time of 2355580769 for /xx.xx.xx.xx

Could these be related to the high number of timeouts I see?

I've also tried increasing the value of phi_convict_threshold to 12, as suggested here:
https://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archDataDistributeFailDetect.html
This does not seem to have changed anything on the nodes where I changed it.

I appreciate any suggestions on what else to try in order to track down these timeouts.

- Mike
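
To line the gossip messages up against the daily timeout window, the three kinds of log lines quoted above could be tallied per minute. The following is only a minimal sketch: it assumes the log layout matches the excerpts exactly, the function names (`classify`, `tally`, `interval_seconds`) are hypothetical helpers, and it assumes the "interval time" value is reported in nanoseconds.

```python
import re
from collections import Counter

# Layout assumed from the excerpts above:
# LEVEL [thread] YYYY-MM-DD HH:MM:SS,mmm File.java:line - message
LINE_RE = re.compile(
    r"^(?P<level>[A-Z]+)\s+\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<minute>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d{3}\s+"
    r"(?P<source>\S+)\s+-\s+(?P<msg>.*)$"
)
INTERVAL_RE = re.compile(r"Ignoring interval time of (\d+) for")
PENDING_RE = re.compile(r"Gossip stage has (\d+) pending tasks")


def classify(msg):
    """Map a log message to one of the three kinds seen above, else None."""
    if msg.startswith("Convicting"):
        return "convicting"
    if INTERVAL_RE.search(msg):
        return "ignored_interval"
    if PENDING_RE.search(msg):
        return "gossip_backlog"
    return None


def tally(lines):
    """Count matching messages per (minute, kind)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        kind = classify(m.group("msg"))
        if kind:
            counts[(m.group("minute"), kind)] += 1
    return counts


def interval_seconds(msg):
    """Convert an ignored interval to seconds (assumes nanoseconds)."""
    m = INTERVAL_RE.search(msg)
    return int(m.group(1)) / 1e9 if m else None
```

Running `tally` over each node's debug log and comparing the per-minute counts with the Connection.TotalTimeouts graph would show whether the gossip backlog begins before, during, or after the timeouts. Under the nanosecond assumption, the interval in the excerpt above is roughly 2.36 seconds.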