Hi, I am experiencing some weird behaviors after upgrading 2 nodes (out of 13) to C* 3.0.5 (from 2.1.11). Basically, after restarting a second time, there is a small chance that the node will die without outputting anything to the logs (not even dmesg).
This happened on both nodes I upgraded. The only "anomalies" I see in the logs (although not related to the moment a node dies) are:

* Lots of the following messages, for every IP in the cluster, every second:

    DEBUG [GossipStage:1] 2016-05-05 23:52:02,260 FailureDetector.java:456 - Ignoring interval time of 2540341017 for /x.y.b.5
    DEBUG [GossipStage:1] 2016-05-05 23:52:02,260 FailureDetector.java:456 - Ignoring interval time of 2000551507 for /x.y.a.7
    DEBUG [GossipStage:1] 2016-05-05 23:52:02,260 FailureDetector.java:456 - Ignoring interval time of 2000479104 for /x.y.a.3
    DEBUG [GossipStage:1] 2016-05-05 23:52:02,260 FailureDetector.java:456 - Ignoring interval time of 2000471247 for /x.y.b.3
    DEBUG [GossipStage:1] 2016-05-05 23:52:03,259 FailureDetector.java:456 - Ignoring interval time of 2000605748 for /x.y.a.5
    DEBUG [GossipStage:1] 2016-05-05 23:52:03,260 FailureDetector.java:456 - Ignoring interval time of 2000731307 for /x.y.b.6
    DEBUG [GossipStage:1] 2016-05-05 23:52:03,260 FailureDetector.java:456 - Ignoring interval time of 3000404107 for /x.y.b.1

* Some metrics are not being pushed to Graphite (though others do reach the server). Also, every time the node tries to push them, I see the following error in the logs:

    ERROR [metrics-graphite-reporter-1-thread-1] 2016-05-05 23:53:37,770 ScheduledReporter.java:119 - RuntimeException thrown from GraphiteReporter#report. Exception was suppressed.
    java.lang.IllegalStateException: Unable to compute ceiling for max when histogram overflowed
        at org.apache.cassandra.utils.EstimatedHistogram.rawMean(EstimatedHistogram.java:231) ~[apache-cassandra-3.0.5.jar:3.0.5]
        at org.apache.cassandra.metrics.EstimatedHistogramReservoir$HistogramSnapshot.getMean(EstimatedHistogramReservoir.java:103) ~[apache-cassandra-3.0.5.jar:3.0.5]
        at com.codahale.metrics.graphite.GraphiteReporter.reportHistogram(GraphiteReporter.java:252) ~[metrics-graphite-3.1.0.jar:3.1.0]
        at com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:166) ~[metrics-graphite-3.1.0.jar:3.1.0]
        at com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:162) ~[metrics-core-3.1.0.jar:3.1.0]
        at com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:117) ~[metrics-core-3.1.0.jar:3.1.0]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_60]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_60]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_60]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_60]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_60]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_60]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]

Besides these, the logs are clean.

I've opened a ticket (https://issues.apache.org/jira/browse/CASSANDRA-11723), but any help debugging this is more than welcome.

Regards,
Stefano