Hi. We are facing a strange problem: our Cassandra nodes lose each other on only one day of the week, Friday, at exactly 14:50, and this has been happening for several months.
At that time each node periodically reports that other nodes are dead, although the nodes themselves are working fine. This continues for about one hour, after which the cluster stabilizes. CPU load is low. Here are several snippets of the log file from one node:

TRACE [GossipTasks:1] 2011-12-02 15:12:51,829 FailureDetector.java (line 149) PHI for /192.168.68.228 : 38.154333610365036
INFO [GossipTasks:1] 2011-12-02 15:12:51,829 Gossiper.java (line 229) InetAddress /192.168.68.228 is now dead.
...
DEBUG [NonPeriodicTasks:1] 2011-12-02 15:12:51,845 ColumnFamilyStore.java (line 819) forceFlush requested but everything is clean
INFO [ScheduledTasks:1] 2011-12-02 15:12:51,829 StatusLogger.java (line 66) ReadRepairStage 0 0 0
TRACE [GossipTasks:1] 2011-12-02 15:12:51,829 FailureDetector.java (line 149) PHI for /192.168.68.227 : -0.0
DEBUG [NonPeriodicTasks:1] 2011-12-02 15:12:51,845 ColumnFamilyStore.java (line 819) forceFlush requested but everything is clean
TRACE [GossipStage:1] 2011-12-02 15:12:51,845 FailureDetector.java (line 128) reporting /192.168.68.229
DEBUG [NonPeriodicTasks:1] 2011-12-02 15:12:51,845 ColumnFamilyStore.java (line 819) forceFlush requested but everything is clean
TRACE [GossipTasks:1] 2011-12-02 15:12:51,845 FailureDetector.java (line 149) PHI for /192.168.68.224 : 0.019569070233147485
INFO [ScheduledTasks:1] 2011-12-02 15:12:51,845 StatusLogger.java (line 66) MutationStage 0 0 0
TRACE [GossipTasks:1] 2011-12-02 15:12:51,845 FailureDetector.java (line 149) PHI for /192.168.68.226 : 37.966339304199074
DEBUG [NonPeriodicTasks:1] 2011-12-02 15:12:51,845 ColumnFamilyStore.java (line 819) forceFlush requested but everything is clean
TRACE [GossipStage:1] 2011-12-02 15:12:51,845 FailureDetector.java (line 128) reporting /192.168.68.228
DEBUG [NonPeriodicTasks:1] 2011-12-02 15:12:51,845 ColumnFamilyStore.java (line 819) forceFlush requested but everything is clean
INFO [GossipTasks:1] 2011-12-02 15:12:51,845 Gossiper.java (line 229) InetAddress /192.168.68.226 is now dead.
...
TRACE [GossipTasks:1] 2011-12-02 15:13:03,898 FailureDetector.java (line 149) PHI for /192.168.68.228 : 7.7043961801903045
TRACE [GossipTasks:1] 2011-12-02 15:13:03,898 FailureDetector.java (line 149) PHI for /192.168.68.223 : 7.585990557120916
TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI for /192.168.68.227 : 7.922553972766636
TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI for /192.168.68.224 : 7.798568512691048
TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI for /192.168.68.226 : 7.8425064901177715
TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI for /192.168.68.225 : 4.592224429445155
TRACE [GossipTasks:1] 2011-12-02 15:13:03,900 FailureDetector.java (line 149) PHI for /192.168.68.222 : 8.06856164053645
INFO [GossipTasks:1] 2011-12-02 15:13:03,900 Gossiper.java (line 229) InetAddress /192.168.68.222 is now dead.
DEBUG [GossipTasks:1] 2011-12-02 15:13:03,900 MessagingService.java (line 153) Resetting pool for /192.168.68.222
TRACE [GossipTasks:1] 2011-12-02 15:13:03,901 FailureDetector.java (line 149) PHI for /192.168.68.229 : 7.645354417332889
TRACE [GossipTasks:1] 2011-12-02 15:13:03,901 FailureDetector.java (line 149) PHI for /192.168.68.230 : 7.775610031554557
TRACE [GossipTasks:1] 2011-12-02 15:13:04,903 Gossiper.java (line 307) Gossip Digests are : /192.168.68.221:1322136327:682506 /192.168.68.223:1322116132:702923 /192.168.68.222:1322116089:702938 /192.168.68.228:1322116156:702981 /192.168.68.225:1322817130:31 /192.168.68.230:1322116110:702870 /192.168.68.226:1322116095:702557 /192.168.68.221:1322136327:682506 /192.168.68.224:1322116106:702922 /192.168.68.227:1322116098:702974 /192.168.68.229:1322116107:702950
TRACE [GossipTasks:1] 2011-12-02 15:13:04,903 Gossiper.java (line 360) Sending a GossipDigestSynMessage to /192.168.68.224 ...
TRACE [GossipTasks:1] 2011-12-02 15:13:04,903 Gossiper.java (line 360) Sending a GossipDigestSynMessage to /192.168.68.228 ...
TRACE [GossipTasks:1] 2011-12-02 15:13:04,903 Gossiper.java (line 101) Performing status check ...
TRACE [GossipTasks:1] 2011-12-02 15:13:04,904 FailureDetector.java (line 149) PHI for /192.168.68.228 : 8.350335221549706
TRACE [GossipTasks:1] 2011-12-02 15:13:04,904 FailureDetector.java (line 149) PHI for /192.168.68.223 : 8.222055442973863
INFO [GossipTasks:1] 2011-12-02 15:13:04,904 Gossiper.java (line 229) InetAddress /192.168.68.223 is now dead.

The picture is the same on the other nodes.

Cassandra version: 7.8
OS: Windows Server 2008 R2
Cluster size: 10 nodes
Replication factor: 5

Best regards,
Konstantin Chernyakov
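
For anyone who wants to compare logs from several nodes, the snippets above could be summarized with a small script. This is only a rough sketch: it assumes the exact line shapes shown in this message ("PHI for /<ip> : <value>" and "InetAddress /<ip> is now dead"), and the function name `summarize` is made up for illustration. It reports the highest PHI seen per endpoint and the time of each "is now dead" event.

```python
import re
from collections import defaultdict

# Both patterns are assumptions based on the log lines quoted in this message.
PHI_RE = re.compile(
    r"PHI for /(\d+\.\d+\.\d+\.\d+) : (-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)
DEAD_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ Gossiper\.java \(line \d+\) "
    r"InetAddress /(\d+\.\d+\.\d+\.\d+) is now dead"
)


def summarize(text):
    """Return (highest PHI per endpoint, list of (timestamp, endpoint) dead events)."""
    max_phi = defaultdict(float)
    for ip, value in PHI_RE.findall(text):
        # Keep the largest PHI value reported for each endpoint.
        max_phi[ip] = max(max_phi[ip], float(value))
    dead_events = DEAD_RE.findall(text)
    return dict(max_phi), dead_events


if __name__ == "__main__":
    sample = (
        "TRACE [GossipTasks:1] 2011-12-02 15:13:03,900 FailureDetector.java "
        "(line 149) PHI for /192.168.68.222 : 8.06856164053645\n"
        "INFO [GossipTasks:1] 2011-12-02 15:13:03,900 Gossiper.java "
        "(line 229) InetAddress /192.168.68.222 is now dead.\n"
    )
    phi, dead = summarize(sample)
    for ip, value in sorted(phi.items()):
        print(ip, value)
    for timestamp, ip in dead:
        print(timestamp, ip, "marked dead")
```

Because the patterns are searched across the whole text, the script works whether the log entries are on separate lines or run together.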