Konstantin,

Have you checked the weekly cron jobs on the servers (scheduled tasks, on Windows), or looked at the system logs around those times, to see what the servers are doing? I doubt Cassandra has any time-sensitive code in it that would kill off connections at 14:50 on Fridays, so my guess is that something on the host is causing the problem.
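If it helps to line the events up against whatever the hosts run on a schedule, here is a rough sketch that counts the "is now dead" lines per weekday and minute. It assumes the log format shown in your snippets, with each entry on a single line in the real log file. The function names are my own, not anything from Cassandra.

```python
import re
from collections import Counter
from datetime import datetime

# Matches entries shaped like the ones in the quoted log, e.g.:
# INFO [GossipTasks:1] 2011-12-02 15:12:51,829 Gossiper.java (line 229) InetAddress /192.168.68.228 is now dead.
DEAD_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} Gossiper\.java .*?"
    r"InetAddress /(\S+) is now dead"
)

# Fixed names so the output does not depend on the machine's locale.
WEEKDAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")


def dead_events(lines):
    """Yield (timestamp, peer address) for every 'is now dead' entry."""
    for line in lines:
        match = DEAD_RE.search(line)
        if match:
            when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            yield when, match.group(2)


def bucket_by_weekday_and_minute(lines):
    """Count 'is now dead' entries per (weekday, HH:MM) bucket."""
    counts = Counter()
    for when, _peer in dead_events(lines):
        counts[(WEEKDAYS[when.weekday()], when.strftime("%H:%M"))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "INFO [GossipTasks:1] 2011-12-02 15:12:51,829 Gossiper.java (line 229) "
        "InetAddress /192.168.68.228 is now dead.",
        "INFO [GossipTasks:1] 2011-12-02 15:13:03,900 Gossiper.java (line 229) "
        "InetAddress /192.168.68.222 is now dead.",
    ]
    for bucket, count in sorted(bucket_by_weekday_and_minute(sample).items()):
        print(bucket, count)
```

Running it over several weeks of logs from each node should show whether the buckets really cluster on one weekday and time, which you can then compare with the scheduled jobs and system event logs on the same hosts.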
-R

On Mon, Dec 5, 2011 at 6:08 AM, Konstantin Chernyakov <kossof...@gmail.com> wrote:
> Hi.
>
> We are faced with strange problem where Cassandra nodes lose each other only one day of week, on friday, in exactly 14:50 PM, within several months.
> On that time each node periodically reports that other nodes are dead.
> At same time nodes are working fine.
> This continues about one hour, after that cluster stabilizes.
> Low CPU load.
>
> There are several snippets of log file from one node:
>
> TRACE [GossipTasks:1] 2011-12-02 15:12:51,829 FailureDetector.java (line 149) PHI for /192.168.68.228 : 38.154333610365036
> INFO [GossipTasks:1] 2011-12-02 15:12:51,829 Gossiper.java (line 229) InetAddress /192.168.68.228 is now dead.
>
> ...
>
> DEBUG [NonPeriodicTasks:1] 2011-12-02 15:12:51,845 ColumnFamilyStore.java (line 819) forceFlush requested but everything is clean
> INFO [ScheduledTasks:1] 2011-12-02 15:12:51,829 StatusLogger.java (line 66) ReadRepairStage 0 0 0
> TRACE [GossipTasks:1] 2011-12-02 15:12:51,829 FailureDetector.java (line 149) PHI for /192.168.68.227 : -0.0
> DEBUG [NonPeriodicTasks:1] 2011-12-02 15:12:51,845 ColumnFamilyStore.java (line 819) forceFlush requested but everything is clean
> TRACE [GossipStage:1] 2011-12-02 15:12:51,845 FailureDetector.java (line 128) reporting /192.168.68.229
> DEBUG [NonPeriodicTasks:1] 2011-12-02 15:12:51,845 ColumnFamilyStore.java (line 819) forceFlush requested but everything is clean
> TRACE [GossipTasks:1] 2011-12-02 15:12:51,845 FailureDetector.java (line 149) PHI for /192.168.68.224 : 0.019569070233147485
> INFO [ScheduledTasks:1] 2011-12-02 15:12:51,845 StatusLogger.java (line 66) MutationStage 0 0 0
> TRACE [GossipTasks:1] 2011-12-02 15:12:51,845 FailureDetector.java (line 149) PHI for /192.168.68.226 : 37.966339304199074
> DEBUG [NonPeriodicTasks:1] 2011-12-02 15:12:51,845 ColumnFamilyStore.java (line 819) forceFlush requested but everything is clean
> TRACE [GossipStage:1] 2011-12-02 15:12:51,845 FailureDetector.java (line 128) reporting /192.168.68.228
> DEBUG [NonPeriodicTasks:1] 2011-12-02 15:12:51,845 ColumnFamilyStore.java (line 819) forceFlush requested but everything is clean
> INFO [GossipTasks:1] 2011-12-02 15:12:51,845 Gossiper.java (line 229) InetAddress /192.168.68.226 is now dead.
>
> ...
>
> TRACE [GossipTasks:1] 2011-12-02 15:13:03,898 FailureDetector.java (line 149) PHI for /192.168.68.228 : 7.7043961801903045
> TRACE [GossipTasks:1] 2011-12-02 15:13:03,898 FailureDetector.java (line 149) PHI for /192.168.68.223 : 7.585990557120916
> TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI for /192.168.68.227 : 7.922553972766636
> TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI for /192.168.68.224 : 7.798568512691048
> TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI for /192.168.68.226 : 7.8425064901177715
> TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI for /192.168.68.225 : 4.592224429445155
> TRACE [GossipTasks:1] 2011-12-02 15:13:03,900 FailureDetector.java (line 149) PHI for /192.168.68.222 : 8.06856164053645
> INFO [GossipTasks:1] 2011-12-02 15:13:03,900 Gossiper.java (line 229) InetAddress /192.168.68.222 is now dead.
> DEBUG [GossipTasks:1] 2011-12-02 15:13:03,900 MessagingService.java (line 153) Resetting pool for /192.168.68.222
> TRACE [GossipTasks:1] 2011-12-02 15:13:03,901 FailureDetector.java (line 149) PHI for /192.168.68.229 : 7.645354417332889
> TRACE [GossipTasks:1] 2011-12-02 15:13:03,901 FailureDetector.java (line 149) PHI for /192.168.68.230 : 7.775610031554557
>
> ...
>
> TRACE [GossipTasks:1] 2011-12-02 15:13:03,898 FailureDetector.java (line 149) PHI for /192.168.68.228 : 7.7043961801903045
> TRACE [GossipTasks:1] 2011-12-02 15:13:03,898 FailureDetector.java (line 149) PHI for /192.168.68.223 : 7.585990557120916
> TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI for /192.168.68.227 : 7.922553972766636
> TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI for /192.168.68.224 : 7.798568512691048
> TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI for /192.168.68.226 : 7.8425064901177715
> TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI for /192.168.68.225 : 4.592224429445155
> TRACE [GossipTasks:1] 2011-12-02 15:13:03,900 FailureDetector.java (line 149) PHI for /192.168.68.222 : 8.06856164053645
> INFO [GossipTasks:1] 2011-12-02 15:13:03,900 Gossiper.java (line 229) InetAddress /192.168.68.222 is now dead.
> DEBUG [GossipTasks:1] 2011-12-02 15:13:03,900 MessagingService.java (line 153) Resetting pool for /192.168.68.222
> TRACE [GossipTasks:1] 2011-12-02 15:13:03,901 FailureDetector.java (line 149) PHI for /192.168.68.229 : 7.645354417332889
> TRACE [GossipTasks:1] 2011-12-02 15:13:03,901 FailureDetector.java (line 149) PHI for /192.168.68.230 : 7.775610031554557
> TRACE [GossipTasks:1] 2011-12-02 15:13:04,903 Gossiper.java (line 307) Gossip Digests are : /192.168.68.221:1322136327:682506 /192.168.68.223:1322116132:702923 /192.168.68.222:1322116089:702938 /192.168.68.228:1322116156:702981 /192.168.68.225:1322817130:31 /192.168.68.230:1322116110:702870 /192.168.68.226:1322116095:702557 /192.168.68.221:1322136327:682506 /192.168.68.224:1322116106:702922 /192.168.68.227:1322116098:702974 /192.168.68.229:1322116107:702950
> TRACE [GossipTasks:1] 2011-12-02 15:13:04,903 Gossiper.java (line 360) Sending a GossipDigestSynMessage to /192.168.68.224 ...
> TRACE [GossipTasks:1] 2011-12-02 15:13:04,903 Gossiper.java (line 360) Sending a GossipDigestSynMessage to /192.168.68.228 ...
> TRACE [GossipTasks:1] 2011-12-02 15:13:04,903 Gossiper.java (line 101) Performing status check ...
> TRACE [GossipTasks:1] 2011-12-02 15:13:04,904 FailureDetector.java (line 149) PHI for /192.168.68.228 : 8.350335221549706
> TRACE [GossipTasks:1] 2011-12-02 15:13:04,904 FailureDetector.java (line 149) PHI for /192.168.68.223 : 8.222055442973863
> INFO [GossipTasks:1] 2011-12-02 15:13:04,904 Gossiper.java (line 229) InetAddress /192.168.68.223 is now dead.
>
> The same picture on other nodes.
>
> Cassandra version 7.8.
> OS Windows server 2008R2.
> Cluster size 10 nodes.
> Replication factor 5.
>
> Best regards,
> Konstantin Chernyakov.
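
In the quoted excerpt, the peers marked dead at 15:13 had PHI values just above 8 (8.07 and 8.22), while peers at 7.6 to 7.9 were left alone. That is consistent with a conviction threshold of 8, which I believe is the default for `phi_convict_threshold`, but treat that number as an assumption and check your own configuration. A minimal sketch for pulling the PHI samples out of a log and listing which peers crossed the threshold at each second (the function names are hypothetical, and the log format is assumed to match the quoted lines):

```python
import re
from collections import defaultdict

# Assumption: 8 is the conviction threshold in use; adjust to match your config.
PHI_THRESHOLD = 8.0

# Matches entries shaped like the ones in the quoted log, e.g.:
# TRACE [GossipTasks:1] 2011-12-02 15:13:03,900 FailureDetector.java (line 149) PHI for /192.168.68.222 : 8.06856164053645
PHI_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} FailureDetector\.java .*?"
    r"PHI for /(\S+) : (-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)


def phi_samples(lines):
    """Yield (timestamp to the second, peer address, phi) per PHI entry."""
    for line in lines:
        match = PHI_RE.search(line)
        if match:
            yield match.group(1), match.group(2), float(match.group(3))


def peers_over_threshold(lines, threshold=PHI_THRESHOLD):
    """Map each timestamp to the (peer, phi) pairs whose phi exceeded threshold."""
    over = defaultdict(list)
    for when, peer, phi in phi_samples(lines):
        if phi > threshold:
            over[when].append((peer, phi))
    return dict(over)
```

If most peers climb toward the threshold in the same second, as they do at 15:13:03 in the excerpt, that would seem to point more toward the reporting node itself stalling than toward every other node failing at once. That is only an inference from these few lines, not something the logs prove.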