I'm running a big test: ten nodes with 3 TB of disk each, on 0.7.0rc1. After some tuning help (thanks Tyler), much of this is working as it should. However, a serious event occurred as well: the server froze up, and although mutations were dropped, no error was reported to the client. Here's what the log said on host X.19:

WARN [ScheduledTasks:1] 2010-12-06 14:04:11,125 MessagingService.java (line 527) Dropped 76 MUTATION messages in the last 5000ms

Meanwhile, on the OTHER nodes, gossip decided the node was not available for a while:

INFO [ScheduledTasks:1] 2010-12-06 14:04:02,396 Gossiper.java (line 195) InetAddress /X.19 is now dead.
INFO [GossipStage:1] 2010-12-06 14:04:06,127 Gossiper.java (line 569) InetAddress /X.19 is now UP

And despite the fact that I was writing with consistency=ALL, none of my clients reported any errors on their mutations.

Tyler has this information, but I would like to know if anyone has seen this before, and/or has a diagnosis.