More weirdness with my four-or-five-node cluster of 0.7 beta3. Having brought up all five nodes, including the one that didn't loadbalance correctly, I tried loadbalancing it again. (This is under completely idle conditions: no external reads or writes.) The result is a cluster where each node thinks it's the only one that's Up. (Or so they all report when queried with "nodetool ring".) It's been 15 minutes and no node will talk to any other node at all. I already tried restarting them all, and it happened again.
Some more detail: This is a four-node cluster where hosts X.22 through X.19 have been up and running, accepting a lot of data, over several days. Their loads are all about 500GB now. (Actually their data disks are more than 50% full, which is why I'm trying to add four more nodes, one at a time.) I brought up X.18, which correctly gave itself a good token, but didn't stream itself any data. So I figured I'd kick off the streaming process with a "loadbalance" command. I ran "nodetool -h X.18 loadbalance", which kind of worked; it got as far as 'waiting 90s for load information' in its log. But this operation also seems to have stopped gossip altogether: the ring output indicated that only X.22 was up, and that X.21, X.20, and X.19 had taken themselves down (or at least out of gossip). When I looked at the log for X.21, I found the below. The first line looks normal; host X.18 had taken itself down for loadbalancing, after all. But then this node also decided that everyone else was dead. (As did all the other nodes, about all the others.)
INFO [GossipStage:1] 2010-11-06 16:39:45,109 HintedHandOffManager.java (line 151) Deleting any stored hints for /X.18
INFO [GossipStage:1] 2010-11-06 16:39:45,116 ColumnFamilyStore.java (line 631) switching in a fresh Memtable for HintsColumnFamily at CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1289078947182.log', position=888)
INFO [GossipStage:1] 2010-11-06 16:39:45,117 ColumnFamilyStore.java (line 930) Enqueuing flush of memtable-hintscolumnfam...@1878733456(0 bytes, 0 operations)
INFO [FlushWriter:1] 2010-11-06 16:39:45,118 Memtable.java (line 154) Writing memtable-hintscolumnfam...@1878733456(0 bytes, 0 operations)
INFO [ScheduledTasks:1] 2010-11-06 16:39:46,080 GCInspector.java (line 133) GC for ParNew: 320 ms, 151307960 reclaimed leaving 9401921352 used; max is 34557919232
INFO [FlushWriter:1] 2010-11-06 16:39:46,242 Memtable.java (line 161) Completed flushing /var/lib/cassandra/data/system/HintsColumnFamily-e-76-Data.db
INFO [ScheduledTasks:1] 2010-11-06 16:39:53,921 Gossiper.java (line 195) InetAddress /X.22 is now dead.
INFO [ScheduledTasks:1] 2010-11-06 16:39:54,922 Gossiper.java (line 195) InetAddress /X.20 is now dead.
INFO [ScheduledTasks:1] 2010-11-06 16:39:55,924 Gossiper.java (line 195) InetAddress /X.19 is now dead.
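
For anyone who wants to compare when each node marked its peers dead, here is a rough Python sketch that pulls the Gossiper "is now dead" events out of a log. It is not part of Cassandra; the function name dead_events and the regular expression are my own, and the pattern assumes the log line layout shown above.

```python
import re
from datetime import datetime

# Assumed layout, taken from the excerpt above:
#   "<date> <time>,<millis> Gossiper.java (line N) InetAddress /<host> is now dead."
DEAD_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) "
    r"Gossiper\.java \(line \d+\) InetAddress /(\S+) is now dead\."
)


def dead_events(log_text):
    """Return (timestamp, host) pairs for every 'is now dead' entry found.

    Hypothetical helper: it scans the whole text with finditer, so it works
    whether the log entries are on separate lines or run together.
    """
    events = []
    for match in DEAD_RE.finditer(log_text):
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        # The log prints milliseconds after the comma; convert to microseconds.
        stamp = stamp.replace(microsecond=int(match.group(2)) * 1000)
        events.append((stamp, match.group(3)))
    return events


if __name__ == "__main__":
    import sys

    # Usage: python dead_events.py /path/to/system.log
    with open(sys.argv[1]) as handle:
        for stamp, host in dead_events(handle.read()):
            print(stamp.isoformat(), host)
```

Running this against the log from each node and sorting the combined output by timestamp should show whether all nodes declared each other dead at about the same moment.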