loadbalance kills gossip?

Reverend Chip Sat, 06 Nov 2010 16:56:27 -0700

More weirdness with my four-or-five-node cluster of 0.7 beta3.  Having 
brought up all five nodes, including the one that didn't loadbalance
right, I tried loadbalancing it again.  (This is under completely idle
conditions - no external reads or writes.)  The result is a cluster
where each node thinks it's the only one that's Up.  (Or so they all
report when queried with "nodetool ring".)  It's been 15 minutes and no
nodes will talk to any other nodes, at all.  I already tried restarting
them all, and it's happened again.


Some more detail: This is a four-node cluster where hosts X.22 through
X.19 have been up and running, accepting a lot of data, over several
days.  Their loads are all about 500GB now.  (Actually their data disks
are more than 50% full, which is why I'm trying to add four more nodes,
one at a time.)  I brought up X.18, which correctly gave itself a good
token, but didn't stream itself any data.  So I figured I'd kick off the
streaming process with a "loadbalance" command.  I ran
    nodetool -h X.18 loadbalance
which kind of worked; it got as far as 'waiting 90s for load
information' in its log.  But this operation also seems to have stopped
gossip altogether: "nodetool ring" indicated that only X.22 was up, and X.21, X.20,
and X.19 had taken themselves down (or at least out of gossip).  When I
looked at the log for X.21, I found the below.  The first line looks
normal; host X.18 had taken itself down for loadbalancing, after all. 
But then this node also decided that everyone else was dead.  (As did
all the other nodes, about all the others.)

 INFO [GossipStage:1] 2010-11-06 16:39:45,109 HintedHandOffManager.java
(line 151) Deleting any stored hints for /X.18
 INFO [GossipStage:1] 2010-11-06 16:39:45,116 ColumnFamilyStore.java
(line 631) switching in a fresh Memtable for HintsColumnFamily at
CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1289078947182.log',
position=888)
 INFO [GossipStage:1] 2010-11-06 16:39:45,117 ColumnFamilyStore.java
(line 930) Enqueuing flush of memtable-hintscolumnfam...@1878733456(0
bytes, 0 operations)
 INFO [FlushWriter:1] 2010-11-06 16:39:45,118 Memtable.java (line 154)
Writing memtable-hintscolumnfam...@1878733456(0 bytes, 0 operations)
 INFO [ScheduledTasks:1] 2010-11-06 16:39:46,080 GCInspector.java (line
133) GC for ParNew: 320 ms, 151307960 reclaimed leaving 9401921352 used;
max is 34557919232
 INFO [FlushWriter:1] 2010-11-06 16:39:46,242 Memtable.java (line 161)
Completed flushing
/var/lib/cassandra/data/system/HintsColumnFamily-e-76-Data.db
 INFO [ScheduledTasks:1] 2010-11-06 16:39:53,921 Gossiper.java (line
195) InetAddress /X.22 is now dead.
 INFO [ScheduledTasks:1] 2010-11-06 16:39:54,922 Gossiper.java (line
195) InetAddress /X.20 is now dead.
 INFO [ScheduledTasks:1] 2010-11-06 16:39:55,924 Gossiper.java (line
195) InetAddress /X.19 is now dead.
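
For anyone comparing logs across nodes, here is a minimal sketch for pulling the "is now dead" events out of a log excerpt like the one above. It assumes the line format shown in this message (a level, a bracketed thread name, a timestamp, then the source file), and that entries may be hard-wrapped across lines as they are here. The function names join_wrapped and dead_events are made up for illustration and are not part of Cassandra.

```python
import re

# A new log entry starts with a level followed by a bracketed thread name.
# Anything else is treated as a wrapped continuation of the previous entry.
ENTRY_START = re.compile(r"^\s*(INFO|WARN|ERROR|DEBUG)\s+\[")

# Matches Gossiper "is now dead" entries and captures timestamp and host.
DEAD = re.compile(
    r"\[[^\]]+\]\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"Gossiper\.java.*?InetAddress /(?P<host>\S+) is now dead"
)


def join_wrapped(lines):
    """Rejoin entries that were hard-wrapped across several lines."""
    entries = []
    for line in lines:
        text = line.strip()
        if ENTRY_START.match(line) or not entries:
            entries.append(text)
        else:
            entries[-1] += " " + text
    return entries


def dead_events(lines):
    """Return (timestamp, host) pairs for each 'is now dead' entry, in order."""
    events = []
    for entry in join_wrapped(lines):
        match = DEAD.search(entry)
        if match:
            events.append((match.group("ts"), match.group("host")))
    return events
```

Running dead_events over the lines of each node's log gives a per-node timeline of when it marked its peers dead, which makes it easier to see whether all nodes lost each other at the same moment.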
