I've been running tests with first a four-node, then an eight-node
cluster.  I started with 0.7.0 beta3, but have since updated to a more
recent Hudson build.  I've been happy with a lot of things, but I've
also had some surprisingly unpleasant experiences with operational
fragility.

For example, when adding four nodes to a four-node cluster (at 2x
replication), two nodes insisted they were streaming data, but the
streams made no progress for over a day (this was with beta3).  I had
to reboot the whole cluster to clear that condition.  To keep my other
tests moving, I decided just to reload the data at eight nodes wide
(with the more recent build); but if the data hadn't been reloadable,
or the cluster had been serving production traffic, that would have
been a very inconvenient failure.

I also had a node that initially refused to bootstrap; after I waited
a day, it finally got its act together.

I write this not to complain per se, but to ask: are these failures
known and expected, such that rebooting a cluster is just a Thing You
Have To Do once in a while?  If not, what techniques can be used to
clear such cluster topology and streaming/replication problems without
a full restart?
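To make the question concrete: the kind of surgical fix I'm hoping
exists is something like invoking removeToken on the StorageService
MBean via a live node, rather than bouncing everything.  (The MBean
name and operation below are my reading of the 0.7 tree, so corrections
welcome.)

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Ask a live node to drop a wedged endpoint's token from the ring,
    // the JMX equivalent of `nodetool removetoken <token>`, instead of
    // restarting the whole cluster.
    public class RemoveToken {
        public static void main(String[] args) throws Exception {
            String host = args[0];   // any live node
            String token = args[1];  // token of the stuck node, per the ring
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":8080/jmxrmi");
            JMXConnector jmxc = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
                ObjectName ss = new ObjectName(
                    "org.apache.cassandra.db:type=StorageService");
                mbs.invoke(ss, "removeToken",
                           new Object[] { token },
                           new String[] { "java.lang.String" });
            } finally {
                jmxc.close();
            }
        }
    }

Whether that (or anything else) actually clears a stuck stream or a
confused ring is exactly what I'm asking.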
