We never have to reboot our production cluster. However, we're running a release version (0.6.6), not a beta. If your aim is to avoid fragility, running a release version would seem a sensible starting point.
dave

On Friday, November 12, 2010, Reverend Chip <rev.c...@gmail.com> wrote:
> I've been running tests with a first four-node, then eight-node
> cluster. I started with 0.7.0 beta3, but have since updated to a more
> recent Hudson build. I've been happy with a lot of things, but I've had
> some really surprisingly unpleasant experiences with operational fragility.
>
> For example, when adding four nodes to a four-node cluster (at 2x
> replication), I had two nodes that insisted they were streaming data,
> but no progress was made in the stream for over a day (this was with
> beta3). I had to reboot the cluster to clear that condition. For the
> purpose of making progress on other tests I decided just to reload the
> data at eight-wide (with the more recent build), but if I had data I
> couldn't reload or the cluster were serving in production, that would
> have been a very inconvenient failure.
>
> I also had a node that refused to bootstrap immediately, but after I
> waited a day, it finally got its act together.
>
> I write this, not to complain per se, but to ask whether these failures
> are known & expected, and rebooting a cluster is just a Thing You Have
> To Do once in a while; or if not, what techniques can be used to clear
> such cluster topology and streaming/replication problems without rebooting.
>

--
*Dave Gardner*
Technical Architect

*Imagini Europe Limited*
7 Moor Street, London W1D 5NB
Phone: +44 20 7734 7033
Skype: daveg79
Email: dave.gard...@imagini.net
Web: http://www.visualdna.com

Imagini Europe Limited, Company number 5565112 (England and Wales),
Registered address: c/o Bird & Bird, 90 Fetter Lane, London, EC4A 1EQ,
United Kingdom