OK. Reconstructing the past failures is impractical, but I'm prepared for next time.
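Specifically, I've turned logging up ahead of time so I can capture something useful on both ends the next time a stream stalls. In case it helps anyone else, this is the change I made -- assuming the stock conf/log4j-server.properties that ships with 0.7; appender names and the exact streaming package may differ in your build:

    # conf/log4j-server.properties
    # Bump the root logger from INFO to DEBUG so streaming/bootstrap
    # activity gets logged on both the source and the target node.
    log4j.rootLogger=DEBUG,stdout,R

    # Or leave the root at INFO and only turn up streaming
    # (package name is my assumption for this version):
    #log4j.logger.org.apache.cassandra.streaming=DEBUG

I'll attach those logs, from both source and target, if it happens again.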
On 11/12/2010 6:38 PM, Jonathan Ellis wrote:
> These are not expected. In order of increasing utility of fixing it
> we could use
>
> - INFO level logs from when something went wrong; when streaming,
>   both source and target
> - DEBUG level logs
> - instructions for how to reproduce
>
> On Thu, Nov 11, 2010 at 7:46 PM, Reverend Chip <rev.c...@gmail.com> wrote:
>> I've been running tests with a first four-node, then eight-node
>> cluster. I started with 0.7.0 beta3, but have since updated to a more
>> recent Hudson build. I've been happy with a lot of things, but I've
>> had some really surprisingly unpleasant experiences with operational
>> fragility.
>>
>> For example, when adding four nodes to a four-node cluster (at 2x
>> replication), I had two nodes that insisted they were streaming data,
>> but no progress was made in the stream for over a day (this was with
>> beta3). I had to reboot the cluster to clear that condition. For the
>> purpose of making progress on other tests I decided just to reload
>> the data at eight-wide (with the more recent build), but if I had
>> data I couldn't reload or the cluster were serving in production,
>> that would have been a very inconvenient failure.
>>
>> I also had a node that refused to bootstrap immediately, but after I
>> waited a day, it finally got its act together.
>>
>> I write this, not to complain per se, but to ask whether these
>> failures are known & expected, and rebooting a cluster is just a
>> Thing You Have To Do once in a while; or if not, what techniques can
>> be used to clear such cluster topology and streaming/replication
>> problems without rebooting.
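One more note for the archives: next time, before restarting anything, I plan to watch the transfer from both ends with nodetool instead of just waiting. Roughly like the following, with the caveat that I'm not sure of the exact subcommand name on the build I'm running (it seems to be "streams" in some builds and "netstats" in others):

    # check ring state and stream progress on both ends of the transfer
    nodetool -h <source-host> ring
    nodetool -h <source-host> streams
    nodetool -h <dest-host> streams

If the reported file list and progress never change over a long stretch, that at least gives something concrete to attach to a ticket along with the DEBUG logs.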