OK. Reconstructing the past failures is impractical, but I'm prepared for next time.
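Specifically, I've turned logging up ahead of time so I can capture something useful on both ends the next time a stream stalls. In case it helps anyone else, this is the change I made -- assuming the stock conf/log4j-server.properties that ships with 0.7; appender names and the exact streaming package may differ in your build:

    # conf/log4j-server.properties
    # Bump the root logger from INFO to DEBUG so streaming/bootstrap
    # activity gets logged on both the source and the target node.
    log4j.rootLogger=DEBUG,stdout,R

    # Or leave the root at INFO and only turn up streaming
    # (package name is my assumption for this version):
    #log4j.logger.org.apache.cassandra.streaming=DEBUG

I'll attach those logs, from both source and target, if it happens again.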
On 11/12/2010 6:38 PM, Jonathan Ellis wrote:
> These are not expected. In order of increasing utility of fixing it
> we could use
>
> - INFO level logs from when something went wrong; when streaming,
>   both source and target
> - DEBUG level logs
> - instructions for how to reproduce
>
> On Thu, Nov 11, 2010 at 7:46 PM, Reverend Chip <rev.c...@gmail.com> wrote:
>> I've been running tests with a first four-node, then eight-node
>> cluster. I started with 0.7.0 beta3, but have since updated to a more
>> recent Hudson build. I've been happy with a lot of things, but I've
>> had some really surprisingly unpleasant experiences with operational
>> fragility.
>>
>> For example, when adding four nodes to a four-node cluster (at 2x
>> replication), I had two nodes that insisted they were streaming data,
>> but no progress was made in the stream for over a day (this was with
>> beta3). I had to reboot the cluster to clear that condition. For the
>> purpose of making progress on other tests I decided just to reload
>> the data at eight-wide (with the more recent build), but if I had
>> data I couldn't reload or the cluster were serving in production,
>> that would have been a very inconvenient failure.
>>
>> I also had a node that refused to bootstrap immediately, but after I
>> waited a day, it finally got its act together.
>>
>> I write this, not to complain per se, but to ask whether these
>> failures are known & expected, and rebooting a cluster is just a
>> Thing You Have To Do once in a while; or if not, what techniques can
>> be used to clear such cluster topology and streaming/replication
>> problems without rebooting.
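One more note for the archives: next time, before restarting anything, I plan to watch the transfer from both ends with nodetool instead of just waiting. Roughly like the following, with the caveat that I'm not sure of the exact subcommand name on the build I'm running (it seems to be "streams" in some builds and "netstats" in others):

    # check ring state and stream progress on both ends of the transfer
    nodetool -h <source-host> ring
    nodetool -h <source-host> streams
    nodetool -h <dest-host> streams

If the reported file list and progress never change over a long stretch, that at least gives something concrete to attach to a ticket along with the DEBUG logs.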