Thanks for the emails detailing this issue, both private and to the list. I've got a question for the list about our situation:
As stated, we did an upgrade from 0.14.2 to 1.0.1, and after that we added a new node to our cluster. This really messed things up and nodes started crashing. In the end I opted to remove the added node, and after quite a short while things settled down. The cluster is responding again. What we see now are corrupted files. We've tried to determine how many of them there are, but it's been a bit difficult (a rough sketch of the kind of check I have in mind is at the very bottom of this mail, after the quoted thread). What we know is that there ARE corrupted files (or at least files returned in an inconsistent state). I was wondering if there is anything we can do to get the cluster into a proper state again without having to manually delete everything that's corrupted? Is it possible that the data is actually there but just not returned in a proper state by Riak? I think it's only the larger files stored in Luwak that have this problem.

John

On 29 Oct 2011, at 01:03, John Axel Eriksson wrote:

> I've got the utmost respect for developers such as yourselves (Basho) and we've had great success using Riak - we have been using it in production since 0.11. We've had our share of problems with it during this whole time, but none as big as this. I can't understand why this wasn't posted somewhere using the blink tag and big red bold text. I mean, if I try to fsck a mounted disk in use on Linux I get:
>
> "WARNING!!! The filesystem is mounted. If you continue you ***WILL*** cause ***SEVERE*** filesystem damage."
>
> I understand why I don't get a warning like that when trying to run "riak-admin join r...@my.node.com" on Riak 1.0.1, but something similar to it happens.
>
> It goes against the whole idea of Riak being an ops dream - a distributed, fault-tolerant system - to ship a bug like this without disclosing it more openly than an entry in a bug tracking system. I don't want to be afraid of adding nodes to my cluster, but that is the result of this bug and the lack of communication about it. The 1.0.1 release should have been pulled, in my opinion.
>
> To sum it up, this was a nightmare for us. I didn't get much sleep last night and I woke up in hell. All of that - corrupted data, downtime and lost customer confidence - could have been avoided by better communication.
>
> I don't want to be too hard on you fine people at Basho - you provide a really great system in Riak and I understand what you're aiming for - but if anything as bad as this ever happens in the future, you might want to communicate it better and consider pulling the release.
>
> Thanks,
> John
>
>
> On 28 Oct 2011, at 17:51, Kelly McLaughlin wrote:
>
>> John,
>>
>> It appears you've run into a race condition with adding and leaving nodes that's present in 1.0.1. The problem happens during handoff and can cause bitcask directories to be unexpectedly deleted. We have identified the issue and we are in the process of correcting it, testing, and generating a new point release containing the fix. In the meantime, we apologize for the inconvenience and irritation this has caused.
>>
>> Kelly
>>
>>
>> On Oct 28, 2011, at 9:14 AM, John Axel Eriksson wrote:
>>
>>> Last night we did two things. First we upgraded our entire cluster from riak-search 0.14.2 to 1.0.1. This process went pretty well and the cluster was responding correctly after it was completed.
>>>
>>> In our cluster we have around 40,000 files stored in Luwak (we also have about the same number of keys, or more, in Riak, which is mostly the metadata for the files in Luwak).
>>> The files range in size from around 50K to around 400MB, though most of them are pretty small. I think we're up to a total of around 30GB now.
>>>
>>> Anyway, upon adding a new node to the now 1.0.1 cluster, I saw the beam.smp processes on all the servers, including the new one, taking up almost all available CPU. It stayed in this state for around an hour, and the cluster was slow to respond and occasionally timed out. During the process Riak crashed on random nodes from time to time and I had to restart it. After about an hour things settled down. I added this new node to our load balancer so it too could serve requests. When testing our apps against the cluster we still got lots of timeouts and something seemed very, very wrong.
>>>
>>> After a while I did a "riak-admin leave" on the node that was added (kind of a panic move, I guess). Around 20 minutes after I did this, the cluster started responding correctly again. All was not well though - files seemed to be corrupted (not sure what percentage, but it could be 1% or more). I have no idea how that could happen, but files that we had accessed before now contained garbage. I haven't thoroughly researched exactly WHAT garbage they contain, but they're not in a usable state anymore. Is this something that could happen under any circumstances in Riak?
>>>
>>> I'm afraid of adding a node at all now, since it resulted in downtime and corruption when I tried it. I checked and rechecked the configuration files and really - they're the same on all the nodes (except for vm.args, where they have different node names of course). Has anyone ever seen anything like this? Could it somehow be related to the fact that I did an upgrade from 0.14.2 to 1.0.1 and maybe an hour later added a new 1.0.1 node?
>>>
>>> Thanks for any input!
>>>
>>> John
>>> _______________________________________________
>>> riak-users mailing list
>>> riak-users@lists.basho.com
>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
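
P.S. Here's the rough sketch I mentioned above, for anyone curious. It's just an idea for flagging suspect Luwak files, not Basho tooling, and it assumes two things specific to our setup: Luwak's default HTTP interface at http://127.0.0.1:8098/luwak/<key>, and a checksums.txt file that our own app wrote out before the upgrade with lines of the form "<luwak-key> <md5-hex>" (that file name and format are hypothetical - adjust to whatever reference data you have).

#!/bin/sh
# Flag Luwak files whose current content no longer matches the MD5 we
# recorded before the upgrade. Purely a sketch under the assumptions above.
RIAK=http://127.0.0.1:8098
TMP=/tmp/luwak_check.$$

while read key expected; do
  # -f makes curl exit non-zero on HTTP errors (404, 500, ...)
  if ! curl -sf "$RIAK/luwak/$key" -o "$TMP"; then
    echo "FETCH FAILED: $key"
    continue
  fi
  actual=$(md5sum "$TMP" | awk '{print $1}')
  if [ "$actual" != "$expected" ]; then
    echo "SUSPECT: $key (expected $expected, got $actual)"
  fi
done < checksums.txt
rm -f "$TMP"

If the data for a suspect file is actually intact on some replicas, I'd hope re-reading it might eventually straighten things out via read repair, but I honestly don't know whether that applies to the way Luwak stores its segments, which is why I'm comparing against an external checksum instead.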