I was waiting for Basho to write an official notice about this, but it's
been three days and I really don't want anyone else to go through this
shitshow.
1.0.1 contains a race condition which can cause vnodes to crash during
partition drop. This crash will take down the entire Riak process. On our
six-node, 1024-partition cluster, during riak-admin leave, we
experienced roughly one crash per minute for over an hour. Basho's
herculean support efforts got us a patch that forces vnode drop to be
synchronous; leave/join is quite stable with this change.
https://issues.basho.com/show_bug.cgi?id=1263
I strongly encourage 1.0.1 users to avoid using riak-admin join and
riak-admin leave until this patch is available.
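If you absolutely have to rebalance in the meantime, at least watch
handoff activity and ring agreement before trusting the cluster again.
As far as I know, the stock riak-admin commands in 1.0.x are enough for
that:

  # show any partition handoffs currently in flight
  $ riak-admin transfers

  # check whether all nodes agree on the ring
  $ riak-admin ringready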
--Kyle
On 10/28/2011 08:14 AM, John Axel Eriksson wrote:
Last night we did two things. First, we upgraded our entire cluster from
riak-search 0.14.2 to 1.0.1. This process went pretty well, and the
cluster was responding correctly after it completed.
In our cluster we have around 40,000 files stored in Luwak (we also have
at least as many keys in Riak itself, mostly metadata for the files in
Luwak). The files range in size from around 50 KB to around 400 MB,
though most of them are pretty small. I think we're up to a total of
around 30 GB now.
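For context, we read and write these files through Luwak's HTTP
interface, so access is just plain GETs and PUTs, roughly like this (the
hostname and key below are made up):

  # store a large file; Luwak chunks it internally
  $ curl -X PUT -H "Content-Type: application/octet-stream" \
      --data-binary @big-file.bin http://riak-host:8098/luwak/big-file.bin

  # fetch it back
  $ curl -o fetched.bin http://riak-host:8098/luwak/big-file.bin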
Anyway, upon adding a new node to the now-1.0.1 cluster, I saw the
beam.smp processes on all the servers, including the new one, eating
almost all available CPU. It stayed in this state for around an hour,
and the cluster was slow to respond and occasionally timed out. During
the process Riak crashed on random nodes from time to time and I had to
restart it. After about an hour things settled down, and I added the new
node to our load balancer so it too could serve requests. But when
testing our apps against the cluster we still got lots of timeouts, and
something seemed very, very wrong.
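For the record, I joined the node the standard way (the node name here
is invented):

  # on the new node, once Riak is running
  $ riak-admin join riak@node1.example.com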
After a while I did a "riak-admin leave" on the node that was added
(kind of a panic move, I guess). Around 20 minutes after I did this, the
cluster started responding correctly again. All was not well, though:
files seemed to be corrupted (I'm not sure what percentage, but it could
be 1% or more). I have no idea how that could happen, but files we had
accessed before now contained garbage. I haven't thoroughly researched
exactly WHAT garbage they contain, but they're not in a usable state
anymore. Is this something that could happen under any circumstances in
Riak?
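So far the only way I've found to gauge the damage is to re-fetch each
file over the Luwak interface and compare it against a checksum we
recorded earlier. Something like this sketch (manifest.txt is our own
list of "<sha1>  <key>" pairs, not anything Riak provides):

  $ while read sum key; do
      curl -s "http://riak-host:8098/luwak/$key" | sha1sum | \
        awk -v k="$key" -v s="$sum" '$1 != s { print "CORRUPT: " k }'
    done < manifest.txt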
I'm afraid to add a node at all now, since it resulted in downtime and
corruption when I tried it. I've checked and rechecked the configuration
files, and they really are the same on all the nodes (except for
vm.args, where the node names differ, of course). Has anyone ever seen
anything like this? Could it somehow be related to the fact that I
upgraded from 0.14.2 to 1.0.1 and then, maybe an hour later, added a new
1.0.1 node?
Thanks for any input!
John
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com