Thanks for the emails detailing this issue, both private and to the list. I've got a question for the list about our situation:
As stated, we did an upgrade from 0.14.2 to 1.0.1, and after that we added a new node to our cluster. This really messed things up and nodes started crashing. In the end I opted to remove the added node, and after quite a short while things settled down. The cluster is responding again. What we see now are corrupted files. We've tried to determine how many of them there are, but it's been a bit difficult (a rough sketch of the kind of check I have in mind is at the very bottom of this mail, after the quoted thread). What we know is that there ARE corrupted files (or at least files returned in an inconsistent state). I was wondering if there is anything we can do to get the cluster into a proper state again without having to manually delete everything that's corrupted? Is it possible that the data is actually there but just not returned in a proper state by Riak? I think it's only the larger files stored in Luwak that have this problem.

John

On 29 Oct 2011, at 01:03, John Axel Eriksson wrote:

> I've got the utmost respect for developers such as yourselves (Basho) and we've had great success using Riak - we have been using it in production since 0.11. We've had our share of problems with it during this whole time, but none as big as this. I can't understand why this wasn't posted somewhere using the blink tag and big red bold text. I mean, if I try to fsck a mounted disk in use on Linux I get:
>
> "WARNING!!! The filesystem is mounted. If you continue you ***WILL*** cause ***SEVERE*** filesystem damage."
>
> I understand why I don't get a warning like that when trying to run "riak-admin join r...@my.node.com" on Riak 1.0.1, but something similar to it happens.
>
> It goes against the whole idea of Riak being an ops dream - a distributed, fault-tolerant system - to ship a bug like this without disclosing it more openly than an entry in a bug tracking system. I don't want to be afraid of adding nodes to my cluster, but that is the result of this bug and the lack of communication about it. The 1.0.1 release should have been pulled, in my opinion.
>
> To sum it up, this was a nightmare for us. I didn't get much sleep last night and I woke up in hell. All of that - corrupted data, downtime and lost customer confidence - could have been avoided by better communication.
>
> I don't want to be too hard on you fine people at Basho - you provide a really great system in Riak and I understand what you're aiming for - but if anything as bad as this ever happens in the future, you might want to communicate it better and consider pulling the release.
>
> Thanks,
> John
>
>
> On 28 Oct 2011, at 17:51, Kelly McLaughlin wrote:
>
>> John,
>>
>> It appears you've run into a race condition with adding and leaving nodes that's present in 1.0.1. The problem happens during handoff and can cause bitcask directories to be unexpectedly deleted. We have identified the issue and we are in the process of correcting it, testing, and generating a new point release containing the fix. In the meantime, we apologize for the inconvenience and irritation this has caused.
>>
>> Kelly
>>
>>
>> On Oct 28, 2011, at 9:14 AM, John Axel Eriksson wrote:
>>
>>> Last night we did two things. First we upgraded our entire cluster from riak-search 0.14.2 to 1.0.1. This process went pretty well and the cluster was responding correctly after it was completed.
>>>
>>> In our cluster we have around 40,000 files stored in Luwak (we also have about the same number of keys, or more, in Riak, which is mostly the metadata for the files in Luwak).
>>> The files range in size from around 50K to around 400MB, though most of them are pretty small. I think we're up to a total of around 30GB now.
>>>
>>> Anyway, upon adding a new node to the now 1.0.1 cluster, I saw the beam.smp processes on all the servers, including the new one, taking up almost all available CPU. It stayed in this state for around an hour, and the cluster was slow to respond and occasionally timed out. During the process Riak crashed on random nodes from time to time and I had to restart it. After about an hour things settled down. I added this new node to our load balancer so it too could serve requests. When testing our apps against the cluster we still got lots of timeouts and something seemed very, very wrong.
>>>
>>> After a while I did a "riak-admin leave" on the node that was added (kind of a panic move, I guess). Around 20 minutes after I did this, the cluster started responding correctly again. All was not well though - files seemed to be corrupted (not sure what percentage, but it could be 1% or more). I have no idea how that could happen, but files that we had accessed before now contained garbage. I haven't thoroughly researched exactly WHAT garbage they contain, but they're not in a usable state anymore. Is this something that could happen under any circumstances in Riak?
>>>
>>> I'm afraid of adding a node at all now, since it resulted in downtime and corruption when I tried it. I checked and rechecked the configuration files and really - they're the same on all the nodes (except for vm.args, where they have different node names of course). Has anyone ever seen anything like this? Could it somehow be related to the fact that I did an upgrade from 0.14.2 to 1.0.1 and maybe an hour later added a new 1.0.1 node?
>>>
>>> Thanks for any input!
>>>
>>> John
>>> _______________________________________________
>>> riak-users mailing list
>>> riak-users@lists.basho.com
>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
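
P.S. Here's the rough sketch I mentioned above, for anyone curious. It's just an idea for flagging suspect Luwak files, not Basho tooling, and it assumes two things specific to our setup: Luwak's default HTTP interface at http://127.0.0.1:8098/luwak/<key>, and a checksums.txt file that our own app wrote out before the upgrade with lines of the form "<luwak-key> <md5-hex>" (that file name and format are hypothetical - adjust to whatever reference data you have).

#!/bin/sh
# Flag Luwak files whose current content no longer matches the MD5 we
# recorded before the upgrade. Purely a sketch under the assumptions above.
RIAK=http://127.0.0.1:8098
TMP=/tmp/luwak_check.$$

while read key expected; do
  # -f makes curl exit non-zero on HTTP errors (404, 500, ...)
  if ! curl -sf "$RIAK/luwak/$key" -o "$TMP"; then
    echo "FETCH FAILED: $key"
    continue
  fi
  actual=$(md5sum "$TMP" | awk '{print $1}')
  if [ "$actual" != "$expected" ]; then
    echo "SUSPECT: $key (expected $expected, got $actual)"
  fi
done < checksums.txt
rm -f "$TMP"

If the data for a suspect file is actually intact on some replicas, I'd hope re-reading it might eventually straighten things out via read repair, but I honestly don't know whether that applies to the way Luwak stores its segments, which is why I'm comparing against an external checksum instead.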