Re: Bad MapReduce job brings the Riak to a screeching halt?

Kelly McLaughlin Thu, 30 Aug 2012 15:16:28 -0700

On Aug 29, 2012, at 9:07 PM, Brad Heller <b...@cloudability.com> wrote:
> 
> So my question is: Why did this completely kill Riak? This makes me pretty 
> nervous--a bug in our app has the potential to bring down the ring! Is there 
> anything we can do to protect against this?
>


Riak 1.2 had a lot of changes to leveldb, and one of those was a switch from 
fcntl(F_SETLK) to flock() to try to make the locking a bit saner. Previously, 
with fcntl, multiple processes in the Erlang VM could get a lock on the same 
leveldb instance, and this could obviously lead to conflicts. However, a result 
of the change to flock is that when a vnode crashes, the resources can still be 
locked by the previous process, and this results in this message:

        2012-08-29 19:45:41.785 [error] <0.23924.70>@riak_kv_vnode:init:265 Failed to start riak_kv_multi_backend Reason: [{riak_kv_eleveldb_backend,{db_open,"IO error: lock ../../tmp/riak/instance1/leveldb/0/LOCK: Resource temporarily unavailable"}}]
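
To make the difference between the two locking styles concrete, here is a 
minimal Python sketch (not leveldb or Riak code; the helper names are made up 
for illustration). It opens the same file twice inside one OS process and 
tries to take an exclusive lock on both descriptors. It assumes a Unix-like 
system where the temp directory supports both kinds of lock.

```python
import errno
import fcntl
import os
import tempfile


def try_lock_twice(lock_fn):
    """Open one file twice in this process and lock both descriptors.

    Returns 0 if the second lock attempt succeeds, otherwise its errno.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "LOCK")
        first = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        second = os.open(path, os.O_RDWR)
        try:
            lock_fn(first)  # the first holder is expected to succeed
            try:
                lock_fn(second)
                return 0
            except OSError as exc:
                return exc.errno
        finally:
            os.close(second)
            os.close(first)


def flock_lock(fd):
    # flock() locks belong to the open file description, so a second
    # open() of the same file conflicts even within one process.
    fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)


def fcntl_lock(fd):
    # lockf() wraps fcntl() record locks, which are owned by the OS
    # process, so the same process can "lock" the file again.
    fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
```

On Linux I would expect try_lock_twice(fcntl_lock) to return 0 and 
try_lock_twice(flock_lock) to return EAGAIN, which is the errno behind 
"Resource temporarily unavailable" in the message above.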

Currently we do not wait for the lock or retry the vnode restart, and this can 
cause the node to crash. I can understand you being a little nervous, but we 
are aware of this and are taking steps on two fronts to address it. First, as 
Bryan mentioned previously, we're looking at fixing the error conditions that 
crash the vnode when they really should not. Second, we're looking at a way to 
add some retry logic for when the vnode does crash and the resources are still 
locked. Thanks for the report!
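
For illustration only, the general shape of that kind of retry logic might 
look like the Python sketch below. This is not the planned Riak change; 
open_with_retry, its parameters, and the backoff policy are all hypothetical. 
It retries a caller-supplied open function while the failure is a "resource 
temporarily unavailable" error and re-raises anything else.

```python
import errno
import time


def open_with_retry(open_fn, attempts=5, delay=0.1, sleep=time.sleep):
    """Call open_fn(), retrying while the lock is temporarily unavailable.

    open_fn is any callable that raises OSError with errno EAGAIN or
    EWOULDBLOCK when the lock is still held elsewhere.
    """
    for attempt in range(1, attempts + 1):
        try:
            return open_fn()
        except OSError as exc:
            # Only a held lock is worth retrying; other errors propagate.
            if exc.errno not in (errno.EAGAIN, errno.EWOULDBLOCK):
                raise
            # Give up after the last attempt and surface the error.
            if attempt == attempts:
                raise
            # Linear backoff: wait a little longer before each retry.
            sleep(delay * attempt)
```

The idea is simply to give a previous holder a bounded amount of time to 
release the lock before giving up.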

Kelly
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
