On Aug 29, 2012, at 9:07 PM, Brad Heller <b...@cloudability.com> wrote:

>
> So my question is: Why did this completely kill Riak? This makes me pretty
> nervous--a bug in our app has the potential to bring down the ring! Is there
> anything we can do to protect against this?
>
Riak 1.2 included a lot of changes to leveldb, and one of them was a switch from fcntl(F_SETLK) to flock() to make the locking a bit saner. Previously, with fcntl, multiple processes in the Erlang VM could get a lock on the same leveldb instance, which could obviously lead to conflicts.

However, one result of the change to flock is that when a vnode crashes, its resources can still be locked by the previous process when the vnode restarts, and this results in the following message:

    2012-08-29 19:45:41.785 [error] <0.23924.70>@riak_kv_vnode:init:265 Failed to start riak_kv_multi_backend Reason: [{riak_kv_eleveldb_backend,{db_open,"IO error: lock ../../tmp/riak/instance1/leveldb/0/LOCK: Resource temporarily unavailable"}}]

Currently we do not attempt to wait or retry the vnode restart, and this can cause the node to crash.

I can understand you being a little nervous, but we are aware of this and are taking steps on two fronts to address it. First, as Bryan mentioned previously, we're looking at fixing the error conditions that crash the vnode but really should not. Second, we're looking at a way to add some retry logic for the case where the vnode does crash and the resources are still locked.

Thanks for the report!

Kelly
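For illustration only, the "wait and retry" idea mentioned above can be sketched in Python. This is not Riak's or eleveldb's code. The function name acquire_lock_with_retry and the attempts and delay parameters are hypothetical, and the sketch assumes a POSIX system where flock() is available. It takes a non-blocking exclusive flock() on a LOCK file and, if the lock is still held ("Resource temporarily unavailable"), backs off and tries again instead of failing on the first attempt.

```python
import errno
import fcntl
import os
import time


def acquire_lock_with_retry(path, attempts=5, delay=0.1):
    """Take an exclusive flock() on `path`, retrying while it is held.

    Returns the open file descriptor on success and re-raises the last
    OSError if every attempt fails. `attempts` and `delay` are
    illustrative defaults, not values taken from Riak.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        for attempt in range(attempts):
            try:
                # LOCK_NB makes flock() fail immediately instead of blocking.
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                return fd
            except OSError as exc:
                busy = exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK, errno.EACCES)
                if not busy or attempt == attempts - 1:
                    raise
                # Exponential backoff before the next attempt.
                time.sleep(delay * (2 ** attempt))
    except BaseException:
        # Do not leak the descriptor when giving up.
        os.close(fd)
        raise
```

The caller keeps the returned descriptor open for as long as it needs the lock; closing it releases the lock. Because flock() locks belong to the open file description rather than to the OS process, a second open of the same LOCK file from inside the same process is refused while the first is still held, which is the behavior described above.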