Hey Kelly, Bryan,

Thanks for the replies. Good to hear this is being worked on! And sorry I 
didn't elaborate on "crashed." In this instance, "crashed" meant "stopped taking 
connections on the HTTP interface." I didn't check whether the Beam processes 
died (I think they did, since load decreased).

I bumped my ulimit -n based on previous suggestions and that seemed to help. 
If/when I run into this again I will indeed post more details!
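
In case it helps anyone else checking the same thing, here is a small Python
sketch for reading the open-file limit that `ulimit -n` controls. It reports
the limit of the Python process itself, so it only says something about Riak
if it is run as the same user and from the same shell that starts the node.
The helper names and the 4096 threshold in the usage note are mine, picked
for illustration, not a Riak recommendation.

```python
import resource


def fd_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def soft_limit_at_least(minimum):
    """True if the soft open-file limit is unlimited or >= minimum.

    `minimum` is a caller-chosen threshold (hypothetical, not from Riak).
    """
    soft, _hard = fd_limits()
    return soft == resource.RLIM_INFINITY or soft >= minimum
```

Calling `soft_limit_at_least(4096)` before starting the node would be one way
to catch a low limit early; `fd_limits()` returns the raw numbers.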


>> So my question is: Why did this completely kill Riak? This makes me pretty 
>> nervous--a bug in our app has the potential to bring down the ring! Is there 
>> anything we can do to protect against this?
> Riak 1.2 had a lot of changes to leveldb and one of those was a change to 
> using flock() instead of fcntl(F_SETLK) to try and make the locking a bit 
> saner. Previously, using fcntl, multiple processes in the Erlang VM could get 
> a lock to the same leveldb instance and this could obviously lead to some 
> conflicts. However, a result of the change to using flock is that when the 
> vnode crashes the resources can still be locked by the previous process and 
> this results in this message:
>       2012-08-29 19:45:41.785 [error] <0.23924.70>@riak_kv_vnode:init:265 
> Failed to start riak_kv_multi_backend Reason: 
> [{riak_kv_eleveldb_backend,{db_open,"IO error: lock 
> ../../tmp/riak/instance1/leveldb/0/LOCK: Resource temporarily unavailable"}}]
> Currently we do not attempt to wait or retry the vnode restart and this can 
> cause the node to crash. I can understand you being a little nervous, but we 
> are aware of this and are taking steps on two fronts to address it. First, as 
> Bryan mentioned previously, we're looking at fixing these error conditions 
> that cause the vnode to crash that really should not do so. Second, we're 
> looking at a way to add some retry logic when the vnode does crash and the 
> resources are locked. Thanks for the report!
> Kelly
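
For what it's worth, "Resource temporarily unavailable" is the message for
EAGAIN, which is what a non-blocking flock() reports when another open handle
still holds the lock. Below is a rough Python sketch of the retry idea as I
understand it. It is not Riak's or eleveldb's actual code: `try_lock`,
`lock_with_retry`, the attempt count, and the delays are names and values I
made up for illustration.

```python
import errno
import fcntl
import os
import time


def try_lock(path):
    """Open `path` and take an exclusive, non-blocking flock() on it.

    Returns the file descriptor on success, or None if the lock is
    currently held through another open handle.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as err:
        os.close(fd)
        if err.errno in (errno.EAGAIN, errno.EACCES):
            return None  # lock is busy: "Resource temporarily unavailable"
        raise
    return fd


def lock_with_retry(path, attempts=5, delay=0.1):
    """Retry try_lock() with a doubling delay (illustrative values only)."""
    for attempt in range(attempts):
        fd = try_lock(path)
        if fd is not None:
            return fd
        if attempt < attempts - 1:
            time.sleep(delay)
            delay *= 2
    raise TimeoutError(f"could not lock {path} after {attempts} attempts")
```

The lock is released when the returned descriptor is closed. Note that a
second `try_lock` on the same file fails even inside one process, because
flock() locks belong to the open file handle rather than to the process,
which matches the behavior Kelly describes above.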

