So, from my understanding: one of your servers was being replaced, you did
a leave from the cluster, the leave failed to commit, and then another node
failed, leaving roughly 3/4 or 3/5 of the ring up?
Did you down the failed node, or remove it from the cluster? What's
the current status of your ring?

Can you run:
riak-admin status (on each node)
riak-admin diag
riak-admin member-status
riak-admin ring-status
riak-admin vnode-status
riak-admin transfers
riak-admin transfer-limit

Can you put all of this in a GitHub gist (gist.github.com)?
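If it helps, something like this rough sketch can pull all of the above
into one file for the gist (the node list, ssh access, and riak-admin
being on the PATH are assumptions -- adjust for your setup):

    #!/bin/sh
    # Collect riak-admin diagnostics from each node into a single file.
    # NODES is hypothetical -- substitute your actual node addresses.
    NODES="10.0.1.1 10.0.1.2 10.0.1.3 10.0.1.4 10.0.1.5"
    OUT=riak-diag.txt
    : > "$OUT"
    for node in $NODES; do
      echo "==== $node ====" >> "$OUT"
      for cmd in status diag member-status ring-status vnode-status \
                 transfers transfer-limit; do
        echo "---- riak-admin $cmd ----" >> "$OUT"
        ssh "$node" riak-admin "$cmd" >> "$OUT" 2>&1
      done
    done

(member-status, ring-status, and transfers report cluster-wide state, so
running them on every node is redundant, but it's harmless and keeps the
script simple.)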
Also, are you running AAE (active anti-entropy)?

You might want to temporarily turn off AAE and transfers while you debug,
just to get a stable view of what's going on with the cluster at the
moment.
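A minimal sketch of what I mean (hedged -- please double-check these
against the docs for your version before running them):

    # Pause handoff cluster-wide by setting the transfer limit to 0
    # (riak-admin transfer-limit <limit> applies to every node):
    riak-admin transfer-limit 0

    # Disable AAE without a restart, from an attached console (riak attach).
    # These are the riak_kv 2.0 entropy manager calls as I recall them:
    riak_kv_entropy_manager:disable().
    riak_kv_entropy_manager:cancel_exchanges().

When you're done debugging, set the transfer limit back to the default (2,
if I remember right) and run riak_kv_entropy_manager:enable(). to turn
both back on.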

On Thu, Nov 6, 2014 at 1:05 PM, Oleksiy Krivoshey <oleks...@gmail.com> wrote:
> Just got a new problem with Riak. Recently a hard drive failed on one of the
> Riak nodes, so I had to shut it down. I'm running 4 nodes now, and every 10
> minutes all of them start failing with 'Error: {error,mailbox_overload}'
> until restarted. Can anyone from Basho please suggest a solution/fix for
> this? My whole cluster is unusable with just one node down.
>
> On 5 November 2014 00:11, Oleksiy Krivoshey <oleks...@gmail.com> wrote:
>>
>> There were also errors during the initial handoff; here is the full
>> console.log for that day:
>> https://www.dropbox.com/s/o7zop181pvpxoa5/console.log?dl=0
>>
>> I actually replaced two nodes that day. The first one went smoothly, as it
>> should. The second one resulted in the situation above. I replaced the first
>> one, and then the second a few hours later.
>>
>> On 4 November 2014 20:44, Oleksiy Krivoshey <oleks...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I'm running a 5-node cluster (Riak 2.0.0) and I had to replace hardware
>>> on one of the servers. So I did a 'cluster leave', waited until the node
>>> exited, and checked the ring status and member status; all was OK, with no
>>> pending changes. Then, about 5 minutes later, every client connection to
>>> any of the 4 remaining nodes started to fail with
>>>
>>> [Error: {error,mailbox_overload}
>>>
>>> I restarted the nodes one after another and the error went away. However,
>>> I was still experiencing connectivity issues (timeouts), and the Riak error
>>> log is full of various errors, even after I joined the 5th node back.
>>>
>>> The errors look like:
>>>
>>> Failed to merge
>>> {["/var/lib/riak/bitcask_expire_1d/685078892498860742907977265335757665463718379520/1.bitcask.data"]
>>>
>>> gen_fsm <0.818.0> in state active terminated with reason: bad record
>>> state in riak_kv_vnode:set_vnode_forwarding/2 line 991
>>>
>>> @riak_pipe_vnode:new_worker:826 Pipe worker startup failed:
>>>
>>>
>>> msg,7,[{file,"gen_fsm.erl"},{line,505}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]
>>> 2014-11-04 16:07:57.124 [error]
>>> <0.11128.0>@riak_core_handoff_sender:start_fold:279 hinted_handoff transfer
>>> of riak_kv_vnode from 'riak@10.0.1.1'
>>> 353957427791078050502454920423474793822921162752 to 'riak@10.0.1.5'
>>> 353957427791078050502454920423474793822921162752 failed because of
>>> error:undef
>>> [{riak_core_format,human_size_fmt,["~.2f",588],[]},{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,246}]}]
>>>
>>> The full error log file is available here:
>>> https://www.dropbox.com/s/3b8x3nqyego7lw3/error.log?dl=0
>>>
>>> There was no significant load on Riak, so I would like to understand what
>>> caused so many errors.
>>>
>>> --
>>> Oleksiy
>>
>> --
>> Oleksiy Krivoshey
>
> --
> Oleksiy Krivoshey

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
