We have a Riak cluster on EC2 (large instances, ephemeral storage, normally ~5 
nodes, though currently 7 after some shuffling) that has seen multiple nodes go 
down over the last week due to corrupted merge_indexes. Certain boxes go down 
more frequently than others, but it's not predictable, and it seems like any 
box can be affected. The errors look similar to what I read about in 
this thread:

http://lists.basho.com/pipermail/riak-users_lists.basho.com/2012-July/008933.html

except that it's occurring on multiple nodes, which prevents us from doing 
repairs from adjacent nodes.

I've tried a few things:

- Stopping riak on a single box, clearing out merge_index/, and restarting 
riak. This works for several hours to about a day, but the index eventually 
becomes corrupt again.
- Stopping all nodes, clearing out all merge_index folders, and restarting all 
nodes. As above, this works for several hours, but eventually we see corrupted 
merge indexes again. And obviously this loses all past index data, so even if 
it worked it wouldn't be a suitable solution. I just needed to stop the nodes 
from going down.
- Using Ryan Zezeski's script to detect bad MI files 
(https://gist.github.com/3250870). There are too many bad files for this to be 
a sustainable ongoing effort, though.
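
For reference, the "clear out merge_index/" step can be sketched so that the 
suspect files are renamed rather than deleted, which keeps them available for 
later inspection. This is only a minimal sketch: the function name and the 
data_root argument are hypothetical, the actual data directory depends on the 
install, and it assumes riak has already been stopped on the box.

```python
import os
import time


def set_aside_merge_index(data_root, now=None):
    """Rename <data_root>/merge_index to merge_index.corrupt-<UTC timestamp>.

    Returns the new path, or None if there is no merge_index directory.
    Assumption: the riak node using data_root is stopped before this runs.
    """
    src = os.path.join(data_root, "merge_index")
    if not os.path.isdir(src):
        return None
    # now=None means "current time"; a fixed value makes the name predictable.
    stamp = time.strftime("%Y%m%dT%H%M%S", time.gmtime(now))
    dst = os.path.join(data_root, "merge_index.corrupt-" + stamp)
    os.rename(src, dst)
    return dst
```

Renaming keeps the old files on disk, so the set-aside directories need to be 
cleaned up by hand once they are no longer useful.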

I spoke to Tom Santero who hypothesized that there could be something 
underlying in the EC2 infrastructure that's causing some corruption problems. 
We don't have a ticket off of EC2 right now, but what I am doing (as I write 
this) is widening the cluster to span three availability zones (all in 
US-East). My thinking is that if we continue to see problems but they're all 
in a specific zone, that would confirm Tom's hypothesis (though the alternative 
would not disprove it).
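
The zone comparison can be sketched as a simple tally, plus a rough check of 
how likely it is for every incident to land in one zone by chance alone. This 
is a minimal sketch: the node and zone names are hypothetical, and the 
probability assumes each incident independently hits a uniformly random node, 
which may not hold in practice.

```python
from collections import Counter


def incidents_by_zone(incident_nodes, zone_of):
    """Count corruption incidents per availability zone."""
    return Counter(zone_of[node] for node in incident_nodes)


def chance_all_in_one_zone(nodes_per_zone, k):
    """Probability that k incidents all fall in a single zone by chance,
    assuming each incident independently hits a uniformly random node."""
    total = sum(nodes_per_zone.values())
    return sum((n / total) ** k for n in nodes_per_zone.values())


# Hypothetical layout: 7 nodes split 3/2/2 across three zones.
zone_of = {
    "riak1": "us-east-1a", "riak2": "us-east-1a", "riak3": "us-east-1a",
    "riak4": "us-east-1b", "riak5": "us-east-1b",
    "riak6": "us-east-1c", "riak7": "us-east-1c",
}
incidents = ["riak1", "riak3", "riak1", "riak2"]  # made-up example data
print(incidents_by_zone(incidents, zone_of))
print(chance_all_in_one_zone(Counter(zone_of.values()), len(incidents)))
```

With only a handful of incidents, seeing them all in one zone is weak evidence; 
the second number gives a feel for how often that would happen with no 
zone-specific cause at all.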

Below is some sample error.log/crash.log output. I'd appreciate any thoughts 
on what is causing these problems, or on further tests I can run to diagnose 
or troubleshoot them.

Thanks in advance,
Robby

error.log:
2012-08-29 04:50:01.829 [error] <0.1909.0> CRASH REPORT Process <0.1909.0> with 
0 neighbours exited with reason: bad argument in call to 
erlang:binary_to_term(<<131,108,0,0,0,4,104,4,
104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,...>>) in 
mi_buffer:read_value/2 line 162 in gen_server:init_it/6 line 328
2012-08-29 04:50:01.833 [error] <0.1908.0> CRASH REPORT Process <0.1908.0> with 
0 neighbours exited with reason: no match of right hand value 
{error,{badarg,[{erlang,binary_to_term,[
<<131,108,0,0,0,4,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,2,115,111,109,0,0,0,32,55,100,51,97,55,48,50,54,56,101,56,98
,52,100,99,98,101,55,102,100,102,52,54,51,54,98,50,50,102,49,56,97,110,7,1,238,11,38,162,93,200,4,108,0,0,0,1,104,2,100,0,1,112,107,0,1,39,106,104,4,104,3,109,0,0,0,5,108,105,110,107,
115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,16,83,116,97,121,45,...>>],...},...]}}
 in merge_index_backend:start/2 line 47 in gen_fsm:init_it/6 line 379
2012-08-29 04:50:01.836 [error] <0.154.0> Supervisor riak_core_vnode_sup had 
child undefined started with {riak_core_vnode,start_link,undefined} at 
<0.1908.0> exit with reason no match of right hand value 
{error,{badarg,[{erlang,binary_to_term,[<<131,108,0,0,0,4,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0
,0,2,115,111,109,0,0,0,32,55,100,51,97,55,48,50,54,56,101,56,98,52,100,99,98,101,55,102,100,102,52,54,51,54,98,50,50,102,49,56,97,110,7,1,238,11,38,162,93,200,4,108,0,0,0,1,104,2,100,
0,1,112,107,0,1,39,106,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,16,83,116,97,121,45,...>>],...},...]}}
 in merge_index_backend:start/2 line 47 in context child_terminated
2012-08-29 04:50:01.839 [error] <0.1906.0> gen_server riak_core_vnode_manager 
terminated with reason: no match of right hand value 
{error,{{badmatch,{error,{badarg,[{erlang,binary_to_term,[<<131,108,0,0,0,4,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,2,115,111,109,0,0,0,32,55,100,51,97,55,48,50,54,56,
101,56,98,52,100,99,98,101,55,102,100,102,52,54,51,54,98,50,50,102,49,56,97,110,7,1,238,11,38,162,93,200,4,108,0,0,0,1,104,2,100,0,1,112,107,0,1,39,106,104,4,104,3,109,0,0,0,5,108,105,
110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,...>>],...},...]}}},...}}
 in riak_core_vnode_manager:get_vnode/3 line 489
2012-08-29 04:50:01.858 [error] <0.1906.0> CRASH REPORT Process 
riak_core_vnode_manager with 0 neighbours exited with reason: no match of right 
hand value {error,{{badmatch,{error,{badarg,
[{erlang,binary_to_term,[<<131,108,0,0,0,4,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,2,115,111,109,0,0,0,32,55,100
,51,97,55,48,50,54,56,101,56,98,52,100,99,98,101,55,102,100,102,52,54,51,54,98,50,50,102,49,56,97,110,7,1,238,11,38,162,93,200,4,108,0,0,0,1,104,2,100,0,1,112,107,0,1,39,106,104,4,104
,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,...>>],...},...]}}},...}}
 in riak_core_vnode_manager:get_vnode/3 line 489 in gen_server:terminate/6 line 747
2012-08-29 04:50:01.881 [error] <0.152.0> Supervisor riak_core_sup had child 
riak_core_vnode_manager started with riak_core_vnode_manager:start_link() at 
<0.1906.0> exit with reason no match of right hand value 
{error,{{badmatch,{error,{badarg,[{erlang,binary_to_term,[<<131,108,0,0,0,4,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110
,111,116,101,115,109,0,0,0,2,115,111,109,0,0,0,32,55,100,51,97,55,48,50,54,56,101,56,98,52,100,99,98,101,55,102,100,102,52,54,51,54,98,50,50,102,49,56,97,110,7,1,238,11,38,162,93,200,
4,108,0,0,0,1,104,2,100,0,1,112,107,0,1,39,106,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,...>>],...},...]}}},...}}
 in riak_core_vnode_manager:get_vnode/3 line 489 in context child_terminated


crash.log:
2012-08-29 04:48:19 =CRASH REPORT====
  crasher:
    initial call: riak_core_vnode_manager:init/1
    pid: <0.3501.0>
    registered_name: riak_core_vnode_manager
    exception exit: 
{{{badmatch,{error,{{badmatch,{error,{badarg,[{erlang,binary_to_term,[<<131,108,0,0,0,4,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95
,110,111,116,101,115,109,0,0,0,2,115,111,109,0,0,0,32,55,100,51,97,55,48,50,54,56,101,56,98,52,100,99,98,101,55,102,100,102,52,54,51,54,98,50,50,102,49,56,97,110,7,1,238,11,38,162,93,
200,4,108,0,0,0,1,104,2,100,0,1,112,107,0,1,39,106,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,16,83,116,97,121,45,97,116
,45,72,111,109,101,45,77,111,109,109,0,0,0,32,55,100,51,97,55,48,50,54,56,101,56,98,52,100,99,98,101,55,102,100,102,52,54,51,54,98,50,50,102,49,56,97,110,7,1,238,11,38,162,93,200,4,
108,0,0,0,1,104,2,100,0,1,112,107,0,1,31,106,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,2,104,49,109,0,0,0,3,99,111,109,109,0,0,0,32,55,100,51,97,55,48,50,54,56,101,56,98,52,
100,99,98,101,55,102,100,102,52,54,51,54,98,50,50,102,49,56,97,110,7,1,238,11,38,162,93,200,4,108,0,0,0,1,104,2,100,0,1,112,107,0,1,0,106,104,4,104,3,109,0,0,0,5,108,105,110,107,115,
109,0,0,0,9,116,105,109,101,115,116,97,109,112,109,0,0,0,10,49,51,52,54,50,48,52,51,53,52,109,0,0,0,32,55,100,51,97,55,48,50,54,56,101,56,98,52,100,99,98,101,55,102,100,102,52,54,51,54
,98,50,50,102,49,56,97,110,7,1,238,11,38,162,93,200,4,108,0,0,0,1,104,2,100,0,1,112,107,0,0,0,1,43>>],[]},{mi_buffer,read_value,2,[{file,"src/mi_buffer.erl"},{line,162}]},
{mi_buffer,open_inner,3,[{file,"src/mi_buffer.erl"},{line,70}]},{mi_buffer,new,1,[{file,"src/mi_buffer.erl"},{line,62}]},{mi_server,read_buffers,4,[{file,"src/mi_server.erl"},{line,613}]},
{mi_server,read_buf_and_seg,1,[{file,"src/mi_server.erl"},{line,585}]},{mi_server,init,1,[{file,"src/mi_server.erl"},{line,143}]},{gen_server,init_it,6,[{file,"gen_server.erl"},{line,304}]}]
}}},[{merge_index_backend,start,2,[{file,"src/merge_index_backend.erl"},{line,47}]},{riak_search_vnode,init,1,[{file,"src/riak_search_vnode.erl"},{line,135}]},{riak_core_vnode,init,1,
[{file,"src/riak_core_vnode.erl"},{line,123}]},{gen_fsm,init_it,6,[{file,"gen_fsm.erl"},{line,361}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}},
[{riak_core_vnode_manager,get_vnode,3,[{file,"src/riak_core_vnode_manager.erl"},{line,489}]},{riak_core_vnode_manager,maybe_trigger_handoff,3,[{file,"src/riak_core_vnode_manager.erl"},{line,613}]},
{riak_core_vnode_manager,'-trigger_ownership_handoff/3-lc$^2/1-2-',2,[{file,"src/riak_core_vnode_manager.erl"},{line,448}]},{riak_core_vnode_manager,trigger_ownership_handoff,3,[{file
,"src/riak_core_vnode_manager.erl"},{line,448}]},{riak_core_vnode_manager,handle_cast,2,[{file,"src/riak_core_vnode_manager.erl"},{line,378}]},{gen_server,handle_msg,5,
[{file,"gen_server.erl"},{line,607}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]},[{gen_server,terminate,6,[{file,"gen_server.erl"},{line,747}]},{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,227}]}]}
    ancestors: [riak_core_sup,<0.150.0>]
    messages: []
    links: [<0.151.0>]
    dictionary: []
    trap_exit: false
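
The binaries in the logs above are in Erlang's external term format (leading 
byte 131), so the readable parts can be pulled out to see which index entries 
the failing buffer contained. The following is a minimal sketch, not a full 
decoder: it only looks for BINARY_EXT fields (tag 109 followed by a 4-byte 
big-endian length) whose payload is complete and printable. The function names 
are hypothetical, and it assumes the pasted byte list has no number split 
across a wrapped line.

```python
import re
import struct

BINARY_EXT = 109  # tag byte for a binary in Erlang's external term format


def bytes_from_log(text):
    """Collect the decimal byte values printed inside the first <<...>>."""
    start = text.index("<<") + 2
    end = text.find(">>", start)
    body = text[start:] if end == -1 else text[start:end]
    # A trailing "..." (truncated output) contains no digits and is ignored.
    return bytes(int(n) for n in re.findall(r"\d+", body))


def printable_binaries(data):
    """Return BINARY_EXT payloads that are fully present and printable ASCII."""
    found = []
    i = 0
    while i + 5 <= len(data):
        if data[i] == BINARY_EXT:
            (length,) = struct.unpack(">I", data[i + 1:i + 5])
            payload = data[i + 5:i + 5 + length]
            complete = length > 0 and len(payload) == length
            if complete and all(32 <= b < 127 for b in payload):
                found.append(payload.decode("ascii"))
                i += 5 + length  # skip past the payload just consumed
                continue
        i += 1
    return found


line = ("binary_to_term(<<131,108,0,0,0,4,104,4,104,3,109,0,0,0,5,"
        "108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,...>>)")
print(printable_binaries(bytes_from_log(line)))
```

This is a heuristic scan, so a stray 109 byte inside other data could produce 
a false match; a field that is cut off by the log truncation is skipped.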



-- 
Robby Grossman
@freerobby (http://twitter.com/freerobby)
http://rob.by

