We have a Riak cluster on EC2 (large instances, ephemeral storage, ~5 nodes though currently 7 after some shuffling) that has seen multiple nodes go down over the last week due to corrupted merge_indexes. Certain boxes go down more frequently than others, but it's not predictable, and it seems like any arbitrary box can be affected.

The errors look similar to what I read about in this thread:

http://lists.basho.com/pipermail/riak-users_lists.basho.com/2012-July/008933.html

except that it's occurring on multiple nodes, which prevents us from doing repairs from adjacent nodes.

I've tried a few things:

- Stopping riak on a single box, clearing out merge_index/, and restarting riak. This works for several hours to about a day, but the index eventually becomes corrupt again.
- Stopping all nodes, clearing out all merge_index folders, and restarting all nodes. Like above, this works for several hours, but eventually we see corrupted merge indexes again. And obviously, this loses all past index data, so even if it worked it wouldn't be a suitable solution. I just needed to stop the nodes from going down.
- Using Ryan Zezeski's script to detect bad MI files (https://gist.github.com/3250870). There are too many for this to be a sustainable ongoing effort, though.

I spoke to Tom Santero, who hypothesized that there could be something underlying in the EC2 infrastructure that's causing some corruption problems. We don't have a ticket off of EC2 right now, but what I am doing (as I write this) is widening the cluster to span three availability zones (all in US-East). My thinking is that if we continue to see problems but they're all in a specific zone, that would confirm Tom's hypothesis (though the alternative would not disprove it).

Below is some sample error.log/crash.log output. I'd be appreciative if anybody has thoughts as to what is causing these problems, or further tests I can run to diagnose/troubleshoot.
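
The byte lists in the output below are Erlang external term format data (the leading 131 is the format's version byte), so they can be made readable outside of Riak. Here is a minimal Python sketch of a decoder that handles only the tags appearing in these logs. The names decode, decode_term and SAMPLE are my own and not part of Riak or merge_index, and SAMPLE is just the first tuple from the crash.log payload with the version byte put in front of it.

```python
import struct

VERSION = 131  # leading byte of the Erlang external term format


def _take(buf, pos, n):
    """Return n bytes starting at pos, or fail loudly if the data is cut short."""
    if pos + n > len(buf):
        raise ValueError(f"truncated: wanted {n} bytes at offset {pos}")
    return buf[pos:pos + n], pos + n


def decode_term(buf, pos=0):
    """Decode one term starting at pos and return (value, next_pos)."""
    raw, pos = _take(buf, pos, 1)
    tag = raw[0]
    if tag == 104:  # SMALL_TUPLE_EXT: 1-byte arity, then the elements
        raw, pos = _take(buf, pos, 1)
        items = []
        for _ in range(raw[0]):
            item, pos = decode_term(buf, pos)
            items.append(item)
        return tuple(items), pos
    if tag == 108:  # LIST_EXT: 4-byte count, the elements, then a tail term
        raw, pos = _take(buf, pos, 4)
        (count,) = struct.unpack(">I", raw)
        items = []
        for _ in range(count):
            item, pos = decode_term(buf, pos)
            items.append(item)
        tail, pos = decode_term(buf, pos)
        if tail != []:  # this sketch only accepts proper lists (NIL tail)
            raise ValueError("list does not end in NIL")
        return items, pos
    if tag == 106:  # NIL_EXT: the empty list
        return [], pos
    if tag == 109:  # BINARY_EXT: 4-byte length, then raw bytes
        raw, pos = _take(buf, pos, 4)
        (n,) = struct.unpack(">I", raw)
        return _take(buf, pos, n)
    if tag == 107:  # STRING_EXT: 2-byte length, then a list of small integers
        raw, pos = _take(buf, pos, 2)
        (n,) = struct.unpack(">H", raw)
        raw, pos = _take(buf, pos, n)
        return list(raw), pos
    if tag == 100:  # ATOM_EXT: 2-byte length, then the atom name
        raw, pos = _take(buf, pos, 2)
        (n,) = struct.unpack(">H", raw)
        raw, pos = _take(buf, pos, n)
        return raw.decode("latin-1"), pos
    if tag == 110:  # SMALL_BIG_EXT: 1-byte size, sign byte, little-endian digits
        raw, pos = _take(buf, pos, 2)
        n, sign = raw[0], raw[1]
        raw, pos = _take(buf, pos, n)
        value = int.from_bytes(raw, "little")
        return (-value if sign else value), pos
    raise ValueError(f"tag {tag} at offset {pos - 1} is not handled by this sketch")


def decode(byte_values):
    """Decode a complete external-format message given as a list of ints."""
    buf = bytes(byte_values)
    if not buf or buf[0] != VERSION:
        raise ValueError("missing version byte 131")
    value, _ = decode_term(buf, 1)
    return value


# First tuple of the crash.log payload, prefixed with the version byte.
SAMPLE = [131, 104, 4, 104, 3, 109, 0, 0, 0, 5, 108, 105, 110, 107, 115,
          109, 0, 0, 0, 10, 108, 105, 110, 107, 95, 110, 111, 116, 101, 115,
          109, 0, 0, 0, 2, 115, 111,
          109, 0, 0, 0, 32, 55, 100, 51, 97, 55, 48, 50, 54, 56, 101, 56, 98,
          52, 100, 99, 98, 101, 55, 102, 100, 102, 52, 54, 51, 54, 98, 50, 50,
          102, 49, 56, 97,
          110, 7, 1, 238, 11, 38, 162, 93, 200, 4,
          108, 0, 0, 0, 1, 104, 2, 100, 0, 1, 112, 107, 0, 1, 39, 106]

key, doc_id, stamp, props = decode(SAMPLE)
print(key, doc_id, stamp, props)
```

The first element of SAMPLE decodes to the three binaries "links", "link_notes" and "so". Reading those as index, field and term is my assumption, not something the logs state. Running decode on a full payload pasted from crash.log raises ValueError at the offset where the data stops being well-formed, which may help show whether entries are cut short or overwritten.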

Thanks in advance,
Robby

error.log:

2012-08-29 04:50:01.829 [error] <0.1909.0> CRASH REPORT Process <0.1909.0> with 0 neighbours exited with reason: bad argument in call to erlang:binary_to_term(<<131,108,0,0,0,4,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,...>>) in mi_buffer:read_value/2 line 162 in gen_server:init_it/6 line 328

2012-08-29 04:50:01.833 [error] <0.1908.0> CRASH REPORT Process <0.1908.0> with 0 neighbours exited with reason: no match of right hand value {error,{badarg,[{erlang,binary_to_term,[<<131,108,0,0,0,4,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,2,115,111,109,0,0,0,32,55,100,51,97,55,48,50,54,56,101,56,98,52,100,99,98,101,55,102,100,102,52,54,51,54,98,50,50,102,49,56,97,110,7,1,238,11,38,162,93,200,4,108,0,0,0,1,104,2,100,0,1,112,107,0,1,39,106,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,16,83,116,97,121,45,...>>],...},...]}} in merge_index_backend:start/2 line 47 in gen_fsm:init_it/6 line 379

2012-08-29 04:50:01.836 [error] <0.154.0> Supervisor riak_core_vnode_sup had child undefined started with {riak_core_vnode,start_link,undefined} at <0.1908.0> exit with reason no match of right hand value {error,{badarg,[{erlang,binary_to_term,[<<131,108,0,0,0,4,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,2,115,111,109,0,0,0,32,55,100,51,97,55,48,50,54,56,101,56,98,52,100,99,98,101,55,102,100,102,52,54,51,54,98,50,50,102,49,56,97,110,7,1,238,11,38,162,93,200,4,108,0,0,0,1,104,2,100,0,1,112,107,0,1,39,106,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,16,83,116,97,121,45,...>>],...},...]}} in merge_index_backend:start/2 line 47 in context child_terminated

2012-08-29 04:50:01.839 [error] <0.1906.0> gen_server riak_core_vnode_manager terminated with reason: no match of right hand value {error,{{badmatch,{error,{badarg,[{erlang,binary_to_term,[<<131,108,0,0,0,4,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,2,115,111,109,0,0,0,32,55,100,51,97,55,48,50,54,56,101,56,98,52,100,99,98,101,55,102,100,102,52,54,51,54,98,50,50,102,49,56,97,110,7,1,238,11,38,162,93,200,4,108,0,0,0,1,104,2,100,0,1,112,107,0,1,39,106,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,...>>],...},...]}}},...}} in riak_core_vnode_manager:get_vnode/3 line 489

2012-08-29 04:50:01.858 [error] <0.1906.0> CRASH REPORT Process riak_core_vnode_manager with 0 neighbours exited with reason: no match of right hand value {error,{{badmatch,{error,{badarg,[{erlang,binary_to_term,[<<131,108,0,0,0,4,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,2,115,111,109,0,0,0,32,55,100,51,97,55,48,50,54,56,101,56,98,52,100,99,98,101,55,102,100,102,52,54,51,54,98,50,50,102,49,56,97,110,7,1,238,11,38,162,93,200,4,108,0,0,0,1,104,2,100,0,1,112,107,0,1,39,106,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,...>>],...},...]}}},...}} in riak_core_vnode_manager:get_vnode/3 line 489 in gen_server:terminate/6 line 747

2012-08-29 04:50:01.881 [error] <0.152.0> Supervisor riak_core_sup had child riak_core_vnode_manager started with riak_core_vnode_manager:start_link() at <0.1906.0> exit with reason no match of right hand value {error,{{badmatch,{error,{badarg,[{erlang,binary_to_term,[<<131,108,0,0,0,4,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,2,115,111,109,0,0,0,32,55,100,51,97,55,48,50,54,56,101,56,98,52,100,99,98,101,55,102,100,102,52,54,51,54,98,50,50,102,49,56,97,110,7,1,238,11,38,162,93,200,4,108,0,0,0,1,104,2,100,0,1,112,107,0,1,39,106,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,...>>],...},...]}}},...}} in riak_core_vnode_manager:get_vnode/3 line 489 in context child_terminated

crash.log:

2012-08-29 04:48:19 =CRASH REPORT====
  crasher:
    initial call: riak_core_vnode_manager:init/1
    pid: <0.3501.0>
    registered_name: riak_core_vnode_manager
    exception exit: {{{badmatch,{error,{{badmatch,{error,{badarg,[{erlang,binary_to_term,[<<131,108,0,0,0,4,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,2,115,111,109,0,0,0,32,55,100,51,97,55,48,50,54,56,101,56,98,52,100,99,98,101,55,102,100,102,52,54,51,54,98,50,50,102,49,56,97,110,7,1,238,11,38,162,93,200,4,108,0,0,0,1,104,2,100,0,1,112,107,0,1,39,106,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,16,83,116,97,121,45,97,116,45,72,111,109,101,45,77,111,109,109,0,0,0,32,55,100,51,97,55,48,50,54,56,101,56,98,52,100,99,98,101,55,102,100,102,52,54,51,54,98,50,50,102,49,56,97,110,7,1,238,11,38,162,93,200,4,108,0,0,0,1,104,2,100,0,1,112,107,0,1,31,106,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,2,104,49,109,0,0,0,3,99,111,109,109,0,0,0,32,55,100,51,97,55,48,50,54,56,101,56,98,52,100,99,98,101,55,102,100,102,52,54,51,54,98,50,50,102,49,56,97,110,7,1,238,11,38,162,93,200,4,108,0,0,0,1,104,2,100,0,1,112,107,0,1,0,106,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,9,116,105,109,101,115,116,97,109,112,109,0,0,0,10,49,51,52,54,50,48,52,51,53,52,109,0,0,0,32,55,100,51,97,55,48,50,54,56,101,56,98,52,100,99,98,101,55,102,100,102,52,54,51,54,98,50,50,102,49,56,97,110,7,1,238,11,38,162,93,200,4,108,0,0,0,1,104,2,100,0,1,112,107,0,0,0,1,43>>],[]},{mi_buffer,read_value,2,[{file,"src/mi_buffer.erl"},{line,162}]},{mi_buffer,open_inner,3,[{file,"src/mi_buffer.erl"},{line,70}]},{mi_buffer,new,1,[{file,"src/mi_buffer.erl"},{line,62}]},{mi_server,read_buffers,4,[{file,"src/mi_server.erl"},{line,613}]},{mi_server,read_buf_and_seg,1,[{file,"src/mi_server.erl"},{line,585}]},{mi_server,init,1,[{file,"src/mi_server.erl"},{line,143}]},{gen_server,init_it,6,[{file,"gen_server.erl"},{line,304}]}]}}},[{merge_index_backend,start,2,[{file,"src/merge_index_backend.erl"},{line,47}]},{riak_search_vnode,init,1,[{file,"src/riak_search_vnode.erl"},{line,135}]},{riak_core_vnode,init,1,[{file,"src/riak_core_vnode.erl"},{line,123}]},{gen_fsm,init_it,6,[{file,"gen_fsm.erl"},{line,361}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}},[{riak_core_vnode_manager,get_vnode,3,[{file,"src/riak_core_vnode_manager.erl"},{line,489}]},{riak_core_vnode_manager,maybe_trigger_handoff,3,[{file,"src/riak_core_vnode_manager.erl"},{line,613}]},{riak_core_vnode_manager,'-trigger_ownership_handoff/3-lc$^2/1-2-',2,[{file,"src/riak_core_vnode_manager.erl"},{line,448}]},{riak_core_vnode_manager,trigger_ownership_handoff,3,[{file,"src/riak_core_vnode_manager.erl"},{line,448}]},{riak_core_vnode_manager,handle_cast,2,[{file,"src/riak_core_vnode_manager.erl"},{line,378}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,607}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]},[{gen_server,terminate,6,[{file,"gen_server.erl"},{line,747}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
    ancestors: [riak_core_sup,<0.150.0>]
    messages: []
    links: [<0.151.0>]
    dictionary: []
    trap_exit: false

-- 
Robby Grossman
@freerobby (http://twitter.com/freerobby)
http://rob.by