I'm running a 4-"node" cluster on one machine, riak-1.2.0. The configuration is very close to the default development environment setup, except that I've turned on riak search in app.config for each node and added the indexing pre-commit hook and a schema for one bucket (I've tested the hook on individual documents and it indexes them correctly). I added ~5 million documents to this bucket before I turned search on, and from what I've been told the best way to re-index existing documents is to re-add each one (search:index_doc doesn't seem to do anything for me).
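For reference (paraphrasing from memory rather than pasting the real files), the search side of the setup amounts to flipping the enabled flag in the riak_search section of each node's app.config:

    %% etc/app.config (riak_search section), per node
    {riak_search, [
        {enabled, true}
    ]},

and the pre-commit hook was put on the "user" bucket from an attached console with, if I remember the module name correctly, something like:

    riak_search_kv_hook:install(<<"user">>).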
I'm trying to do the re-adding from the local console of one of the nodes in the cluster as follows:

    {ok, C} = riak:local_client().
    {ok, Keys} = C:list_keys(<<"user">>).
    plists:foreach(fun(Key) ->
                       {ok, Doc} = C:get(<<"user">>, Key),
                       C:put(Doc)
                   end, Keys, {processes, 8}).

Eight processes reading from and writing to one cluster in parallel shouldn't be a problem, and it hopefully cuts down the time wasted waiting on IO. Each key is read and re-written by only a single process, so there should be no conflicting concurrent writes to worry about, right?

This has failed every time I've tried it, for one reason or another. On this run, the first bad thing that happened was node 4 going down. From dev/dev4/log/crash.log:

    2012-09-17 20:08:15 =CRASH REPORT====
      crasher:
        initial call: application_master:init/4
        pid: <0.486.0>
        registered_name: []
        exception exit: {{bad_return,{{riak_search_app,start,[normal,[]]},{'EXIT',{badarg,[{ets,lookup,[riak_core_node_watcher,{by_node,'dev4@127.0.0.1'}],[]},{riak_core_node_watcher,internal_get_services,1,[{file,"src/riak_core_node_watcher.erl"},{line,412}]},{riak_core,wait_for_service,2,[{file,"src/riak_core.erl"},{line,435}]},{riak_search_app,start,2,[{file,"src/riak_search_app.erl"},{line,22}]},{application_master,start_it_old,4,[{file,"application_master.erl"},{line,274}]}]}}}},[{application_master,init,4,[{file,"application_master.erl"},{line,138}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
        ancestors: [<0.485.0>]
        messages: [{'EXIT',<0.487.0>,normal}]
        links: [<0.485.0>,<0.7.0>]
        dictionary: []
        trap_exit: true
        status: running
        heap_size: 987
        stack_size: 24
        reductions: 184
      neighbours:

    2012-09-17 20:10:02 =ERROR REPORT====
    Error in process <0.629.0> on node 'dev4@127.0.0.1' with exit value: {function_clause,[{proplists,get_value,[one,{error,{riak_api,pbc_connects},nonexistent_metric},undefined],[{file,"proplists.erl"},{line,222}]},{riak_kv_stat,backwards_compat,3,[{file,"src/riak_kv_stat.erl"},{line,337}]},{riak_kv_stat...

Watching the console on node 1, I saw a lot of errors like:

    20:05:02.546 [error] Supervisor riak_kv_put_fsm_sup had child undefined started with {riak_kv_put_fsm,start_link,undefined} at <0.5165.19> exit with reason {{nodedown,'dev4@127.0.0.1'},{gen_server,call,[{riak_search_vnode_master,'dev4@127.0.0.1'},{riak_vnode_req_v1,890602560248518965780370444936484965102833893376,{server,undefined,undefined},{index_v1,[{<<"user">>,<<"user_profile_user_app_stat_fsh">>,<<"8140">>,<<"503ea1bb81340f1ff4b0dcdd">>,[{p,[0]}],1347926701849954},{<<"user">>,<<"_id">>,<<"503ea1bb81340f1ff4b0dcdd">>,<<"503ea1bb81340f1ff4b0dcdd">>,[{p,[0]}],1347926701849954}]}},infinity]}} in context child_terminated

I started node 4 back up and allowed the operation to continue. Later, node 1, where I was using the console to re-add documents, went down. console.log shows many of these:

    20:13:03.159 [info] Starting hinted_handoff transfer of riak_kv_vnode from 'dev1@127.0.0.1' 1164634117248063262943561351070788031288321245184 to 'dev4@127.0.0.1' 1164634117248063262943561351070788031288321245184
    ...
    20:13:21.660 [info] An outbound handoff of partition riak_search_vnode 251195593916248939066258330623111144003363405824 was terminated for reason: {shutdown,max_concurrency}

along with successful compaction entries. error.log has no entries contemporaneous to the crash.
crash.log has no entries contemporaneous to the crash, just older entries from when node 4 went down:

    2012-09-17 20:05:02 =SUPERVISOR REPORT====
      Supervisor: {local,riak_kv_put_fsm_sup}
      Context:    child_terminated
      Reason:     {{nodedown,'dev4@127.0.0.1'},{gen_server,call,[{riak_search_vnode_master,'dev4@127.0.0.1'},{riak_vnode_req_v1,890602560248518965780370444936484965102833893376,{server,undefined,undefined},{index_v1,[{<<"user">>,<<"user_profile_user_app_stat_fsh">>,<<"8140">>,<<"503ea1bb81340f1ff4b0dcdd">>,[{p,[0]}],1347926701849954},{<<"user">>,<<"_id">>,<<"503ea1bb81340f1ff4b0dcdd">>,<<"503ea1bb81340f1ff4b0dcdd">>,[{p,[0]}],1347926701849954}]}},infinity]}}
      Offender:   [{pid,<0.5165.19>},{name,undefined},{mfargs,{riak_kv_put_fsm,start_link,undefined}},{restart_type,temporary},{shutdown,5000},{child_type,worker}]

Actual console output ends like this:

    20:13:21.755 [info] An outbound handoff of partition riak_search_vnode 1255977969581244695331291653115555720016817029120 was terminated for reason: {shutdown,max_concurrency}
    20:13:24.090 [info] hinted_handoff transfer of riak_search_vnode from 'dev1@127.0.0.1' 342539446249430371453988632667878832731859189760 to 'dev4@127.0.0.1' 342539446249430371453988632667878832731859189760 completed: sent 12953 objects in 2.42 seconds
    20:13:27.511 [info] Pid <0.1421.0> compacted 3 segments for 4446992 bytes in 3.454234 seconds, 1.23 MB/sec
    20:13:37.676 [info] Pid <0.1315.0> compacted 3 segments for 3540694 bytes in 2.156376 seconds, 1.57 MB/sec
    20:13:39.907 [info] Pid <0.1408.0> compacted 3 segments for 3645337 bytes in 2.230437 seconds, 1.56 MB/sec
    20:13:42.860 [info] Pid <0.1439.0> compacted 3 segments for 3242485 bytes in 1.951775 seconds, 1.58 MB/sec
    20:13:47.046 [info] Pid <0.1246.0> compacted 3 segments for 2196290 bytes in 1.181681 seconds, 1.77 MB/sec
    Segmentation fault: 11

This isn't in keeping with the stability other people seem to have with riak, so I'm guessing my cluster is misconfigured. I can attach full logs, app.config, and anything else if needed.

Thanks,
Ted
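P.S. In case it's relevant, the next thing I was planning to try is a more defensive version of the re-add loop: fewer worker processes, and logging/skipping keys whose get or put fails instead of letting a badmatch kill the worker. Roughly this (untested as written, same C and Keys as above):

    ReAdd = fun(Key) ->
                case C:get(<<"user">>, Key) of
                    {ok, Doc} ->
                        case C:put(Doc) of
                            ok       -> ok;
                            PutError -> io:format("put failed for ~p: ~p~n", [Key, PutError])
                        end;
                    GetError ->
                        io:format("get failed for ~p: ~p~n", [Key, GetError])
                end
            end.
    plists:foreach(ReAdd, Keys, {processes, 2}).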
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com