Hi Glory,

On Tue, Feb 11, 2014 at 1:29 AM, Glory Lo <gloryl...@gmail.com> wrote:
>
>
> While indexing, it seems to run fine partway, then I noticed it hangs (it
> froze my machine on a couple of attempts on Linux Mint 13).  Then it
> crashes.  I have 3 nodes running and I only tried indexing one of them,
> doing a search-cmd mybucket dev1/data/leveldb
>

What was the process for indexing? How much data were you indexing? What
content-type? How big is each object? What is your schema?


> My crash log has multiple errors of different sorts which I haven't
> deciphered yet.  However, the last errors, with close timestamps, are as
> follows; they mention some timeouts (likely from the freeze):
>

It's hard to distinguish ripple-effect errors from the original error. I see
some entries that are indicative of disk corruption, but there's a good chance
that only happened because some other error caused merge_index to hard
crash. Could you attach a tar.gz of all your logs?
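
In case it helps, here is a rough sketch of one way to bundle the logs from
all three nodes into a single tar.gz. The function name `bundle_logs`, the
output file name, and the dev1/log, dev2/log, dev3/log paths are my
assumptions about a devrel layout, not something taken from your setup, so
adjust them to wherever your nodes actually write their logs.

```python
import tarfile
from pathlib import Path


def bundle_logs(log_dirs, out_path="riak-logs.tar.gz"):
    """Pack every existing directory in log_dirs into one gzip-compressed tarball.

    Returns the list of directories that were actually added, so missing
    paths are easy to spot.
    """
    added = []
    with tarfile.open(out_path, "w:gz") as tar:
        for d in map(Path, log_dirs):
            if d.is_dir():
                # Keep the directory path as the archive name so that logs
                # from different nodes stay distinguishable.
                tar.add(str(d), arcname=str(d))
                added.append(str(d))
    return added
```

Calling `bundle_logs(["dev1/log", "dev2/log", "dev3/log"])` from the directory
that contains the dev1, dev2 and dev3 folders should leave a riak-logs.tar.gz
next to them, assuming that layout matches yours.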


>
> 2014-02-08 23:15:53 =ERROR REPORT====
> Error in process <0.2799.1> on node 'dev1@127.0.0.1' with exit value:
> {badarg,[{ets,lookup,[145752322,{1118962191081472546749696200048404186924073353216,'
> dev2@127.0.0.1
> '}],[]},{riak_search_client,'-process_terms_1/4-fun-2-',3,[{file,"src/riak_search_client.erl"},{line,295}]},{riak_search_utils,'-ptransform/2-fun-0-',2,[{file,"src/riak_search_utils....
>

This is an error finding the temporary ETS table for building the postings
list. That's a really interesting error to have and makes me wonder if you
somehow hit the ETS system limit. I'm not even sure that is possible given
how high we've raised the default limit.


>
> 2014-02-08 23:18:46 =ERROR REPORT====
> Error in process <0.2350.1> on node 'dev1@127.0.0.1' with exit value:
> {terminated,[{io,format,[<17869.23.0>,"DEBUG: ~p:~p - ~p~n~n
> ~p~n~n",[riak_search_dir_indexer,194,"{ error , Type , Error , erlang :
> get_stacktrace ( )
> }",{error,error,{case_clause,{error,timeout}},[{riak_search_client,'-index_docs/1-fun-0-'...
>

I'm actually a bit baffled as to exactly what this trace is saying. I think
more detail might be in the error.log.


>
> 2014-02-08 23:20:00 =ERROR REPORT====
> Error in process <0.4231.1> on node 'dev1@127.0.0.1' with exit value:
> {{case_clause,{data,4711}},[{cpu_sup,get_uint32_measurement,2,[{file,"cpu_sup.erl"},{line,227}]},{cpu_sup,measurement_server_loop,1,[{file,"cpu_sup.erl"},{line,585}]}]}
>

Yikes, this looks really bad and makes me wonder if this is an environment
issue. cpu_sup belongs to Erlang's os_mon application, so this error should
not be related to search.


>
> 2014-02-08 23:23:37 =ERROR REPORT====
> Error in process <0.6359.1> on node 'dev1@127.0.0.1' with exit value:
> {badarg,[{erlang,binary_to_term,[<<31359
> bytes>>],[]},{mi_segment,iterate_all_bytes,2,[{file,"src/mi_segment.erl"},{line,167}]},{mi_segment_writer,from_iterator,4,[{file,"src/mi_segment_writer.erl"},{line,102}]},{mi_segment_writer,from_iterator...
>

This is typically what you see when data corruption occurs, but it's hard to
say if data corruption caused the other errors or the other errors caused
corruption.
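
One thing that can help with the ordering question is to line up the report
headers from each node's crash.log by timestamp. Below is a minimal sketch of
that idea; the function name `report_timeline` is made up for illustration,
and the header pattern is inferred only from the excerpts quoted in this
thread, so it may miss other report styles. Keep in mind that it only shows
which report was logged first, which is a hint about causation, not proof.

```python
import re
from datetime import datetime

# Matches headers such as "2014-02-08 23:15:53 =ERROR REPORT===="
HEADER = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) =([A-Z ]+) REPORT====")


def report_timeline(lines):
    """Return (timestamp, kind, first body line) for each report, oldest first."""
    reports = []
    for line in lines:
        line = line.rstrip("\n")
        m = HEADER.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            reports.append([ts, m.group(2), ""])
        elif reports and not reports[-1][2] and line.strip():
            # Remember the first non-blank line after the header as a summary.
            reports[-1][2] = line.strip()
    # sorted() is stable, so reports sharing a timestamp keep their input order.
    return sorted((tuple(r) for r in reports), key=lambda r: r[0])
```

Feeding it the concatenated lines of several crash.log files gives a single
timeline, which makes it easier to see whether the binary_to_term failure
shows up before or after the first timeouts.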


>
>
>
> 2014-02-08 23:24:58 =ERROR REPORT====
> ** State machine <0.3211.0> terminating
> ** Last message in was {'EXIT',<0.168.0>,shutdown}
> ** When State == active
> **      Data  ==
> {state,1438665674247607560106752257205091097473808596992,riak_search_vnode,{vstate,1438665674247607560106752257205091097473808596992,merge_index_backend,{state,1438665674247607560106752257205091097473808596992,<0.3212.0>}},undefined,none,undefined,undefined,<0.3221.0>,{pool,riak_search_worker,2,[]},undefined,86616}
> ** Reason for termination =
> ** {timeout,{gen_server,call,[<0.3212.0>,stop]}}
> 2014-02-08 23:24:58 =CRASH REPORT====
>   crasher:
>     initial call: riak_core_vnode:init/1
>     pid: <0.3211.0>
>     registered_name: []
>     exception exit:
> {{timeout,{gen_server,call,[<0.3212.0>,stop]}},[{gen_fsm,terminate,7,[{file,"gen_fsm.erl"},{line,589}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
>     ancestors: [riak_core_vnode_sup,riak_core_sup,<0.162.0>]
>     messages:
> [{'EXIT',<0.3221.0>,shutdown},{#Ref<0.0.1.215952>,ok},{'EXIT',<0.3212.0>,normal}]
>     links: []
>     dictionary: [{random_seed,{27839,21123,25074}}]
>     trap_exit: true
>     status: running
>     heap_size: 46368
>     stack_size: 24
>     reductions: 24758
>   neighbours:
>

This is one of the riak_search vnodes crashing because its merge_index
process crashed, which is expected given the circumstances.


> 2014-02-08 23:24:58 =ERROR REPORT====
> ** State machine <0.5392.1> terminating
> ** Last message in was
> {'$gen_sync_all_state_event',{<0.5390.1>,#Ref<0.0.1.215861>},{shutdown,60000}}
> ** When State == ready
> **      Data  == {state,{[],[]},<0.5393.1>,[],undefined}
> ** Reason for termination =
> ** {timeout,{gen_fsm,sync_send_all_state_event,[<0.5393.1>,stop]}}
> 2014-02-08 23:24:58 =CRASH REPORT====
>   crasher:
>     initial call: riak_core_vnode_worker_pool:init/1
>     pid: <0.5392.1>
>     registered_name: []
>     exception exit:
> {{timeout,{gen_fsm,sync_send_all_state_event,[<0.5393.1>,stop]}},[{gen_fsm,handle_msg,7,[{file,"gen_fsm.erl"},{line,511}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
>     ancestors: [<0.5390.1>,riak_core_vnode_sup,riak_core_sup,<0.162.0>]
>     messages: []
>     links: [<0.5390.1>,<0.5393.1>]
>     dictionary: []
>     trap_exit: false
>     status: running
>     heap_size: 233
>     stack_size: 24
>     reductions: 225
>   neighbours:
>

This is the worker pool crashing, probably because its vnode crashed.


> 2014-02-08 23:25:01 =SUPERVISOR REPORT====
>      Supervisor: {local,riak_core_vnode_sup}
>      Context:    shutdown_error
>      Reason:     {timeout,{gen_server,call,[<0.5436.1>,stop]}}
>      Offender:
> [{nb_children,1},{name,undefined},{mfargs,{riak_core_vnode,start_link,[]}},{restart_type,temporary},{shutdown,300000},{child_type,worker}]
>

These supervisor reports just indicate that vnodes have crashed because of a
timeout. Expected given the circumstances.


>
> Can someone provide some guidance as to where to troubleshoot the issue?
>  If it's timing out, that is a mere symptom of it being in a hung state.
>  What, however, is the root cause of it being stuck, with those other
> errors like badarg and bad match?
>

Attach your logs and I should be able to take a closer look.

-Z
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
