Hi,

Calling the repair function, riak_search_vnode:repair(XXXX), triggers some errors.
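For reference, I run it from the attached Erlang console (`riak attach`); the call below is roughly what I type, with the partition number taken from the handoff error further down (substitute the partition you actually want to rebuild):

    %% From `riak attach` on a cluster node.
    %% The partition index is the one named in the repair/handoff error below.
    riak_search_vnode:repair(1438665674247607560106752257205091097473808596992).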
First, an unexpected message is received by the mi_server gen_server:

13:27:20.165 [error] Unexpected info {#Port<0.17562700>,{data,[2,0,0,0,0,0,0,0,1|<<128>>]}}

Then, after some time, we see the same error as with the search queries:

(sa_riak@172.16.0.121)32>
13:28:20.166 [error] gen_server <0.9186.0> terminated with reason: bad return value: lookup_timeout
13:28:20.166 [error] repair transfer of riak_search_vnode from 'sa_riak@172.16.0.121' 1438665674247607560106752257205091097473808596992 to 'sa_riak@172.16.0.110' 0 failed because of error:{badmatch,{error,{worker_crash,{bad_return_value,lookup_timeout},{fold,#Fun<merge_index_backend.1.86989574>,#Fun<riak_search_vnode.1.38892345>}}}} [{riak_core_handoff_sender,start_fold,5}]
13:28:20.167 [error] CRASH REPORT Process <0.9186.0> with 0 neighbours exited with reason: bad return value: lookup_timeout in gen_server:terminate/6
13:28:20.169 [error] Supervisor poolboy_sup had child riak_core_vnode_worker started with {riak_core_vnode_worker,start_link,undefined} at <0.9186.0> exit with reason bad return value: lookup_timeout in context child_terminated

Do you think this error comes from corrupted data? Has anyone seen this sort of error before?
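Since the plan mentioned in the thread below was to run the repair on every partition, this is roughly the loop I have been using from `riak attach`. Note that riak_core_ring:my_indices/1 is my assumption for listing the partitions owned by the local node:

    %% Run from `riak attach` on each node in turn.
    %% Assumption: riak_core_ring:my_indices/1 lists the partitions owned
    %% by the local node; substitute an explicit partition list otherwise.
    {ok, Ring} = riak_core_ring_manager:get_my_ring().
    [riak_search_vnode:repair(P) || P <- riak_core_ring:my_indices(Ring)].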
Thank you again.

Regards.

Arnaud Wetzel

2012/7/18 Arnaud Wetzel <arnaud.wet...@gmail.com>

> Actually I had never seen this error before, and I don't see it anymore now (maybe because of the migration to 1.2.0-rc1). The problem is difficult to describe because there are different errors every time I run tests. Here is a list of them (the errors appear during uncorrelated Riak Search queries):
>
> {{nocatch,stream_timeout},[{riak_search_op_utils,gather_stream_results,4}]}
> (correlated with other errors)
>
> {{badmatch,{error,emfile}},[{mi_segment,iterate_by_keyinfo,7},{mi_server,'-lookup/8-lc$^1/1-1-',4},{mi_server,'-lookup/8-lc$^1/1-1-',4},{mi_server,lookup,8}]}
>
> After the migration to 1.2.0-rc1 I saw (I had never seen this error before, which is why it is not in the first mail):
>
> {{badfun,#Fun<riak_search_client.9.8393097>},[{mi_server,iterate,6},{mi_server,lookup,8}]}
>
> But the main error (the one that appears most often) is:
>
> {error,{throw,{timeout,range_loop},[{riak_search_backend,collect_info_response,3},{riak_search_op_term,info,3},{riak_search_op_term,preplan,2},{riak_search_op,'-preplan/2-lc$^0/1-0-',2},{riak_search_op_intersection,preplan,2},{riak_search_op,'-preplan/2-lc$^0/1-0-',2},{riak_search_op,'-preplan/2-lc$^0/1-0-',2},{riak_search_op_intersection,preplan,2}]}}
>
> My Riak cluster has 5 nodes; they all now run the 1.2.0-rc1 version of Riak, with the default configuration in app.config. The ulimit is 2048 on all nodes.
> To avoid errors during indexing, I added to vm.args:
> -env ERL_MAX_ETS_TABLES 50000
>
> On each node there is approximately:
> - 53G of merge_index data
> - 26G of bitcask data
> There are around two hundred different Riak Search indexes.
>
> The errors began after indexing many documents into Riak Search (there were 3 nodes at the time): one node reached its disk capacity, so I had to add 2 nodes and restart the indexing, which succeeded, but the errors described above then started on some random search queries.
>
> Thank you very much for your answer; I will try the repair command on every partition tonight.
>
> Regards.
>
> 2012/7/18 Ryan Zezeski <rzeze...@basho.com>
>
>> The `badfun` is a new error. That wasn't in your original email. I'm not sure why you are seeing that. Are all your Riak nodes using 1.2.0-rc1? Can you give me more information on your cluster setup? Are there any other errors in your logs? The more information, the more I can help.
>>
>> The repair "command" is not actually available from the command line yet. You need to attach to the Riak console to access it. The APIs are `riak_kv_vnode:repair(PartitionNumber)` and `riak_search_vnode:repair(PartitionNumber)`.
>>
>> On Wed, Jul 18, 2012 at 1:02 PM, Arnaud Wetzel <arnaud.wet...@gmail.com> wrote:
>>
>>> Ryan,
>>> Increasing "ulimit -n" (the current value is 4096; I have tested from 1024 to 200000) does not change anything; I always get the same errors:
>>> {timeout,range_loop}
>>> lookup/range failure:
>>> {{badfun,#Fun<riak_search_client.9.8393097>},[{mi_server,iterate,6},{mi_server,lookup,8}]}
>>>
>>> I cannot find the "repair" command that you mention in your email (on Riak 1.2.0-rc1). Is it a function directly in an Erlang module, not yet accessible via riak-admin?
>>>
>>> Thank you very much.
>>>
>>> --
>>> Arnaud Wetzel
>>> KBRW Ad-Venture
>>> 13 rue st Anastase, 75003 Paris
>>>
>>> 2012/7/16 Ryan Zezeski <rzeze...@basho.com>
>>>
>>>> Arnaud,
>>>>
>>>> The 'stream_timeout' and 'emfile' should be correlated. Whenever you see the 'emfile' you should see a corresponding timeout: the index server errors cause the result collector to time out later. First, adjust your file descriptor limit and then go from there.
>>>>
>>>> For the 1.2 release a "repair" command has been added to rebuild KV or index data for a given partition. In releases before that you must reindex all your data. You don't have to worry about removing the current indexes, as merge_index will garbage collect them for you as it merges. As I said, first I would fix the 'emfile' issue and then see if further action is needed.
>>>>
>>>> -Z
>>>>
>>>> P.S. If you want to be absolutely sure what your FD limit is in Riak, you can `riak attach` and then run `os:cmd("ulimit -n").` Make sure to use Ctrl-D to exit from the Riak shell.
>>>>
>>>> On Mon, Jul 16, 2012 at 5:21 AM, Arnaud Wetzel <arnaud.wet...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> On Friday evening one of our Riak nodes reached its disk space limit during indexing in Riak Search. Since then, after adding some nodes, some requests fail, and it is impossible to find any correlation between the requests that fail and those that succeed.
>>>>> The errors are:
>>>>>
>>>>> {{nocatch,stream_timeout},[{riak_search_op_utils,gather_stream_results,4}]}
>>>>> {timeout,range_loop}
>>>>>
>>>>> and sometimes (not always):
>>>>>
>>>>> {{badmatch,{error,emfile}},[{mi_segment,iterate_by_keyinfo,7},{mi_server,'-lookup/8-lc$^1/1-1-',4},{mi_server,'-lookup/8-lc$^1/1-1-',4},{mi_server,lookup,8}]}
>>>>>
>>>>> So, has anyone else experienced these errors? Is it possible that they come from the disk-full error? How can I try to repair the merge_index data? If that is not possible, what is the right process to delete all the indexes entirely (only the indexes, keeping the Riak data)?
>>>>>
>>>>> Thank you very much.
>>>>>
>>>>> Regards.
>>>>>
>>>>> --
>>>>> Arnaud Wetzel
>>>>> KBRW Ad-Venture
>>>>> 13 rue st Anastase, 75003 Paris
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com