Hi,

Calling the repair function, riak_search_vnode:repair(XXXX), triggers some errors.
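For reference, I run it from the attached Erlang console (`riak attach`); the call below is roughly what I type, with the partition number taken from the handoff error further down (substitute the partition you actually want to rebuild):

    %% From `riak attach` on a cluster node.
    %% The partition index is the one named in the repair/handoff error below.
    riak_search_vnode:repair(1438665674247607560106752257205091097473808596992).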
First, an unexpected message is received by the mi_server gen_server:

13:27:20.165 [error] Unexpected info {#Port<0.17562700>,{data,[2,0,0,0,0,0,0,0,1|<<128>>]}}

Then, after some time, we see the same error as with the search queries:

(sa_riak@172.16.0.121)32>
13:28:20.166 [error] gen_server <0.9186.0> terminated with reason: bad return value: lookup_timeout
13:28:20.166 [error] repair transfer of riak_search_vnode from 'sa_riak@172.16.0.121' 1438665674247607560106752257205091097473808596992 to 'sa_riak@172.16.0.110' 0 failed because of error:{badmatch,{error,{worker_crash,{bad_return_value,lookup_timeout},{fold,#Fun<merge_index_backend.1.86989574>,#Fun<riak_search_vnode.1.38892345>}}}} [{riak_core_handoff_sender,start_fold,5}]
13:28:20.167 [error] CRASH REPORT Process <0.9186.0> with 0 neighbours exited with reason: bad return value: lookup_timeout in gen_server:terminate/6
13:28:20.169 [error] Supervisor poolboy_sup had child riak_core_vnode_worker started with {riak_core_vnode_worker,start_link,undefined} at <0.9186.0> exit with reason bad return value: lookup_timeout in context child_terminated

Do you think this error comes from corrupted data? Has anyone seen this sort of error before?
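Since the plan mentioned in the thread below was to run the repair on every partition, this is roughly the loop I have been using from `riak attach`. Note that riak_core_ring:my_indices/1 is my assumption for listing the partitions owned by the local node:

    %% Run from `riak attach` on each node in turn.
    %% Assumption: riak_core_ring:my_indices/1 lists the partitions owned
    %% by the local node; substitute an explicit partition list otherwise.
    {ok, Ring} = riak_core_ring_manager:get_my_ring().
    [riak_search_vnode:repair(P) || P <- riak_core_ring:my_indices(Ring)].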
Thank you again.

Regards.

Arnaud Wetzel

2012/7/18 Arnaud Wetzel <arnaud.wet...@gmail.com>

> Actually I had never seen this error before, and I don't see it anymore now (maybe because of the migration to 1.2.0-rc1). The problem is difficult to describe because there are different errors every time I run tests. Here is a list of them (the errors appear during uncorrelated Riak Search queries):
>
> {{nocatch,stream_timeout},[{riak_search_op_utils,gather_stream_results,4}]}
> (correlated with other errors)
>
> {{badmatch,{error,emfile}},[{mi_segment,iterate_by_keyinfo,7},{mi_server,'-lookup/8-lc$^1/1-1-',4},{mi_server,'-lookup/8-lc$^1/1-1-',4},{mi_server,lookup,8}]}
>
> After the migration to 1.2.0-rc1 I saw (I had never seen this error before, which is why it is not in the first mail):
>
> {{badfun,#Fun<riak_search_client.9.8393097>},[{mi_server,iterate,6},{mi_server,lookup,8}]}
>
> But the main error (the one that appears most often) is:
>
> {error,{throw,{timeout,range_loop},[{riak_search_backend,collect_info_response,3},{riak_search_op_term,info,3},{riak_search_op_term,preplan,2},{riak_search_op,'-preplan/2-lc$^0/1-0-',2},{riak_search_op_intersection,preplan,2},{riak_search_op,'-preplan/2-lc$^0/1-0-',2},{riak_search_op,'-preplan/2-lc$^0/1-0-',2},{riak_search_op_intersection,preplan,2}]}}
>
> My Riak cluster has 5 nodes; they all now run the 1.2.0-rc1 version of Riak, with the default configuration in app.config. The ulimit is 2048 on all nodes.
> To avoid errors during indexing, I added to vm.args:
> -env ERL_MAX_ETS_TABLES 50000
>
> On each node there is approximately:
> - 53G of merge_index data
> - 26G of bitcask data
> There are around two hundred different Riak Search indexes.
>
> The errors began after indexing many documents into Riak Search (there were 3 nodes at the time): one node reached its disk capacity, so I had to add 2 nodes and restart the indexing, which succeeded, but the errors described above then started on some random search queries.
>
> Thank you very much for your answer; I will try the repair command on every partition tonight.
>
> Regards.
>
> 2012/7/18 Ryan Zezeski <rzeze...@basho.com>
>
>> The `badfun` is a new error. That wasn't in your original email. I'm not sure why you are seeing that. Are all your Riak nodes using 1.2.0-rc1? Can you give me more information on your cluster setup? Are there any other errors in your logs? The more information, the more I can help.
>>
>> The repair "command" is not actually available from the command line yet. You need to attach to the Riak console to access it. The APIs are `riak_kv_vnode:repair(PartitionNumber)` and `riak_search_vnode:repair(PartitionNumber)`.
>>
>> On Wed, Jul 18, 2012 at 1:02 PM, Arnaud Wetzel <arnaud.wet...@gmail.com> wrote:
>>
>>> Ryan,
>>> Increasing "ulimit -n" (the current value is 4096; I have tested from 1024 to 200000) does not change anything; I always get the same errors:
>>> {timeout,range_loop}
>>> lookup/range failure:
>>> {{badfun,#Fun<riak_search_client.9.8393097>},[{mi_server,iterate,6},{mi_server,lookup,8}]}
>>>
>>> I cannot find the "repair" command that you mention in your email (on Riak 1.2.0-rc1). Is it a function directly in an Erlang module, not yet accessible via riak-admin?
>>>
>>> Thank you very much.
>>>
>>> --
>>> Arnaud Wetzel
>>> KBRW Ad-Venture
>>> 13 rue st Anastase, 75003 Paris
>>>
>>> 2012/7/16 Ryan Zezeski <rzeze...@basho.com>
>>>
>>>> Arnaud,
>>>>
>>>> The 'stream_timeout' and 'emfile' should be correlated. Whenever you see the 'emfile' you should see a corresponding timeout: the index server errors cause the result collector to time out later. First, adjust your file descriptor limit and then go from there.
>>>>
>>>> For the 1.2 release a "repair" command has been added to rebuild KV or index data for a given partition. In releases before that you must reindex all your data. You don't have to worry about removing the current indexes, as merge_index will garbage collect them for you as it merges. As I said, first I would fix the 'emfile' issue and then see if further action is needed.
>>>>
>>>> -Z
>>>>
>>>> P.S. If you want to be absolutely sure what your FD limit is in Riak, you can `riak attach` and then run `os:cmd("ulimit -n").` Make sure to use Ctrl-D to exit from the Riak shell.
>>>>
>>>> On Mon, Jul 16, 2012 at 5:21 AM, Arnaud Wetzel <arnaud.wet...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> On Friday evening one of our Riak nodes reached its disk space limit during indexing in Riak Search. Since then, after adding some nodes, some requests fail, and it is impossible to find any correlation between the requests that fail and those that succeed.
>>>>> The errors are:
>>>>>
>>>>> {{nocatch,stream_timeout},[{riak_search_op_utils,gather_stream_results,4}]}
>>>>> {timeout,range_loop}
>>>>>
>>>>> and sometimes (not always):
>>>>>
>>>>> {{badmatch,{error,emfile}},[{mi_segment,iterate_by_keyinfo,7},{mi_server,'-lookup/8-lc$^1/1-1-',4},{mi_server,'-lookup/8-lc$^1/1-1-',4},{mi_server,lookup,8}]}
>>>>>
>>>>> So, has anyone else experienced these errors? Is it possible that they come from the disk-full error? How can I try to repair the merge_index data? If that is not possible, what is the right process to delete all the indexes entirely (only the indexes, keeping the Riak data)?
>>>>>
>>>>> Thank you very much.
>>>>>
>>>>> Regards.
>>>>>
>>>>> --
>>>>> Arnaud Wetzel
>>>>> KBRW Ad-Venture
>>>>> 13 rue st Anastase, 75003 Paris
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com