Re: Search failure: stream_timeout

Ryan Zezeski Thu, 27 Oct 2011 13:06:46 -0700

On Fri, Oct 21, 2011 at 5:44 PM, Elias Levy <fearsome.lucid...@gmail.com> wrote:

> On Fri, Oct 21, 2011 at 2:23 PM, Elias Levy
> <fearsome.lucid...@gmail.com> wrote:
>
>> I found that if I limited the timestamps to a range that covers a
>> reasonable number of records the query succeeds.  But if the query is of the
>> form 'ts:[0 TO 1319228408]', then Riak generates that error and the client
>> connection is shut down.  I am guessing that the query covers too many
>> records, which is causing the nodes to take longer than expected to respond,
>> and that some timeout is being reached and Riak kills the query.  Is that
>> correct?
>>
>
> I should probably have mentioned that there are other terms in the query
> that limit the results.  I am now wondering if this is caused by the fact
> that Riak Search is sharded by term, rather than by document, causing it to
> search for each term in the query independently and then create an
> intersection of matches to return as the query result.
>
> If that is the case, then a single query term that selects a large portion
> of the index will cause trouble, even if other terms limit the results, as
> the system will need to return a good portion of the keys in the bucket
> before they can be whittled down by the other query terms.
>
> If so, it would seem the only solution is to break up the query into
> smaller, more manageable chunks and aggregate them on the client side.  Is
> this correct?
>
>
Elias,

This is indeed the case.  Even for an intersection, Search runs every
sub-query to completion and then combines the results at a coordinator
according to the query plan.  If any sub-query returns a large number of
results, latency starts to suffer and timeouts may occur.  However, all
hope is not lost.  Search has a notion of "inline fields."  These fields
are analyzed just like any other field, but unlike a normal field their
values are stored alongside _every_ term entry in the index for efficient
access.  For example, if you have an object like the following, you could
index the `bio` field as usual but index the `born` field as an inline
field (and use the no-op analyzer).  For each term entry generated from the
`bio` field, Search will then store a copy of the `born` field _inside_ the
index.

{"name":"Ryan Zezeski",
 "born":"1983-03-17",
 "gender":"male",
 "bio":"Once upon a time, a child was begot between an Irish/Polish man and a Welsh/Italian woman..."}

At query time Search can filter on this field at the index level rather
than merging the results at the coordinator.  This can save network,
memory, and time overhead.  The price you pay is more disk space.  It also
requires that you have at least one field indexed normally and that you use
it at query time.  For this example you could use `name` as the query field,
but something like "name:Ryan" could still come back with a lot of results.
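
To make the difference concrete, here is a minimal sketch in Python.  It is
a toy in-memory model, not Riak Search code: the index layout, the function
names (`build_index`, `coordinator_merge`, `inline_filter`) and the sample
documents are all assumptions made for illustration.  It only contrasts
merging two full result sets at a coordinator with filtering each posting on
an inline copy of `born` while scanning the `bio` term.

```python
from collections import defaultdict

# Hypothetical toy documents; field names follow the example object above.
DOCS = {
    "k1": {"bio": "irish polish welsh", "born": "1983-03-17"},
    "k2": {"bio": "welsh poet", "born": "1950-01-02"},
    "k3": {"bio": "polish engineer", "born": "1990-07-30"},
}

def build_index(docs, inline_field=None):
    """Map (field, term) -> list of (key, inline_value) postings."""
    index = defaultdict(list)
    for key, doc in docs.items():
        inline_value = doc.get(inline_field) if inline_field else None
        for term in doc["bio"].split():
            # Each bio posting carries a copy of the inline field's value.
            index[("bio", term)].append((key, inline_value))
        # `born` is also indexed normally so the merge path can query it.
        index[("born", doc["born"])].append((key, None))
    return index

def coordinator_merge(index, term, lo, hi):
    """Run both sub-queries to completion, then intersect the key sets."""
    bio_keys = {k for k, _ in index[("bio", term)]}
    born_keys = {
        k
        for (field, value), posts in index.items()
        if field == "born" and lo <= value <= hi
        for k, _ in posts
    }
    shipped = len(bio_keys) + len(born_keys)  # keys sent to the coordinator
    return bio_keys & born_keys, shipped

def inline_filter(index, term, lo, hi):
    """Filter each bio posting on its inline `born` copy while scanning."""
    hits = {k for k, born in index[("bio", term)] if lo <= born <= hi}
    return hits, len(hits)

index = build_index(DOCS, inline_field="born")
merged = coordinator_merge(index, "welsh", "1980-01-01", "1999-12-31")
filtered = inline_filter(index, "welsh", "1980-01-01", "1999-12-31")
```

Both paths return the same single key, `k1`, but in this toy model the merge
path moves four keys to the coordinator while the inline path moves one.
The extra copy of `born` on every `bio` posting is the disk cost mentioned
above.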

Here are my general guidelines for inline fields:

1) You have a field that you always query and that should return a minimal
number of results.  Minimal being the fewest you can get away with.  E.g.
querying `bio` fits this description.

2) You have a field that you want to filter on but that could potentially
match a large subset of the index.  E.g. `born` and `gender` fit that
category.

3) You have extra disk space to dole out which will be proportional to the
size of your inline field.

If you don't have #1 then I would say it's a sign you probably want to use
secondary indices because #2 is basically a form of tagging and that's what
secondary indices were built for.

To learn more about inline fields, check out
https://github.com/rzezeski/try-try-try/tree/master/2011/riak-search-inline-fields

-Ryan

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
