Ryan,

Thanks for the detailed reply.
We're a node.js shop and we're hitting Riak via the riak-js library, which AFAIK just uses the HTTP client. We haven't actually scaled out to the point where we're seeing huge numbers of results returned, but if we extrapolate from the numbers we are seeing, it's clear that we'll start to hit these limits. We have a lot of #2-type queries because we support full-text search, and unfortunately our "text" tends to be a lot longer than 140 characters, so I really don't think inlining it is practical.

--
Greg

Clipboard <http://www.clipboard.com/> is hiring <http://www.clipboard.com/jobs>!

On Fri, Jul 15, 2011 at 11:49 AM, Ryan Zezeski <rzeze...@basho.com> wrote:

> Greg,
>
> I'm curious to know how you are querying Search (i.e. which client/API) and what your setup looks like (i.e. number of nodes, disk, network between them, etc.). How many of these #2 query types are you seeing, and what's the average number of results being returned? Is your search instance opened up to user queries?
>
> Ignoring any other inefficiencies in the Search code, I'd say there are two main points working against you in query #2:
>
> 1.) Search uses a _global index_ [1], which means that, as Rusty would say, it's partitioned by term (as opposed to a _local index_, which would partition by document). It has been found, in cases where the query contains fewer terms than the processor count, that a global index scales out better, as it has better concurrency characteristics [1]. However, a local index can balance load better, and in the case of #2 could possibly drop latency times. When you perform that "text:guitar AND date:07/14/11" query, each term is queried by one, and only one, vnode. These results are then combined by a coordinator (called a broker in [1]). How much using a local index would change your observed latency is unknown to me, but it would have other effects that may offset any latency drops.
>
> 2.) Search is not limiting the results returned.
> Instead of sorting the results based on relevance and only taking the top N from each query term, it's getting all the data for each term, passing it to the coordinator, and saying "hey, you figure this out." When you run that #2 query, as you said, you are bounded by your largest result set, because that's the only way for Search to know it has correctly answered the query. Said another way, you may only have 10 entries for the 14th, but you still need to scan all 100K "guitar" entries to determine which ones fall on that date. Performing some sort of relevance calculation beforehand and limiting the result would certainly help, but it also means you won't get the full result set. I don't think there's any way to have your cake and eat it too unless you can afford to put everything in memory.
>
> At this point in time I think inline fields will be your best bet. Inline fields help dramatically here because you are essentially bringing part of your document into each term entry in the index. This means you can query on high-cardinality terms and filter on lower ones, i.e. low latency (high and low being relative here). If you're worried about disk space, then you can make use of the `only` value for the inline attribute. This tells Search to store the field's value inline but _not_ index it. If you're only using the `text` field to filter results, then this is exactly what you should do. In fact, I would recommend you do that, because any search against the `text` field for a corpus of tweets is probably going to have a large result set.
>
> HTH,
> -Ryan
>
> [1]: C.S. Badue. Distributed query processing using partitioned inverted files. Master's thesis, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil, March 2001.
>
> On Thu, Jul 14, 2011 at 6:20 PM, Greg Pascale <g...@clipboard.com> wrote:
>
>> Hi Ryan,
>>
>> Yes, we are using 14.2.
>>
>> I think I get what you are saying about inline fields.
>> It looks like it will fix some of our problems, but not all of them. If you'll indulge me in a little contrived example, I think I can explain what I mean.
>>
>> Let's say I'm implementing a simple Twitter clone. In my system, tweets have a user, text, and date, and can be either public or private, so one might look like:
>>
>> {
>>   username: "greg",
>>   public: true,
>>   text: "I bought a new guitar!",
>>   date: "07/14/11"
>> }
>>
>> Let's say that my service is popular: I have 100k users, each of whom has published exactly 100 tweets, and exactly half of the tweets are public, the rest private. So I have 10 million tweets total, and 5 million of them are public.
>>
>> Query 1: "username:greg AND public:false" finds all tweets by greg that are private. In this case, "username:greg" matches 100 results and "public:false" matches 5 million. My tests have shown that the performance of this compound query is roughly equivalent to the worse of the two ("public:false"), so this query will be way too slow. However, since "public" values are nice and small, I can inline that field for a big win.
>>
>> Query 2: "text:guitar AND date:07/14/11" finds all tweets from today that contain "guitar". Suppose there are 100k tweets that contain "guitar", but it's early in the day, so there have only been 1k tweets in total on 07/14/11. My result set is therefore no bigger than 1k (probably much smaller, unless all my users bought new guitars today), but this query is still bounded by the "text:guitar" piece, which matches 100k results. In this case, inlining date wouldn't help because it's not the slow part, and indexing on the text field isn't practical because it's too big.
>>
>> If you could follow that, am I understanding this correctly? In our system, we have some queries like #1 above, but since we support full-text search, many like #2 as well.
>> Do you have any suggestions for what we could do to make queries like #2 performant?
>>
>> Thanks,
>> -Greg
>>
>> On Wed, Jul 13, 2011 at 7:19 PM, Ryan Zezeski <rzeze...@basho.com> wrote:
>>
>>> Greg,
>>>
>>> I'm assuming you are using 14.2.
>>>
>>> 1) There is a bug in 14.2 that will cause a (potentially very fast growing) memory leak when using AND. This is unfortunate, sorry. The good news is I have since patched it [1].
>>>
>>> 2) This is your best course of action, and you were so close, but you've actually crossed your fields. That is, the inline field should be the one that contains the more common term (i.e. the `text` field). So you should perform a range query on your date with a filter on the inline text field. Obviously, the more terms in this field, the more the index will inflate (space-wise), but if you can live with that, then it should reduce your latency substantially (famous last words). Please try this and get back to me.
>>>
>>> 3) That is a very well-written article; props to the author. However, I would leave this as a last resort. Try what I mentioned in #2, and if that's not enough to get you by, then let's brainstorm.
>>>
>>> [1]: https://github.com/basho/riak_search/commit/cd910c2519f94e9d7e8a8e21894db9d0eecdd5b4
>>>
>>> On Wed, Jul 6, 2011 at 2:43 PM, Greg Pascale <g...@clipboard.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm looking at ways to improve riaksearch queries that produce a lot of matches.
>>>>
>>>> In my use case, I only ever want the top 20 results for any query, and results should be ordered by date (which is encoded in the key). For searches with few matches (numFound < ~1000), performance is great. For searches with more matches (numFound > ~10000), performance starts to lag even though I only ever want the top 20.
>>>> I assume this is because the system needs to fetch and sort all of the results to know what the top 20 are, but I'm hoping I can exploit the constraints of my use case in some way to increase performance. I've looked at the following approaches.
>>>>
>>>> 1) AND the "text:" term with a small date range (e.g. text:<common word> AND date:[<yesterday to today>]). This reduces the result set, but performance does not improve. At best, the performance is as good as simply doing the "text:<common word>" search without the date range, and in some cases worse.
>>>>
>>>> 2) Same as above, but make the date an inline field. From what I could find on the topic, it sounded like this is exactly what inline fields are for, but I was disappointed to discover it performed far worse than even the compound query above.
>>>>
>>>> 3) In this article <http://blog.inagist.com/searching-with-riaksearch>, which I was linked to from somewhere on the Basho site, the author describes a technique in which he calls search_fold directly and stops after he's received enough results. He claims this is possible in his case because results are returned in key order, and he's chosen his keys to match the desired ordering of his results. My keys have the same property, as I'm already using the presort=key option. Is this behavior of search_fold a lucky side effect, or is it actually guaranteed to work?
>>>>
>>>> Am I simply expecting too much of riaksearch here, or is there a way to make this work? If all else fails, I suppose I could divide my data into more buckets, but I'm hoping to avoid that, as it would make querying much more complex.
>>>>
>>>> Thanks,
>>>> -Greg
>>>>
>>>> _______________________________________________
>>>> riak-users mailing list
>>>> riak-users@lists.basho.com
>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
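P.S. For anyone following along, the early-termination idea from the inagist article (approach 3) boils down to the sketch below. It's plain JavaScript with made-up stand-ins for Riak's internals: the `guitarKeys` posting list, `fullScan`, and `foldTopN` are hypothetical illustrations, not riak-js or riak_search APIs. The point is only to show why a key-ordered fold that stops at 20 touches far fewer entries than an intersection that must materialize the common term's full result set.

```javascript
// Posting list for a common term ("guitar"), already in key (i.e. date) order,
// mimicking what presort=key gives you. 100k entries, as in the example above.
var guitarKeys = [];
for (var i = 0; i < 100000; i++) {
  guitarKeys.push('tweet-' + String(i).padStart(6, '0'));
}

// Intersection style: every entry of the largest posting list is touched,
// which is why the compound query is bounded by "text:guitar".
function fullScan(keys, predicate) {
  var touched = 0, matches = [];
  keys.forEach(function (k) {
    touched++;
    if (predicate(k)) matches.push(k);
  });
  return { matches: matches, touched: touched };
}

// search_fold style: keys arrive in sorted order, so we can stop
// as soon as the top N results are in hand.
function foldTopN(keys, predicate, n) {
  var touched = 0, matches = [];
  for (var i = 0; i < keys.length && matches.length < n; i++) {
    touched++;
    if (predicate(keys[i])) matches.push(keys[i]);
  }
  return { matches: matches, touched: touched };
}

// Pretend 1 in 100 guitar tweets falls on today's date.
function isToday(key) {
  return parseInt(key.slice(6), 10) % 100 === 0;
}

var full = fullScan(guitarKeys, isToday);
var fold = foldTopN(guitarKeys, isToday, 20);
console.log(full.touched); // 100000 — bounded by the big term
console.log(fold.touched); // 1901 — stops once 20 results are found
```

Whether search_fold is actually guaranteed to return keys in order is exactly the open question in the thread; the sketch only shows the payoff if it is.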
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com