Hi, Julien.

On Sat, Jun 1, 2013 at 5:27 PM, Julien Genestoux <julien.genest...@gmail.com> wrote:
> Yet, due to a bug in our implementation, we have 'lost' some entries. In
> other words, some feedKey-entryKey elements are not in any feed object.
…
> Our initial solution was to list all the feed keys, and then, for each,

Is it possible that there are feedKey-entryKey objects for which there is
no feed object at all? The problem as you described it made it sound like
the feed object always exists, but may just be missing an entry. I ask
whether the feed object might be missing entirely because, if it is, the
initial solution you describe (listing all feed keys) won't work regardless
of speed: it will never find some of the entry key prefixes. If that is the
case, you have no choice but to list all entry keys.

> We're now thinking there may be a better way? Maybe with a single mapReduce
> job which would iterate over all the entry keys and then only keep track
> of the feedKey that have more than 10 elements? This would probably cut down
> very significantly the number of map reduce as we would run them only
> on the few (maybe 1%?) feedKey for which there are 'lost' entries?
>
> Maybe there would be a better way? Any idea?

I might suggest removing MapReduce from the equation entirely, and listing
keys straight to the client for processing. Finding anything with "more
than X instances" in a Riak MapReduce is a difficult task, because you have
to build the entire result set on one node. There is no way to trim it down
as work progresses: you can't know whether you have seen all entries for a
feed until you have seen all entries, period, so ignoring feeds with 10 or
fewer elements can't happen until the end of processing. If the total
number of feed objects is small, this may be workable, but if not, managing
the large result set will be tricky at best (timeouts, retries, etc.), and
all but impossible with a JS reduce phase (because of the time required to
transfer the encoded data out to SpiderMonkey and back).

Streaming all keys to a client is also expensive, but handling retries
after a timeout, or bugs in your sorting/filtering logic, becomes much
simpler, since you won't have to worry about hammering the Riak cluster.
You can sort and re-sort that list locally, count entries per feedKey to
find the ones with more than 10 elements, and compare the result against
other plans before committing to additional cluster time. A rough sketch
of that client-side pass follows.
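To make that concrete, here is an untested Python sketch of the client-side
pass over Riak's HTTP interface. The bucket name 'entries' and the
'<feedKey>-<entryKey>' key format are guesses at your schema, so adjust
both; listing keys this way is still a full-bucket operation, so run it at
a time your cluster can absorb the load:

# Rough sketch, untested: list every key in the entry bucket over HTTP and
# count entry keys per feedKey prefix. 'entries' and the
# '<feedKey>-<entryKey>' key format are assumptions; adjust to your schema.
from collections import Counter
import requests

RIAK = "http://localhost:8098"   # any node in the cluster
BUCKET = "entries"               # hypothetical bucket name

# keys=true makes Riak gather the whole key list before responding;
# acceptable for a one-off audit, but expensive on a big bucket.
resp = requests.get("%s/buckets/%s/keys" % (RIAK, BUCKET),
                    params={"keys": "true"})
resp.raise_for_status()
keys = resp.json()["keys"]

# Count entry keys per feedKey (everything before the first '-').
counts = Counter(key.split("-", 1)[0] for key in keys)

# feedKeys with more than 10 entry objects are the candidates to re-check.
suspects = [feed for feed, n in counts.items() if n > 10]
print("%d of %d feedKeys have more than 10 entries"
      % (len(suspects), len(counts)))

Once you have that suspect list, you can fetch and repair just those feed
objects instead of touching every feed.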
In addition, if you're using the eleveldb backend, the next release of Riak
will bring the ability to paginate 2i results. So, you could make streaming
all keys to a client less punishing by requesting just a few keys at a time
from the '$bucket' index (see the P.S. for a rough sketch). This capability
is committed on our master branches, linked from
https://github.com/basho/riak_kv/pull/540

HTH,
Bryan
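P.S. Here is the same counting pass reworked to page through the '$bucket'
index, assuming the pagination API lands as described in the pull request
above (the 'max_results' and 'continuation' parameter names come from that
work; double-check them against the release you end up running):

# Untested sketch against the not-yet-released paginated 2i API: walk the
# '$bucket' index a page at a time instead of listing every key at once.
from collections import Counter
import requests

RIAK = "http://localhost:8098"
BUCKET = "entries"               # hypothetical bucket name, as above
PAGE = 1000                      # keys per request

counts = Counter()
params = {"max_results": PAGE}
while True:
    resp = requests.get("%s/buckets/%s/index/$bucket/%s"
                        % (RIAK, BUCKET, BUCKET), params=params)
    resp.raise_for_status()
    body = resp.json()
    counts.update(key.split("-", 1)[0] for key in body.get("keys", []))
    if "continuation" not in body:
        break                    # no more pages
    params["continuation"] = body["continuation"]

suspects = [feed for feed, n in counts.items() if n > 10]
print("%d feedKeys have more than 10 entries" % len(suspects))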