Hi, Julien.

On Sat, Jun 1, 2013 at 5:27 PM, Julien Genestoux
<julien.genest...@gmail.com> wrote:
> Yet, due to a bug in our implementation, we have 'lost' some entries. In
> other words, some feedKey-entryKey elements are not in any feed object.
…
> Our initial solution was to list all the feed keys, and then, for each,

Is it possible that there are feedKey-entryKey objects for which there
is no feed object at all? The problem as you described it made it sound
like the feed object always exists but may just be missing an entry. I
ask because if a feed object can be missing entirely, then the initial
solution you describe (listing all feed keys) won't work regardless of
speed: it will never find some of the entry key prefixes. In that case
you have no choice but to list all entry keys.
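
(To make that concrete, here is a minimal sketch of checking whether a
single entry's parent feed object exists. The 'feeds' bucket name, the
'<feedKey>-<entryKey>' key shape, and the exact Python client calls are
all assumptions on my part; adjust for your actual layout and client
version.)

    import riak

    client = riak.RiakClient(pb_port=8087)
    # hypothetical bucket holding the feed objects
    feeds = client.bucket('feeds')

    def feed_exists(entry_key):
        # assumed key layout: '<feedKey>-<entryKey>'
        feed_key = entry_key.partition('-')[0]
        # RiakObject.exists is False when the fetch finds nothing
        return feeds.get(feed_key).exists

If feed_exists() comes back False for any entry key, then listing feed
keys alone can never surface that entry.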

> We're now thinking there may be a better way? Maybe with a single mapReduce
> job which would iterate over all the entry keys and then only keep track
> of the feedKey that have more than 10 elements? This would probably cut down
> very significantly the number of map reduce as we would run them only
> on the few (maybe 1%?) feedKey for which there are 'lost' entries?
>
> Maybe there would be a better way? Any idea?

I might suggest removing MapReduce from the equation entirely and
listing keys straight to the client for processing. Finding anything
with "more than X instances" via Riak MapReduce is a difficult task,
because you have to build the entire result set on one node. There is
no way to trim it down as work progresses: you can't know whether you
have seen all entries for a feed until you have seen all entries,
period, so feeds with 10 or fewer elements can't be discarded until the
end of processing. If the total number of feed objects is small, this
may be workable, but if not, managing the large result set will be
tricky at best (timeouts, retries, and so on), and with a JavaScript
reduce phase it will be impossible, because of the time required to
transfer the encoded data out to SpiderMonkey and back.

Streaming all keys to a client is also expensive, but recovering from
timeouts, or from bugs in your sorting/filtering logic, will be much
simpler, since you won't have to worry about hammering the Riak
cluster. You can sort and re-sort that list locally, test the
"feedKeys with more than 10 elements" idea, and compare it to other
plans before committing to additional cluster time.
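
To sketch what that client-side pass might look like (illustration
only, not your code): assuming an 'entries' bucket, keys shaped
'<feedKey>-<entryKey>', and a client that can stream keys (the Python
client's stream_keys(), for instance; names vary by client and
version):

    from collections import Counter

    import riak

    client = riak.RiakClient(pb_port=8087)
    # hypothetical bucket holding the entry objects
    entries = client.bucket('entries')

    per_feed = Counter()
    for batch in entries.stream_keys():   # keys arrive in batches
        for key in batch:
            feed_key = key.partition('-')[0]
            per_feed[feed_key] += 1

    # everything below is local work, no further load on the cluster
    suspects = [fk for fk, n in per_feed.items() if n > 10]
    print("%d feedKeys have more than 10 entries" % len(suspects))

Once you have per_feed in hand, you can slice it any way you like
(different thresholds, sampling, spot-checking against the feed
objects) without touching the cluster again.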

In addition, if you're using the eleveldb backend, then the next
release of Riak will bring the ability to paginate 2i results. That
would make streaming all keys to a client less punishing, since you
could request just a few keys at a time from the '$bucket' index. This
capability is committed on our master branches, linked from
https://github.com/basho/riak_kv/pull/540.
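
For illustration only, since none of this is released yet: assuming the
HTTP query ends up taking max_results/continuation parameters (my
guess, based on that work), paging through the '$bucket' index might
look roughly like this; bucket name and port are placeholders:

    import requests

    # '$bucket' indexes every key under its own bucket name
    BASE = 'http://127.0.0.1:8098/buckets/entries/index/$bucket/entries'

    def page_keys(page_size=1000):
        continuation = None
        while True:
            params = {'max_results': page_size}
            if continuation:
                params['continuation'] = continuation
            body = requests.get(BASE, params=params).json()
            for key in body.get('keys', []):
                yield key
            continuation = body.get('continuation')
            if not continuation:
                return

That keeps each request small and lets you resume from the last
continuation token if anything times out along the way.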

HTH,
Bryan
