On Fri, Oct 7, 2011 at 1:50 AM, Fyodor Yarochkin <fyodo...@armorize.com> wrote:
> Here's one of the queries that consistently generates series of
> 'fitting_died' log messages:
>
> {
>   "inputs":{
>       "bucket":"test",
>       "index":"integer_int",
…
>   },
>   "query":[
>    {"map":{"language":"javascript",
…
>    },
>    {"reduce":{"language":"javascript",
…
>  {"reduce":{"language":"javascript",
…
>    ],"timeout": 9000
> }
>
> produces over hundred of " "Supervisor riak_pipe_vnode_worker_sup had
> child at module undefined at <0.28835.0> exit with reason fitting_died
> in context child_terminated" entries in log file and returns 'timeout'

My interpretation of your report is that 9 seconds is not long enough
to finish your MapReduce query.  I'll explain how I arrived at this
interpretation:

The log message you're seeing says that many processes that
riak_pipe_vnode_worker_sup was monitoring exited abnormally.  That
supervisor only monitors Riak Pipe worker processes, the processes
that do the work for Riak 1.0's MapReduce phases.

The reason those workers gave for exiting abnormally was
'fitting_died'.  This means that the pipeline they were working for
closed before they were finished with their work.

The result you received was 'timeout'.  The way timeouts work in
Riak-Pipe-based MapReduce is that a timer triggers a message at the
given time, causing a monitoring process to cease waiting for results,
tear down the pipe, and return a timeout message to your client.

The "tear down the pipe" step in the timeout process is what causes
all of those 'fitting_died' messages you see.  They're normal, and are
intended to aid in analysis like the above.
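
To make that sequence concrete, here is a toy model in Python.  It is
not Riak's code and uses none of its APIs: run_pipe, pipe_closed, and
the worker function are names I made up for illustration, and "work" is
just a timed wait.  It only shows the shape of the behavior: a monitor
stops waiting at the deadline, tears the pipe down, and every worker
still running exits with 'fitting_died'.

```python
import threading
import time


def run_pipe(num_workers, work_seconds, timeout_seconds):
    """Toy model of a pipe with a timeout; returns (result, exit_reasons)."""
    pipe_closed = threading.Event()  # set when the monitor tears the pipe down
    exit_reasons = []
    lock = threading.Lock()

    def worker():
        # "Work" is modelled as waiting; the wait ends early if the pipe closes.
        torn_down = pipe_closed.wait(timeout=work_seconds)
        with lock:
            exit_reasons.append("fitting_died" if torn_down else "normal")

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    # The monitor waits for results only until the deadline.
    deadline = time.monotonic() + timeout_seconds
    for t in threads:
        t.join(max(0.0, deadline - time.monotonic()))

    timed_out = any(t.is_alive() for t in threads)
    if timed_out:
        pipe_closed.set()  # tearing down the pipe makes every live worker exit
    for t in threads:
        t.join()
    return ("timeout" if timed_out else "ok"), exit_reasons


if __name__ == "__main__":
    # Workers need 0.5 s but the timeout is 0.05 s, so the pipe is torn down.
    print(run_pipe(3, 0.5, 0.05))
```

One 'fitting_died' entry per unfinished worker is why a single timed-out
query can produce so many log lines.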

With that behind us, though, the question remains: why isn't 9 seconds
long enough to finish this query?  To figure that out, I'd start from
the beginning:

1. Is 9 seconds long enough to just finish the index query (using the
index API outside of MapReduce)?  If not, then the next people to jump
in with help here will want to know more about the types, sizes, and
counts of data you have indexed.

2. Assuming the bare index query finishes fast enough, is 9 seconds
long enough to get through just the index and map phase (no reduce
phases)?  If not, it's likely that either it takes longer than 9
seconds to pull every object matching your index query out of KV, or
that contention for Javascript VMs prohibits the throughput needed.

2a. Try switching to an Erlang map phase.
{"language":"erlang","module":"riak_kv_mapreduce","function":"map_object_value","arg":"filter_notfound"}
should do exactly what your Javascript function does, without
contending for a JS VM.

2b. Try increasing the number of JS VMs available for map phases.  In
your app.config, find the 'map_js_vm_count' setting, and increase it.

3. Assuming just the map phase also makes it through, is 9 seconds
long enough to get through just the index, map, and first reduce phase
(leave off the second)?  Your first reduce phase looks like it doesn't
do anything … is it needed?  Try removing it.

4. If you get all the way to the final phase before hitting the 9
second timeout, then it may be that the re-reduce behavior of Riak
KV's MapReduce causes your function to be too expensive.  This will be
especially true if you expect that phase to receive thousands of
inputs.  A sort function such as yours probably doesn't benefit from
re-reduce, so I would recommend disabling it by adding
"arg":{"reduce_phase_only_1":true} to that reduce phase's
specification.  With that in place, your function should be evaluated
only once, with all the inputs it will receive.  This may still fail
because of the time it can take to encode/decode a large set of
inputs/outputs to/from JSON, but doing it only once may be enough to
get you finished.
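
Steps 2 through 4 amount to running the same job with one more phase
each time, so a small script can generate the variants.  The sketch
below is Python and only builds the JSON bodies; it does not talk to
Riak.  The helper names (staged_queries, erlang_map_phase, reduce_once)
are mine, and the phase dictionaries in the usage section are
placeholders: the parts of your query that were elided above stay
elided here, so fill in your real inputs and function sources before
submitting anything.

```python
import copy
import json


def staged_queries(inputs, phases, timeout_ms=9000):
    """Return one MapReduce job per prefix of `phases` (1 phase, 2 phases, ...)."""
    return [
        {"inputs": inputs, "query": copy.deepcopy(phases[:n]), "timeout": timeout_ms}
        for n in range(1, len(phases) + 1)
    ]


def erlang_map_phase():
    """The Erlang map phase suggested in step 2a."""
    return {"map": {"language": "erlang",
                    "module": "riak_kv_mapreduce",
                    "function": "map_object_value",
                    "arg": "filter_notfound"}}


def reduce_once(phase):
    """Copy of a reduce phase with re-reduce disabled, as in step 4."""
    phase = copy.deepcopy(phase)
    phase["reduce"]["arg"] = {"reduce_phase_only_1": True}
    return phase


if __name__ == "__main__":
    # Placeholder inputs: the remaining index fields from the original
    # query are not shown in this thread and must be added by hand.
    inputs = {"bucket": "test", "index": "integer_int"}
    # Placeholder phases: add the rest of each reduce spec from your query.
    first_reduce = {"reduce": {"language": "javascript"}}
    final_reduce = {"reduce": {"language": "javascript"}}
    phases = [erlang_map_phase(), first_reduce, reduce_once(final_reduce)]
    for job in staged_queries(inputs, phases):
        print(json.dumps(job))
```

Submit each printed job with the client you already use and note the
first one that times out; that is the phase to look at.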

Hope that helps,
Bryan

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
