Re: vnode_proxy_timeout during mapreduce

Christian Dahlqvist Mon, 18 Nov 2013 13:13:23 -0800

Hi Jason,

Processing all records in a large bucket will cause a lot of data to be read, 
possibly from disk, and can therefore be slow. Having said that, there are a 
number of things you can do to make your job more efficient.

The first thing is to increase the batch size for the reduce phase. The 
default is 20, and because the reduce function is applied repeatedly until 
all inputs have been processed, this means a lot of iterations for 16 million 
records. You can specify this parameter as in the example below:

curl -v -d '{"inputs":"mybucket",
            "timeout": 86400000,
            "query":[
              {"map":{
                "language":"erlang",
                "module":"riak_kv_mapreduce",
                "function":"map_identity"}
              },
              {"reduce":{
                "language":"erlang",
                "module":"riak_kv_mapreduce",
                "function":"reduce_count_inputs",
                "arg":{"reduce_phase_batch_size":1000}}
              }
            ]}' -H "Content-Type: application/json" http://riak01:8098/mapred

Normally the reduce phase function runs only on the coordinating node, which 
means that all data must be transferred there before it can be processed. It 
is, however, possible to run the first iteration of the reduce phase on the 
node where the data is located. This can be enabled either through an argument 
sent to the map phase preceding the reduce phase, or by adding 
{mapred_always_prereduce, true} to the riak_kv section of the app.config file 
on every node in the cluster [0].
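
For the per-query option, a minimal sketch of the same count job with 
prereduce requested on the map phase might look like the following. I am 
assuming here that the map phase argument is named "do_prereduce"; please 
check that name against the documentation for your Riak version. The host, 
bucket and batch size are simply carried over from the example above, and the 
count_bucket helper is a hypothetical name used only for illustration.

```shell
#!/bin/sh
# Endpoint copied from the example above; adjust to one of your own nodes.
RIAK_MAPRED_URL="http://riak01:8098/mapred"

# Same count job as above, with an argument added to the map phase.
# ASSUMPTION: "do_prereduce" is the map-phase argument name; verify it
# against the documentation for your Riak version.
QUERY='{"inputs":"mybucket",
        "timeout": 86400000,
        "query":[
          {"map":{
            "language":"erlang",
            "module":"riak_kv_mapreduce",
            "function":"map_identity",
            "arg":{"do_prereduce":true}}
          },
          {"reduce":{
            "language":"erlang",
            "module":"riak_kv_mapreduce",
            "function":"reduce_count_inputs",
            "arg":{"reduce_phase_batch_size":1000}}
          }
        ]}'

# Hypothetical helper: wrapped in a function so the job is only
# submitted when you call it explicitly.
count_bucket() {
  curl -v -d "$QUERY" -H "Content-Type: application/json" "$RIAK_MAPRED_URL"
}
```

Calling count_bucket submits the job. If this behaves as I expect, each node 
reduces its own map output once before sending it on, so far less data should 
need to reach the coordinating node.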

I hope this helps improve performance.

Best regards,

Christian

[0] http://docs.basho.com/riak/latest/ops/advanced/configs/mapreduce/

On 18 Nov 2013, at 20:42, Jason Strutz <ja...@cumuluscode.com> wrote:

> I have a riak bucket which contains roughly 16 million records. I'm trying to 
> run a simple count over all the keys in the bucket:
> 
> curl -v -d '{"inputs":"mybucket",
>             "timeout": 86400000,
>             "query":[
>               {"map":{
>                 "language":"erlang",
>                 "module":"riak_kv_mapreduce",
>                 "function":"map_identity"}
>               },
>               {"reduce":{
>                  "language":"erlang",
>                   "module":"riak_kv_mapreduce",
>                   "function":"reduce_count_inputs"}
>               }
>             ]}' -H "Content-Type: application/json" http://riak01:8098/mapred
> 
> However, I receive the following error, after a few minutes of spinning:
> {"phase":0,"error":"[{vnode_proxy_timeout,{2... <<truncated for clarity, rest 
> at https://gist.github.com/jstrutz/b89efb6de825255135fe >>
> 
> I realize mapping over all the keys in a bucket is not ideal, but I would 
> like to be able to do so in a pinch, even if it means tweaking timeouts and 
> such.  I'm running riak 1.4.2.
> 
> Thanks,
> -Jason
> 
> _______________________________________________
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
