Hi Jason,

Processing all records in a large bucket will cause a lot of data to be read, possibly from disk, and can therefore be slow. Having said that, there are a number of things you can do to make your job more efficient.

The first thing is to increase the batch size for the reduce phase. This is set to 20 by default, and as the reduce phase needs to recursively process all inputs, this means a lot of iterations for 16 million records. You can specify this parameter as in the example below:

curl -v -d '{"inputs":"mybucket",
  "timeout": 86400000,
  "query":[
    {"map":{
      "language":"erlang",
      "module":"riak_kv_mapreduce",
      "function":"map_identity"}
    },
    {"reduce":{
      "language":"erlang",
      "module":"riak_kv_mapreduce",
      "function":"reduce_count_inputs",
      "arg":{"reduce_phase_batch_size":1000}}
    }
  ]}' -H "Content-Type: application/json" http://riak01:8098/mapred

Normally the reduce phase function is run only on the coordinating node, which means that all data needs to be transferred there before it can be processed. It is however possible to enable the first iteration of the reduce phase to run on the node where the data is located. This can be done either through an argument sent to the map phase preceding the reduce phase, or by adding {mapred_always_prereduce, true} to the riak_kv section of the app.config file (on all nodes in the cluster).

I hope this helps improve performance.

Best regards,

Christian

[0] http://docs.basho.com/riak/latest/ops/advanced/configs/mapreduce/

On 18 Nov 2013, at 20:42, Jason Strutz <ja...@cumuluscode.com> wrote:

> I have a riak bucket which contains roughly 16 million records. I'm trying to
> run a simple count over all the keys in the bucket:
>
> curl -v -d '{"inputs":"mybucket",
> "timeout": 86400000,
> "query":[
> {"map":{
> "language":"erlang",
> "module":"riak_kv_mapreduce",
> "function":"map_identity"}
> },
> {"reduce":{
> "language":"erlang",
> "module":"riak_kv_mapreduce",
> "function":"reduce_count_inputs"}
> }
> ]}' -H "Content-Type: application/json" http://riak01:8098/mapred
>
> However, I receive the following error, after a few minutes of spinning:
> {"phase":0,"error":"[{vnode_proxy_timeout,{2... <<truncated for clarity, rest
> at https://gist.github.com/jstrutz/b89efb6de825255135fe >>
>
> I realize mapping over all the keys in a bucket is not ideal, but I would
> like to be able to do so in a pinch, even if it means tweaking timeouts and
> such. I'm running riak 1.4.2.
>
> Thanks,
> -Jason
>
> _______________________________________________
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
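
A note on the batch size point in the reply: to get a feel for why the default of 20 hurts with 16 million records, here is a back-of-the-envelope sketch in Python. It assumes a simplified model in which the reduce function is invoked roughly once per batch of inputs. That model and the function name estimated_reduce_calls are illustrative assumptions, not Riak's actual scheduling, so treat the numbers as an order-of-magnitude guide only.

```python
def estimated_reduce_calls(total_inputs: int, batch_size: int) -> int:
    """Rough count of reduce invocations for a given batch size.

    Assumption (simplified model, not taken from Riak's source): the
    reduce function runs once per batch of `batch_size` inputs, so the
    estimate is ceil(total_inputs / batch_size).
    """
    if total_inputs < 0:
        raise ValueError("total_inputs must be non-negative")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Integer ceiling division, avoids floating point rounding.
    return (total_inputs + batch_size - 1) // batch_size


if __name__ == "__main__":
    records = 16_000_000  # bucket size mentioned in the thread
    for batch in (20, 1000):  # default vs. the value used in the example
        calls = estimated_reduce_calls(records, batch)
        print(f"batch size {batch}: about {calls} reduce invocations")
```

Under this model, going from a batch size of 20 to 1000 cuts the number of reduce invocations by a factor of 50. Larger batches also mean more data held in memory per invocation, so it is probably worth trying a few values rather than jumping straight to a very large one.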