Hi OJ,

The do_prereduce parameter lets the first iteration of the reduce phase run on 
the node where the preceding map phase generated its output. As in the example 
I provided, this can reduce the amount of data that needs to be sent across 
the network. This is described in greater detail here: 
http://docs.basho.com/riak/latest/references/appendices/MapReduce-Implementation/

Since it can also be enabled by default in app.config, it should be fine to 
always specify it for reduce phases preceded by a map phase. 
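
To illustrate the "reduce phase preceded by a map phase" case, here is a 
minimal sketch of a job with an explicit map phase in front of the reduce. 
The build_job helper is hypothetical, and the choice of map_identity from 
riak_kv_mapreduce as the map function is my assumption, not something taken 
from the earlier example; the bucket name, index, and reduce function are the 
same as before.

```shell
# Hypothetical helper: prints a MapReduce job (JSON) for the given bucket.
# Assumption: map_identity is used as the map phase so that the reduce
# phase is directly preceded by a map phase and do_prereduce can apply.
build_job() {
  bucket="$1"
  cat <<EOF
{"inputs":{
     "bucket":"$bucket",
     "index":"\$bucket",
     "key":"$bucket"
 },
 "query":[{"map":{"language":"erlang",
                  "module":"riak_kv_mapreduce",
                  "function":"map_identity"}},
          {"reduce":{"language":"erlang",
                     "module":"riak_kv_mapreduce",
                     "function":"reduce_count_inputs",
                     "arg":{"do_prereduce":true}}}]}
EOF
}

# Print the job for the 'goog' bucket so it can be inspected first.
build_job goog

# To submit it (assumes a Riak node listening on localhost:8098):
# build_job goog | curl -XPOST http://localhost:8098/mapred \
#   -H 'Content-Type: application/json' -d @-
```

Printing the job before posting it makes it easy to check the JSON by eye; 
the curl line is left commented out since it needs a running node.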

Best regards,

Christian


On 14 Feb 2013, at 12:21, OJ Reeves <o...@buffered.io> wrote:

> Chris,
> 
> I've never heard of do_prereduce before. What kind of effect does this have? 
> That is, if someone were to use it all the time, regardless of the amount of 
> data being returned, would this be a bad thing?
> 
> Thanks.
> OJ
> 
> On Thu, Feb 14, 2013 at 6:19 PM, Christian Dahlqvist <christ...@basho.com> 
> wrote:
> Hi,
> 
> For buckets with a significant number of records, it makes a lot of sense to 
> run the example I provided with 'do_prereduce' enabled as it will result in 
> considerably less data being sent between the nodes. This can be enabled as 
> follows:
> 
> curl -XPOST http://localhost:8098/mapred \
>   -H 'Content-Type: application/json' \
>   -d '{"inputs":{
>            "bucket":"goog",
>            "index":"$bucket",
>            "key":"goog"
>        },
>        "query":[{"reduce":{"language":"erlang",
>                            "module":"riak_kv_mapreduce",
>                            "function":"reduce_count_inputs", 
>                            "arg":{"do_prereduce":true}}}]}'
> 
> Best regards,
> 
> Christian
> 
> 
> On 14 Feb 2013, at 08:01, Christian Dahlqvist <christ...@basho.com> wrote:
> 
>> Hi Jeremiah,
>> 
>> Indeed, it does not seem to be documented on the main docs site, and I will 
>> try to correct this. The only place I have found it described is on the wiki 
>> for the Ruby client 
>> (https://github.com/basho/riak-ruby-client/wiki/Secondary-Indexes).
>>  
>> Below is also an example of a simple mapreduce job that shows how to count 
>> the number of records in the 'goog' bucket based on the $bucket secondary 
>> index:
>> 
>> curl -XPOST http://localhost:8098/mapred \
>>   -H 'Content-Type: application/json' \
>>   -d '{"inputs":{
>>            "bucket":"goog",
>>            "index":"$bucket",
>>            "key":"goog"
>>        },
>>        "query":[{"reduce":{"language":"erlang",
>>                            "module":"riak_kv_mapreduce",
>>                            "function":"reduce_count_inputs"}}]}'
>> 
>> I hope this helps.
>> 
>> Best regards,
>> 
>> Christian
>> 
>> 
>> On 13 Feb 2013, at 18:12, Jeremiah Peschka <jeremiah.pesc...@gmail.com> 
>> wrote:
>> 
>>> Is this documented anywhere on the docs.basho.com site? 
>>> 
>>> Searching for $bucket produces search results just for "bucket" and Google 
>>> says "No results found for site:docs.basho.com $bucket."
>>> 
>>> ---
>>> Jeremiah Peschka - Founder, Brent Ozar Unlimited
>>> MCITP: SQL Server 2008, MVP
>>> Cloudera Certified Developer for Apache Hadoop
>>> 
>>> 
>>> On Wed, Feb 13, 2013 at 10:08 AM, Christian Dahlqvist <christ...@basho.com> 
>>> wrote:
>>> Hi,
>>> 
>>> In addition to the $key index, there is also a $bucket index available by 
>>> default. This contains the name of the bucket, and can be used to get all 
>>> keys in a specific bucket.
>>> 
>>> Best regards,
>>> 
>>> Christian
>>> 
>> 
> 
> 
> _______________________________________________
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> 
> 
> 
> 
> -- 
> 
> OJ Reeves
> +61 431 952 586
> http://buffered.io/

