Oh, I almost forgot, you can also supply the do_prereduce argument to your reduce phase - this performs a pre-reduce phase on the mapper. This can, depending on the workload, significantly decrease the network overhead between the mappers and the reducer.
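To illustrate why a pre-reduce step can cut traffic to the reducer, here is a toy simulation in Python. It is not Riak code and calls no Riak API: the function names (map_phase, reduce_counts, run_job) and the sample data are made up for illustration, and it assumes a reduce function that can safely be applied to its own output, as summing counts can.

```python
from collections import Counter


def map_phase(records):
    """Emit one (word, 1) pair per word, as a word-count mapper would."""
    return [(word, 1) for record in records for word in record.split()]


def reduce_counts(pairs):
    """Sum counts per key; safe to apply to its own output."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return sorted(totals.items())


def run_job(nodes, prereduce):
    """Return (final result, number of pairs sent to the reducer node)."""
    sent = []
    for records in nodes:
        pairs = map_phase(records)
        if prereduce:
            # Hypothetical local pre-reduce on the node that ran the mapper.
            pairs = reduce_counts(pairs)
        sent.extend(pairs)
    # Single final reduce, standing in for the coordinating node.
    return reduce_counts(sent), len(sent)


# Two simulated nodes, each holding a few records.
nodes = [
    ["a b a", "a c"],
    ["b b b", "a"],
]

plain_result, plain_sent = run_job(nodes, prereduce=False)
pre_result, pre_sent = run_job(nodes, prereduce=True)

print("without pre-reduce:", plain_sent, "pairs sent")
print("with pre-reduce:   ", pre_sent, "pairs sent")
```

Both runs give the same final counts, but the pre-reduced run ships fewer pairs across the (simulated) network. How much you save on a real cluster depends on how many duplicate keys each node produces.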
---
Jeremiah Peschka - Founder, Brent Ozar Unlimited
MCITP: SQL Server 2008, MVP
Cloudera Certified Developer for Apache Hadoop


On Mon, Jul 22, 2013 at 10:21 AM, Jeremiah Peschka <jeremiah.pesc...@gmail.com> wrote:

> For JavaScript the number of reducers is configured in the app.config file
> on each node with the reduce_js_vm_count property.
>
> On Mon, Jul 22, 2013 at 8:07 AM, Xiaoming Gao <mko...@gmail.com> wrote:
>
>> Thanks for the clarification, Jeremiah!
>>
>> One last question: how should I configure the MR job to have multiple
>> reducer processes on a single node?
>>
>> Regards,
>> Xiaoming
>>
>> On Mon, Jul 22, 2013 at 1:33 AM, Jeremiah Peschka [via Riak Users]
>> <[hidden email]> wrote:
>>
>>> Ah, yeah, I'm mistaken about search partitioning. The docs are correct.
>>>
>>> I have no idea how the scheduling works.
>>>
>>> If I had to guess, I would guess that it is a streaming operation.
>>>
>>> On Jul 21, 2013, at 10:08 PM, Xiaoming Gao <[hidden email]> wrote:
>>>
>>> Thanks a lot, Jeremiah! Your answers really help clarify the issues.
>>>
>>> Just one more question: by "document-based indices", do you mean
>>> document-based partitioning for the indices? Because what I found in the
>>> online document
>>> http://docs.basho.com/riak/latest/dev/advanced/search/#Search-KV-and-MapReduce
>>> is "Search uses term-based partitioning – also known as a global index."
>>> I am not sure if the implementation has changed for the latest version of
>>> Riak, but if term-based partitioning is used, does that mean Riak will
>>> only schedule the mappers after the whole list of <bucket, key> pairs is
>>> returned from the index?
>>>
>>> Thanks,
>>> Xiaoming
>>>
>>> On Sun, Jul 21, 2013 at 11:20 PM, Jeremiah Peschka [via Riak Users]
>>> <[hidden email]> wrote:
>>>
>>>> Responses inline. Hopefully they shed some light on the subject.
>>>>
>>>> On Fri, Jul 19, 2013 at 5:07 PM, Xiaoming Gao <[hidden email]> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I am trying to learn about Riak MapReduce and compare it with Hadoop
>>>>> MapReduce, and there are some details that I am interested in but that
>>>>> are not covered in the online documents. So hopefully we can get some
>>>>> help here with the following questions. Thanks in advance!
>>>>
>>>> They're not at all similar. Hadoop MR is optimized for sequential data
>>>> processing in large batches. Riak MR works better when you think of it
>>>> like a multi-processing engine - you can perform work across a matching
>>>> set of items and that work will be distributed across the cluster during
>>>> map phases.
>>>>
>>>> Take a look at this thread for a bit of discussion about when you
>>>> should use Riak MapReduce: http://markmail.org/message/qpoilvmm635inb5v
>>>>
>>>> Or, if you want to, you can run a Riak MR job across an entire bucket,
>>>> which really is like scanning every table in an RDBMS while looking for
>>>> rows from a single table. MR jobs run with an R of 1. So, at least
>>>> there's that.
>>>>
>>>>> 1. For a given MapReduce request (that is, a job), how does Riak decide
>>>>> how many mappers to use for the job? For example, if I have 8 nodes and
>>>>> my data are distributed across all nodes with an "N" value of 2, will I
>>>>> have 4 mappers running on 4 nodes concurrently? Is it possible to have
>>>>> multiple mappers (e.g., 4 or even 6) for the same MR job running on
>>>>> each node (for better processing speed)?
>>>>
>>>> To the best of my recollection, this will be based on either:
>>>>
>>>> 1) If you're using JavaScript MR jobs, the number of mappers and
>>>> reducers is controlled by the map_js_vm_count and reduce_js_vm_count
>>>> settings from each node's app.config file.
>>>> 2) If you're using Erlang: magic. This will be handled by the Erlang VM
>>>> and is based on number of processors and your overall Erlang VM
>>>> configuration.
>>>>
>>>>> 2. If I run a MapReduce job over the results of a Riak Search query,
>>>>> how does Riak schedule the mappers based on the search results?
>>>>
>>>> Riak Search uses document-based indices - search will query every node
>>>> in the cluster. Map phases happen and the results are then streamed to
>>>> the reducer.
>>>>
>>>>> 3. How does Riak handle intermediate data generated by mappers?
>>>>> Specifically:
>>>>> (1) In Hadoop MapReduce, the output of mappers is <key, value> pairs,
>>>>> and the output from all mappers is first grouped based on keys, and
>>>>> then handed over to the reducer. Does Riak do similar grouping of
>>>>> intermediate data?
>>>>
>>>> The only reason for the intermediate grouping/scratch work in Hadoop MR
>>>> jobs is to deal with multiple reducers. Although I'm not entirely sure
>>>> how this works in Riak, my suspicion is that data is streamed across the
>>>> wire after the data is read from disk.
>>>>
>>>>> (2) How are mapper outputs transmitted to the reducer? Does Riak use
>>>>> local disks on the mapper nodes or reducer nodes to store the
>>>>> intermediate data temporarily?
>>>>
>>>> Since large MR jobs can cause out of memory errors, you can bet good
>>>> money that the answer is "no".
>>>>
>>>>> 4. According to the document
>>>>> http://docs.basho.com/riak/latest/dev/advanced/mapreduce/#How-Phases-Work,
>>>>> each MR job only schedules one reducer, which runs on the coordinating
>>>>> node. Is there any way to configure an MR job to use multiple reducers?
>>>>
>>>> Using Riak MR, there's no way to create a job that runs reducers on
>>>> multiple nodes. You can have multiple reducer processes on a single
>>>> node, but not reducers on multiple nodes.
>>>>
>>>>> Best regards,
>>>>> Xiaoming

_______________________________________________ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com