Responses inline. Hopefully they shed some light on the subject. --- Jeremiah Peschka - Founder, Brent Ozar Unlimited MCITP: SQL Server 2008, MVP Cloudera Certified Developer for Apache Hadoop
On Fri, Jul 19, 2013 at 5:07 PM, Xiaoming Gao <mko...@gmail.com> wrote: > Hi everyone, > > I am trying to learn about Riak MapReduce and comparing it with Hadoop > MapReduce, and there are some details that I am interested in but not > covered in the online documents. So hopefully we can get some help here > about the following questions? Thanks in advance! > They're not at all similar. Hadoop MR is optimized for sequential data processing in large batches. Riak MR works better when you think of it like a multi-processing engine - you can perform work across a matching set of items and that work will be distributed across the cluster during map phases. Take a look at this thread for a bit of discussion about when you should use Riak MapReduce: http://markmail.org/message/qpoilvmm635inb5v Or, if you want to, you can run a Riak MR job across an entire bucket, which really is like scanning every table in an RDBMS while looking for rows from a single table. MR jobs run with an R of 1. So, at least there's that. > 1. For a given MapReduce request (or to say, job), how does Riak decide how > many mappers to use for the job? For example, if I have 8 nodes and my data > are distributed across all nodes with an "N" value of 2, will I have 4 > mappers running on 4 nodes concurrently? Is it possible to have multiple > mappers (e.g., 4 or even 6) for the same MR job running on each node (for > better processing speed)? > To the best of my recollection, this will be based on either: 1) If you're using JavaScript MR jobs, the number of mappers and reducers is controlled by the the map_js_vm_count and reduce_js_vm_count settings from each node's app.config file. 2) If you're using Erlang: magic. This will be handled by the Erlang VM and is based on number of processors and your overall Erlang VM configuration. > > 2. If I run a MapReduce job over the results of a Riak Search query, how > does Riak schedule the mappers based on the search results? > Riak Search uses document-based indices - search will query every node in the cluster. Map phases happen and then results are then streamed to the reducer. > > 3. How does Riak handle intermediate data generated by mappers? > Specifically: > (1) In Hadoop MapReduce, the output of mappers are <key, value> pairs, and > the output from all mappers are first grouped based on keys, and then > handed > over to the reducer. Does Riak do similar grouping of intermediate data? > The only reason for the intermediate grouping/scratch work in Hadoop MR jobs is to deal with multiple reducers. Although, I'm not entirely sure how this works in Riak, my suspicion is that data is streamed across the wire after the data is read from disk. > > (2) How are mapper outputs transmitted to the reducer? Does Riak use local > disks on the mapper nodes or reducer nodes to store the intermediate data > temporarily? Since large MR jobs can cause out of memory errors, you can bet good money that the answer is "no". > > 4. According to the document > http://docs.basho.com/riak/latest/dev/advanced/mapreduce/#How-Phases-Work, > each MR job only schedules one reducer, which runs on the coordinate node. > Is there any way to configure a MR job to use multiple reducers? > Using Riak MR, there's no way to create a job that runs reducers on multiple nodes. You can have multiple reducer processes on a single node, but not reducers on multiple nodes. > > Best regards, > Xiaoming > > > > -- > View this message in context: > http://riak-users.197444.n3.nabble.com/Comparing-Riak-MapReduce-and-Hadoop-MapReduce-tp4028454.html > Sent from the Riak Users mailing list archive at Nabble.com. > > _______________________________________________ > riak-users mailing list > riak-users@lists.basho.com > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com >
_______________________________________________ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com