Re: Comparing Riak MapReduce and Hadoop MapReduce

Jeremiah Peschka Sun, 21 Jul 2013 22:33:06 -0700

Ah, yeah, I'm mistaken about search partitioning. The docs are correct.

I have no idea how the scheduling works.


If I had to guess, I would guess that it is a streaming operation.

--
Jeremiah Peschka - Founder, Brent Ozar Unlimited
MCITP: SQL Server 2008, MVP
Cloudera Certified Developer for Apache Hadoop

On Jul 21, 2013, at 10:08 PM, Xiaoming Gao <mko...@gmail.com> wrote:

> Thanks a lot, Jeremiah! Your answers really help clarify the issues. 
> 
> Just one more question, by "document-based indices", do you mean 
> document-based partitioning for the indices? Because what I found in the 
> online document 
> http://docs.basho.com/riak/latest/dev/advanced/search/#Search-KV-and-MapReduce
>  is "Search uses term-based partitioning – also known as a global index." I 
> am not sure if the implementation has changed for the latest version of Riak, 
> but if term-based partitioning is used, does that mean Riak will only 
> schedule the mappers after the whole list of <bucket, key> pair is returned 
> from the index?
> 
> Thanks,
> Xiaoming
> 
> 
> On Sun, Jul 21, 2013 at 11:20 PM, Jeremiah Peschka [via Riak Users] <[hidden 
> email]> wrote:
>> Responses inline. Hopefully they shed some light on the subject.
>> 
>> ---
>> Jeremiah Peschka - Founder, Brent Ozar Unlimited
>> MCITP: SQL Server 2008, MVP
>> Cloudera Certified Developer for Apache Hadoop
>> 
>> 
>> On Fri, Jul 19, 2013 at 5:07 PM, Xiaoming Gao <[hidden email]> wrote:
>>> Hi everyone,
>>> 
>>> I am trying to learn about Riak MapReduce and comparing it with Hadoop
>>> MapReduce, and there are some details that I am interested in but not
>>> covered in the online documents. So hopefully we can get some help here
>>> about the following questions? Thanks in advance!
>> 
>> They're not at all similar. Hadoop MR is optimized for sequential data 
>> processing in large batches. Riak MR works better when you think of it like 
>> a multi-processing engine - you can perform work across a matching set of 
>> items and that work will be distributed across the cluster during map 
>> phases. 
>> 
>> Take a look at this thread for a bit of discussion about when you should use 
>> Riak MapReduce: http://markmail.org/message/qpoilvmm635inb5v
>> 
>> Or, if you want to, you can run a Riak MR job across an entire bucket, which 
>> really is like scanning every table in an RDBMS while looking for rows from 
>> a single table. MR jobs run with an R of 1. So, at least there's that.
>> 
>>> 
>>> 1. For a given MapReduce request (or to say, job), how does Riak decide how
>>> many mappers to use for the job? For example, if I have 8 nodes and my data
>>> are distributed across all nodes with an "N" value of 2, will I have 4
>>> mappers running on 4 nodes concurrently? Is it possible to have multiple
>>> mappers (e.g., 4 or even 6) for the same MR job running on each node (for
>>> better processing speed)?
>> 
>> To the best of my recollection, this will be based on either:
>>  
>> 1) If you're using JavaScript MR jobs, the number of mappers and reducers is 
>> controlled by the the map_js_vm_count and reduce_js_vm_count settings from 
>> each node's app.config file.
>> 2) If you're using Erlang: magic. This will be handled by the Erlang VM and 
>> is based on number of processors and your overall Erlang VM configuration.
>>  
>>> 
>>> 2. If I run a MapReduce job over the results of a Riak Search query, how
>>> does Riak schedule the mappers based on the search results?
>> 
>> Riak Search uses document-based indices - search will query every node in 
>> the cluster. Map phases happen and then results are then streamed to the 
>> reducer.
>>  
>>> 
>>> 3. How does Riak handle intermediate data generated by mappers?
>>> Specifically:
>>> (1) In Hadoop MapReduce, the output of mappers are <key, value> pairs, and
>>> the output from all mappers are first grouped based on keys, and then handed
>>> over to the reducer. Does Riak do similar grouping of intermediate data?
>> 
>> The only reason for the intermediate grouping/scratch work in Hadoop MR jobs 
>> is to deal with multiple reducers. Although, I'm not entirely sure how this 
>> works in Riak, my suspicion is that data is streamed across the wire after 
>> the data is read from disk. 
>>  
>>> 
>>> (2) How are mapper outputs transmitted to the reducer? Does Riak use local
>>> disks on the mapper nodes or reducer nodes to store the intermediate data
>>> temporarily?
>> 
>> Since large MR jobs can cause out of memory errors, you can bet good money 
>> that the answer is "no".
>>  
>>> 
>>> 4. According to the document
>>> http://docs.basho.com/riak/latest/dev/advanced/mapreduce/#How-Phases-Work ,
>>> each MR job only schedules one reducer, which runs on the coordinate node.
>>> Is there any way to configure a MR job to use multiple reducers?
>> 
>> Using Riak MR, there's no way to create a job that runs reducers on multiple 
>> nodes. You can have multiple reducer processes on a single node, but not 
>> reducers on multiple nodes.
>>  
>>> 
>>> Best regards,
>>> Xiaoming
>>> 
>>> 
>>> 
>>> --
>>> View this message in context: 
>>> http://riak-users.197444.n3.nabble.com/Comparing-Riak-MapReduce-and-Hadoop-MapReduce-tp4028454.html
>>> Sent from the Riak Users mailing list archive at Nabble.com.
>>> 
>>> 
>>> _______________________________________________
>>> riak-users mailing list
>>> [hidden email]
>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>> 
>> 
>> _______________________________________________ 
>> riak-users mailing list 
>> [hidden email] 
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>> 
>> 
>> If you reply to this email, your message will be added to the discussion 
>> below:
>> http://riak-users.197444.n3.nabble.com/Comparing-Riak-MapReduce-and-Hadoop-MapReduce-tp4028454p4028474.html
>> To unsubscribe from Comparing Riak MapReduce and Hadoop MapReduce, click 
>> here.
>> NAML
> 
> 
> View this message in context: Re: Comparing Riak MapReduce and Hadoop 
> MapReduce
> Sent from the Riak Users mailing list archive at Nabble.com.
> _______________________________________________
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Re: Comparing Riak MapReduce and Hadoop MapReduce

Reply via email to