Ah, yeah, I'm mistaken about search partitioning. The docs are correct. I have no idea how the scheduling works.
If I had to guess, I would guess that it is a streaming operation. -- Jeremiah Peschka - Founder, Brent Ozar Unlimited MCITP: SQL Server 2008, MVP Cloudera Certified Developer for Apache Hadoop On Jul 21, 2013, at 10:08 PM, Xiaoming Gao <mko...@gmail.com> wrote: > Thanks a lot, Jeremiah! Your answers really help clarify the issues. > > Just one more question, by "document-based indices", do you mean > document-based partitioning for the indices? Because what I found in the > online document > http://docs.basho.com/riak/latest/dev/advanced/search/#Search-KV-and-MapReduce > is "Search uses term-based partitioning – also known as a global index." I > am not sure if the implementation has changed for the latest version of Riak, > but if term-based partitioning is used, does that mean Riak will only > schedule the mappers after the whole list of <bucket, key> pair is returned > from the index? > > Thanks, > Xiaoming > > > On Sun, Jul 21, 2013 at 11:20 PM, Jeremiah Peschka [via Riak Users] <[hidden > email]> wrote: >> Responses inline. Hopefully they shed some light on the subject. >> >> --- >> Jeremiah Peschka - Founder, Brent Ozar Unlimited >> MCITP: SQL Server 2008, MVP >> Cloudera Certified Developer for Apache Hadoop >> >> >> On Fri, Jul 19, 2013 at 5:07 PM, Xiaoming Gao <[hidden email]> wrote: >>> Hi everyone, >>> >>> I am trying to learn about Riak MapReduce and comparing it with Hadoop >>> MapReduce, and there are some details that I am interested in but not >>> covered in the online documents. So hopefully we can get some help here >>> about the following questions? Thanks in advance! >> >> They're not at all similar. Hadoop MR is optimized for sequential data >> processing in large batches. Riak MR works better when you think of it like >> a multi-processing engine - you can perform work across a matching set of >> items and that work will be distributed across the cluster during map >> phases. >> >> Take a look at this thread for a bit of discussion about when you should use >> Riak MapReduce: http://markmail.org/message/qpoilvmm635inb5v >> >> Or, if you want to, you can run a Riak MR job across an entire bucket, which >> really is like scanning every table in an RDBMS while looking for rows from >> a single table. MR jobs run with an R of 1. So, at least there's that. >> >>> >>> 1. For a given MapReduce request (or to say, job), how does Riak decide how >>> many mappers to use for the job? For example, if I have 8 nodes and my data >>> are distributed across all nodes with an "N" value of 2, will I have 4 >>> mappers running on 4 nodes concurrently? Is it possible to have multiple >>> mappers (e.g., 4 or even 6) for the same MR job running on each node (for >>> better processing speed)? >> >> To the best of my recollection, this will be based on either: >> >> 1) If you're using JavaScript MR jobs, the number of mappers and reducers is >> controlled by the the map_js_vm_count and reduce_js_vm_count settings from >> each node's app.config file. >> 2) If you're using Erlang: magic. This will be handled by the Erlang VM and >> is based on number of processors and your overall Erlang VM configuration. >> >>> >>> 2. If I run a MapReduce job over the results of a Riak Search query, how >>> does Riak schedule the mappers based on the search results? >> >> Riak Search uses document-based indices - search will query every node in >> the cluster. Map phases happen and then results are then streamed to the >> reducer. >> >>> >>> 3. How does Riak handle intermediate data generated by mappers? >>> Specifically: >>> (1) In Hadoop MapReduce, the output of mappers are <key, value> pairs, and >>> the output from all mappers are first grouped based on keys, and then handed >>> over to the reducer. Does Riak do similar grouping of intermediate data? >> >> The only reason for the intermediate grouping/scratch work in Hadoop MR jobs >> is to deal with multiple reducers. Although, I'm not entirely sure how this >> works in Riak, my suspicion is that data is streamed across the wire after >> the data is read from disk. >> >>> >>> (2) How are mapper outputs transmitted to the reducer? Does Riak use local >>> disks on the mapper nodes or reducer nodes to store the intermediate data >>> temporarily? >> >> Since large MR jobs can cause out of memory errors, you can bet good money >> that the answer is "no". >> >>> >>> 4. According to the document >>> http://docs.basho.com/riak/latest/dev/advanced/mapreduce/#How-Phases-Work , >>> each MR job only schedules one reducer, which runs on the coordinate node. >>> Is there any way to configure a MR job to use multiple reducers? >> >> Using Riak MR, there's no way to create a job that runs reducers on multiple >> nodes. You can have multiple reducer processes on a single node, but not >> reducers on multiple nodes. >> >>> >>> Best regards, >>> Xiaoming >>> >>> >>> >>> -- >>> View this message in context: >>> http://riak-users.197444.n3.nabble.com/Comparing-Riak-MapReduce-and-Hadoop-MapReduce-tp4028454.html >>> Sent from the Riak Users mailing list archive at Nabble.com. >>> >>> >>> _______________________________________________ >>> riak-users mailing list >>> [hidden email] >>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com >> >> >> _______________________________________________ >> riak-users mailing list >> [hidden email] >> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com >> >> >> If you reply to this email, your message will be added to the discussion >> below: >> http://riak-users.197444.n3.nabble.com/Comparing-Riak-MapReduce-and-Hadoop-MapReduce-tp4028454p4028474.html >> To unsubscribe from Comparing Riak MapReduce and Hadoop MapReduce, click >> here. >> NAML > > > View this message in context: Re: Comparing Riak MapReduce and Hadoop > MapReduce > Sent from the Riak Users mailing list archive at Nabble.com. > _______________________________________________ > riak-users mailing list > riak-users@lists.basho.com > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
_______________________________________________ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com