Hello,
I went through the Riak Fast Track and overall I'm excited about the 
possibilities of Riak. However, I did see one thing that concerns me in regard 
to MapReduce jobs.  From the Riak docs and this page 
here: http://seancribbs.com/tech/2010/02/06/why-riak-should-power-your-next-rails-app/
 it seems to suggest that while Map jobs are performed in parallel Reduce jobs 
are not. 
Wouldn't be better if each node performed the reduce function on it's own set 
of data before sending it to the main coordinator? Imagine if you are mapping 
over a high number of objects (millions) and are trying to aggregate the data 
grouped by some demographic (such as state). If all the mappers send the data 
to coordinating node to be reduced then you are going to send millions of items 
from each data node all to one coordinating node. If instead each node called 
reduce first, then sent it to the coordinating node, then each node would be at 
most sending over 50 objects (for 50 states). Not only would this greatly 
reduce network traffic, it would be able to execute in parallel greatly 
improving performance. Since all reduce jobs are communicative, associative and 
idempotent I see no reason why this couldn't happen (of course I'm not familiar 
with the Riak internals to strongly assert this).
Maybe I misunderstood and this is already happening. If it's not happening, why 
and is that on the roadmap to be added later?
-John                                     
_________________________________________________________________
Hotmail is redefining busy with tools for the New Busy. Get more from your 
inbox.
http://www.windowslive.com/campaign/thenewbusy?ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_2
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Reply via email to