Hello,
I went through the Riak Fast Track and overall I'm excited about the
possibilities of Riak. However, I did see one thing that concerns me in regard
to MapReduce jobs. From the Riak docs and this page
here: http://seancribbs.com/tech/2010/02/06/why-riak-should-power-your-next-rails-app/
it seems to suggest that while Map jobs are performed in parallel Reduce jobs
are not.
Wouldn't be better if each node performed the reduce function on it's own set
of data before sending it to the main coordinator? Imagine if you are mapping
over a high number of objects (millions) and are trying to aggregate the data
grouped by some demographic (such as state). If all the mappers send the data
to coordinating node to be reduced then you are going to send millions of items
from each data node all to one coordinating node. If instead each node called
reduce first, then sent it to the coordinating node, then each node would be at
most sending over 50 objects (for 50 states). Not only would this greatly
reduce network traffic, it would be able to execute in parallel greatly
improving performance. Since all reduce jobs are communicative, associative and
idempotent I see no reason why this couldn't happen (of course I'm not familiar
with the Riak internals to strongly assert this).
Maybe I misunderstood and this is already happening. If it's not happening, why
and is that on the roadmap to be added later?
-John
_________________________________________________________________
Hotmail is redefining busy with tools for the New Busy. Get more from your
inbox.
http://www.windowslive.com/campaign/thenewbusy?ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_2
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com