Re: Optimizing reduce for 'huge' aggregated outputs.

2014-06-09 Thread DB Tsai
Hi Nick, How does reduce work? I thought after reducing in the executor, it will reduce in parallel between multiple executors instead of pulling everything to driver and reducing there. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com Li

Re: Optimizing reduce for 'huge' aggregated outputs.

2014-06-09 Thread Nick Pentreath
Can you key your RDD by some key and use reduceByKey? In fact if you are merging bunch of maps you can create a set of (k, v) in your mapPartitions and then reduceByKey using some merge function. The reduce will happen in parallel on multiple nodes in this case. You'll end up with just a single

Re: Optimizing reduce for 'huge' aggregated outputs.

2014-06-09 Thread Sung Hwan Chung
I suppose what I want is the memory efficiency of toLocalIterator and the speed of collect. Is there any such thing? On Mon, Jun 9, 2014 at 3:19 PM, Sung Hwan Chung wrote: > Hello, > > I noticed that the final reduce function happens in the driver node with a > code that looks like the followin