mapPartitions tried to hold data is memory which did not work for me.. I am doing flatMap followed by groupByKey now with HashPartitioner and number of blocks is 60 (Based on 120 cores I am running the job on)...
Now when the shuffle size < 100 GB it works fine...as flatMap shuffle goes to 200 GB, 400 GB...I am getting: FetchFailed(BlockManagerId(1, istgbd013.verizon.com, 44377, 0), shuffleId=37, mapId=8, reduceId=54) I have to shuffle because the memory on cluster is less than the shuffle size of 400 GB.. The job runs fine if I sample and decrease my shuffle size within 100 GB.. Does groupByKey does a combiner similar to reduceByKey and aggregateByKey ? I need a combiner operation to do some work on map side after flatMap followed by rest of the work on reducers.. On Wed, Nov 12, 2014 at 8:35 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote: > flatmap would have to shuffle data only if output RDD is expected to be > partitioned by some key. > RDD[X].flatmap(X=>RDD[Y]) > If it has to shuffle it should be local. > > Mayur Rustagi > Ph: +1 (760) 203 3257 > http://www.sigmoidanalytics.com > @mayur_rustagi <https://twitter.com/mayur_rustagi> > > > On Thu, Nov 13, 2014 at 7:31 AM, Debasish Das <debasish.da...@gmail.com> > wrote: > >> Hi, >> >> I am doing a flatMap followed by mapPartitions to do some blocked >> operation...flatMap is shuffling data but this shuffle is strictly >> shuffling to disk and not over the network right ? >> >> Thanks. >> Deb >> > >