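Instead of .map, you could try .mapPartitions and compare the performance. mapPartitions calls your function once per partition with an iterator over its records, rather than once per record, so any expensive setup can be done once per partition.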
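A rough sketch of what that could look like, assuming PySpark (the snippets in the original mail do not say which language binding is in use). The names expensive_setup, time_intensive_compute, compute_partition and run are hypothetical placeholders, not part of your job or of the Spark API; only filter, mapPartitions, parallelize and collect are real RDD/SparkContext calls.

```python
def expensive_setup():
    # Hypothetical stand-in for costly one-time work
    # (e.g. loading a model or opening a connection).
    return {"scale": 2}


def time_intensive_compute(record, resource):
    # Hypothetical stand-in for the real per-record computation.
    return record * resource["scale"]


def compute_partition(records):
    # Spark passes an iterator over all records of ONE partition.
    # The setup below therefore runs once per partition, not once per record.
    resource = expensive_setup()
    for record in records:
        # Yield lazily so the whole partition is never held in memory at once.
        yield time_intensive_compute(record, resource)


def run(sc, some_rules):
    # Assumed wiring: sc is an existing SparkContext and some_rules is a
    # predicate function. The input here is toy data for illustration only.
    data = sc.parallelize(range(10))
    return data.filter(some_rules).mapPartitions(compute_partition).collect()
```

Whether this helps depends on how much of the cost in your compute step is per-record setup; if the function is pure computation with nothing to share across records, .map and .mapPartitions should perform about the same.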

Thanks
Best Regards

On Fri, Sep 18, 2015 at 2:47 AM, Gavin Yue <yue.yuany...@gmail.com> wrote:

> For a large dataset, I want to filter out something and then do the
> computing intensive work.
>
> What I am doing now:
>
> Data.filter(somerules).cache()
> Data.count()
>
> Data.map(timeintensivecompute)
>
> But this sometimes takes an unusually long time due to cache misses and
> recalculation.
>
> So I changed to this way.
>
> Data.filter.saveAsTextFile()
>
> sc.textFile().map(timeintensivecompute)
>
> Second one is even faster.
>
> How could I tune the job to reach maximum performance?
>
> Thank you.