Instead of .map, you can try .mapPartitions and compare the performance.
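
A rough sketch of what that could look like, assuming PySpark. The helper
names (expensive_setup, time_intensive_compute, compute_partition, run) are
made up for illustration and are not from your job. This mainly pays off if
the per-record work has some setup cost that can be done once per partition
instead of once per record. It will not by itself prevent recomputation of
evicted cached partitions.

```python
def expensive_setup():
    # Hypothetical stand-in for costly initialization
    # (loading a model, opening a connection, compiling patterns, ...).
    return {"scale": 2}


def time_intensive_compute(state, record):
    # Hypothetical placeholder for the real per-record computation.
    return len(record) * state["scale"]


def compute_partition(records):
    # Spark calls this once per partition with an iterator over its records,
    # so the setup cost is paid once per partition rather than once per record.
    state = expensive_setup()
    for record in records:
        yield time_intensive_compute(state, record)


def run(filtered_rdd):
    # filtered_rdd is assumed to be an existing RDD, for example the result
    # of the filter step. mapPartitions is a lazy transformation, like map.
    return filtered_rdd.mapPartitions(compute_partition)
```

If there is no such shared setup, .map and .mapPartitions will probably
perform about the same, so it is worth measuring both.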

Thanks
Best Regards

On Fri, Sep 18, 2015 at 2:47 AM, Gavin Yue <yue.yuany...@gmail.com> wrote:

> For a large dataset, I want to filter out some records and then do
> compute-intensive work on the rest.
>
> What I am doing now:
>
> Data.filter(somerules).cache()
> Data.count()
>
> Data.map(timeintensivecompute)
>
> But this sometimes takes an unusually long time due to cache misses and
> recomputation.
>
> So I changed it to this:
>
> Data.filter(somerules).saveAsTextFile()
>
> sc.textFile().map(timeintensivecompute)
>
> The second one is even faster.
>
> How could I tune the job to reach maximum performance?
>
> Thank you.
>
>
