dirs.par.foreach { case (src, dest) => sc.textFile(src).process.saveAsTextFile(dest) }

(Note the RDD save method is saveAsTextFile, not saveAsFile.)
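If the one-liner is too terse, here is a minimal self-contained sketch of the same idea. The paths in `dirs` are placeholders, and since your `process` step wasn't shown, it is replaced by an identity map here:

import org.apache.spark.{SparkConf, SparkContext}

object ParallelDirs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parallel-dirs"))

    // hypothetical (src, dest) pairs -- substitute your real directories
    val dirs = Seq(
      ("hdfs:///input/a", "hdfs:///output/a"),
      ("hdfs:///input/b", "hdfs:///output/b")
    )

    // .par makes the Seq a parallel collection, so each (src, dest)
    // pair is handled on its own driver-side thread. SparkContext is
    // thread-safe, and jobs submitted from separate threads run
    // concurrently within the application.
    dirs.par.foreach { case (src, dest) =>
      sc.textFile(src)
        .map(identity)            // stand-in for your `process` step
        .saveAsTextFile(dest)
    }

    sc.stop()
  }
}

One caveat: within a single application, concurrent jobs are scheduled FIFO by default, so setting spark.scheduler.mode=FAIR may help the jobs share executors more evenly when some directories only fill one partition.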
Is that sufficient for you?

On Tuesday, December 2, 2014, Anselme Vignon <anselme.vig...@flaminem.com> wrote:

> Hi folks,
>
> We have written a Spark job that scans multiple HDFS directories and
> performs transformations on them.
>
> For now, this is done with a simple for loop that starts one task at
> each iteration. It looks like:
>
> dirs.foreach { case (src, dest) =>
>   sc.textFile(src).process.saveAsFile(dest)
> }
>
> However, each iteration is independent, and we would like to optimize
> this by running them with Spark simultaneously (or in a chained
> fashion), so that we don't have idle executors at the end of each
> iteration (some directories sometimes only require one partition).
>
> Has anyone already done such a thing? How would you suggest we do
> that?
>
> Cheers,
>
> Anselme