dirs.par.foreach { case (src, dest) => sc.textFile(src).process.saveAsTextFile(dest) }

(Note the RDD save method is saveAsTextFile, not saveAsFile.)
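If the one-liner is too terse, here is a minimal self-contained sketch of the same idea. The paths in `dirs` are placeholders, and since your `process` step wasn't shown, it is replaced by an identity map here:

import org.apache.spark.{SparkConf, SparkContext}

object ParallelDirs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parallel-dirs"))

    // hypothetical (src, dest) pairs -- substitute your real directories
    val dirs = Seq(
      ("hdfs:///input/a", "hdfs:///output/a"),
      ("hdfs:///input/b", "hdfs:///output/b")
    )

    // .par makes the Seq a parallel collection, so each (src, dest)
    // pair is handled on its own driver-side thread. SparkContext is
    // thread-safe, and jobs submitted from separate threads run
    // concurrently within the application.
    dirs.par.foreach { case (src, dest) =>
      sc.textFile(src)
        .map(identity)            // stand-in for your `process` step
        .saveAsTextFile(dest)
    }

    sc.stop()
  }
}

One caveat: within a single application, concurrent jobs are scheduled FIFO by default, so setting spark.scheduler.mode=FAIR may help the jobs share executors more evenly when some directories only fill one partition.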
Is that sufficient for you?

On Tuesday, December 2, 2014, Anselme Vignon <anselme.vig...@flaminem.com> wrote:

> Hi folks,
>
> We have written a Spark job that scans multiple HDFS directories and
> performs transformations on them.
>
> For now, this is done with a simple for loop that starts one task at
> each iteration. It looks like:
>
> dirs.foreach { case (src, dest) =>
>   sc.textFile(src).process.saveAsFile(dest)
> }
>
> However, each iteration is independent, and we would like to optimize
> this by running them with Spark simultaneously (or in a chained
> fashion), so that we don't have idle executors at the end of each
> iteration (some directories sometimes only require one partition).
>
> Has anyone already done such a thing? How would you suggest we do
> that?
>
> Cheers,
>
> Anselme