Hi folks,
We have written a Spark job that scans multiple HDFS directories and
performs transformations on them.
For now, this is done with a simple for loop that launches one Spark
job per iteration. It looks like:
dirs.foreach { case (src,dest) => sc.textFile(src).process.saveAsFile(dest) }
However, the iterations are independent of each other, and we would
like to optimize this by running them concurrently (or in a pipelined
fashion), so that we don't end up with idle executors at the end of
each iteration (some directories only need a single partition).
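One direction we were considering, sketched below and untested: since
SparkContext is thread-safe, independent jobs can be submitted from
separate threads (e.g. with Scala Futures), and with the FAIR scheduler
enabled they could then share the executors. Here `dirs`, `process`,
and `saveAsFile` are placeholders from our existing job, not real RDD
API methods:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Submit each independent (src, dest) job from its own thread.
// SparkContext is thread-safe, so these jobs run concurrently
// and can fill executors left idle by small partitions.
val jobs: Seq[Future[Unit]] = dirs.toSeq.map { case (src, dest) =>
  Future {
    sc.textFile(src).process.saveAsFile(dest)
  }
}

// Block until every job has finished.
Await.result(Future.sequence(jobs), Duration.Inf)
```

We'd presumably also set spark.scheduler.mode=FAIR so concurrent jobs
share resources instead of queueing FIFO, but we haven't tried this yet.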
Has anyone already done such a thing? How would you suggest we could do that?
Cheers,
Anselme
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]