Spark will skip a stage if it has already been computed by another job. That
means the common parent RDD of the jobs only needs to be computed once, but
they still run as multiple sequential jobs, not concurrent jobs.
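As a minimal sketch of what that looks like (hypothetical paths, assuming a
spark-shell style sc): the reduceByKey below introduces a shuffle stage that
the first save job computes; when the second save job runs, Spark sees the
shuffle output already exists and shows that stage as "skipped", yet the two
jobs themselves still execute one after the other.

    val counts = sc.textFile("hdfs:///input/events")
      .map(line => (line.split(",")(0), 1))
      .reduceByKey(_ + _)

    // Job 1 computes the shuffle stage; job 2 reuses ("skips") it,
    // but job 2 only starts after job 1 has finished.
    counts.filter(_._2 > 100).saveAsTextFile("hdfs:///out/frequent")
    counts.filter(_._2 <= 100).saveAsTextFile("hdfs:///out/rare")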
On Wed, Mar 9, 2016 at 3:29 PM, Jan Štěrba wrote:
Hi Andy,
it's nice to see that we are not the only ones with the same issues. So
far we have not gone as far as you have. What we have done is that we
cache whatever DataFrames/RDDs are shared for computing different
outputs. This has brought us quite a speedup, but we still see that
saving some l
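To make the caching approach concrete, here is a minimal sketch under assumed
names (sqlContext as in spark-shell, a hypothetical "status" column and
hypothetical paths): the shared DataFrame is cached once and each output job
reuses it instead of re-reading and re-transforming the source.

    val shared = sqlContext.read.parquet("hdfs:///input/events").cache()

    // Both output jobs reuse the cached data rather than recomputing it,
    // but each write still runs as its own job, one after the other.
    shared.filter("status = 'ERROR'").write.parquet("hdfs:///out/errors")
    shared.filter("status <> 'ERROR'").write.parquet("hdfs:///out/ok")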
We have a somewhat complex pipeline which produces multiple output files on
HDFS, and we'd like the materialization of those outputs to happen
concurrently.
Internal to Spark, any "save" call creates a new "job", which runs
synchronously -- that is, the line of code after your save() only executes
once the save has completed.
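One workaround, sketched below using the assumptions of the earlier example
(the cached shared DataFrame and hypothetical output paths), is to issue each
save from its own thread: Spark's scheduler accepts jobs submitted from
multiple threads, so the writes can run concurrently, and setting
spark.scheduler.mode=FAIR lets them share executors more evenly. This is only
a sketch, not necessarily how the original pipeline does it.

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // Each write blocks the thread that calls it, so wrap each one in a
    // Future; the two jobs are then submitted and run concurrently.
    val saves = Seq(
      Future { shared.filter("status = 'ERROR'").write.parquet("hdfs:///out/errors") },
      Future { shared.filter("status <> 'ERROR'").write.parquet("hdfs:///out/ok") }
    )
    Await.result(Future.sequence(saves), Duration.Inf)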