There should not be any need to explicitly make the DF-2 and DF-3
computations parallel. Spark generates an execution plan and can decide
what to run in parallel (ideally you should see the stages running in
parallel in the Spark UI).
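That said, each DataFrame action blocks the thread that calls it, so if you
do want the two downstream jobs submitted at the same time from a single
session, the usual pattern is to launch the actions from separate threads
(Scala Futures here). A minimal, self-contained sketch; the data, output
paths, and job bodies are placeholders, not your actual Proc-2/Proc-3:

    import org.apache.spark.sql.SparkSession

    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    import scala.concurrent.{Await, Future}

    object ParallelActions extends App {
      val spark = SparkSession.builder().appName("parallel-actions").getOrCreate()

      // df1 stands in for the result of Proc-1; df2/df3 for Proc-2/Proc-3.
      val df1 = spark.range(1000000L).toDF("id").cache()
      val df2 = df1.where("id % 2 = 0")
      val df3 = df1.where("id % 2 = 1")

      // Each write is a blocking action, so run each on its own thread;
      // Spark's scheduler is thread-safe and runs the two jobs concurrently
      // when the executors have spare capacity.
      val job2 = Future { df2.write.mode("overwrite").parquet("/tmp/out/df2") }
      val job3 = Future { df3.write.mode("overwrite").parquet("/tmp/out/df3") }

      // Block until both jobs finish before stopping the session.
      Await.result(Future.sequence(Seq(job2, job3)), Duration.Inf)
      spark.stop()
    }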

You should cache DF-1 if possible (either in memory or on disk); otherwise
the computations of DF-2 and DF-3 may each trigger a recomputation of
DF-1.
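A minimal sketch of that caching pattern; proc1 and the column logic are
hypothetical stand-ins for the real pipeline:

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.storage.StorageLevel

    object CacheDf1 extends App {
      val spark = SparkSession.builder().appName("cache-df1").getOrCreate()

      // Stand-in for the real Proc-1 pipeline.
      def proc1(spark: SparkSession): DataFrame =
        spark.range(1000000L).toDF("id")

      // Persist DF-1 so Proc-2 and Proc-3 reuse it instead of re-running
      // Proc-1; MEMORY_AND_DISK spills partitions to disk when they don't
      // fit in memory.
      val df1 = proc1(spark).persist(StorageLevel.MEMORY_AND_DISK)
      df1.count() // optional: a cheap action that materializes the cache eagerly

      val df2 = df1.where("id % 2 = 0") // stand-in for Proc-2
      val df3 = df1.where("id % 2 = 1") // stand-in for Proc-3

      // Actions on DF-2/DF-3 now read the cached DF-1 rather than
      // recomputing it.
      val n2 = df2.count()
      val n3 = df3.count()

      df1.unpersist() // release the cache once both jobs are done
      spark.stop()
    }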

--
Raghavendra


On Sat, Dec 5, 2020 at 12:31 AM Artemis User <arte...@dtechspace.com> wrote:

> We have a Spark job that produces a result data frame, say DF-1, at the
> end of the pipeline (i.e. Proc-1).  From DF-1, we need to create two or
> more data frames, say DF-2 and DF-3, via additional SQL or ML processes,
> i.e. Proc-2 and Proc-3.  Ideally, we would like to perform Proc-2 and
> Proc-3 in parallel, since they can be executed independently, with DF-1
> immutable and DF-2 and DF-3 mutually exclusive.
>
> Does Spark have some built-in APIs to support spawning sub-jobs in a
> single session?  If multi-threading is needed, what are the common best
> practices in this case?
>
> Thanks in advance for your help!
>
> -- ND
