There should not be any need to explicitly make the DF-2 and DF-3 computations parallel. Spark generates execution plans and decides what can run in parallel (ideally you should see them running in parallel in the Spark UI).
You need to cache DF-1 if possible (either in memory or on disk); otherwise the computations of DF-2 and DF-3 might each trigger DF-1's computation in duplicate. A rough sketch follows below the quoted message.

-- Raghavendra

On Sat, Dec 5, 2020 at 12:31 AM Artemis User <arte...@dtechspace.com> wrote:
> We have a Spark job that produces a result data frame, say DF-1, at the
> end of the pipeline (i.e. Proc-1). From DF-1, we need to create two or
> more data frames, say DF-2 and DF-3, via additional SQL or ML processes,
> i.e. Proc-2 and Proc-3. Ideally, we would like to perform Proc-2 and
> Proc-3 in parallel, since Proc-2 and Proc-3 can be executed
> independently, with DF-1 immutable and DF-2 and DF-3 mutually exclusive.
>
> Does Spark have some built-in APIs to support spawning sub-jobs in a
> single session? If multi-threading is needed, what are the common best
> practices in this case?
>
> Thanks in advance for your help!
>
> -- ND
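On the multi-threading question: this is not from the original thread, just a rough sketch of one common pattern, using Scala Futures to submit the two downstream actions concurrently within a single SparkSession after caching DF-1. The names df1, proc2, proc3, the input/output paths, and the transformation logic are all hypothetical placeholders.

    // Sketch: cache DF-1, then run Proc-2 and Proc-3 as concurrent actions
    // from separate driver threads in the same SparkSession.
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.storage.StorageLevel
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    val spark = SparkSession.builder().appName("parallel-downstream").getOrCreate()

    // Proc-1: build DF-1 once and cache it so DF-2 and DF-3 reuse it
    val df1: DataFrame = spark.read.parquet("/data/input")        // placeholder source
      .persist(StorageLevel.MEMORY_AND_DISK)
    df1.count()                                                   // materialize the cache

    // Proc-2 and Proc-3: independent transformations of DF-1 (placeholder logic)
    def proc2(df: DataFrame): DataFrame = df.filter("col_a > 0")
    def proc3(df: DataFrame): DataFrame = df.groupBy("col_b").count()

    // Each Future triggers its own action; the two Spark jobs can be
    // scheduled concurrently on the cluster if resources allow.
    val f2 = Future { proc2(df1).write.mode("overwrite").parquet("/data/out2") }
    val f3 = Future { proc3(df1).write.mode("overwrite").parquet("/data/out3") }

    Await.result(Future.sequence(Seq(f2, f3)), Duration.Inf)
    df1.unpersist()

Depending on the workload you may also want to set spark.scheduler.mode to FAIR so the two concurrent jobs share executors more evenly; with the default FIFO scheduler they can still run concurrently when spare capacity exists.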