I found the following to be a good discussion of the same topic:
http://apache-spark-user-list.1001560.n3.nabble.com/The-concurrent-model-of-spark-job-stage-task-td13083.html
 


> From: so...@cloudera.com
> Date: Tue, 20 Jan 2015 10:02:20 +0000
> Subject: Re: Does Spark automatically run different stages concurrently when possible?
> To: paliwalash...@gmail.com
> CC: davidkl...@hotmail.com; user@spark.apache.org
> 
> You can persist the RDD in (2) right after it is created. That will not
> cause it to be persisted immediately, but rather the first time it is
> materialized. If you only persist after (3) has been calculated, then it
> will be re-calculated (and persisted) when (4) is calculated.
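> 
> A minimal sketch of that placement, using the names from the example
> quoted below (the startsWith checks and the counts are made-up stand-ins
> for whatever "type A/B" and "process" really mean):
> 
>     val lines  = sc.textFile("hdfs:///data/input.txt")
>     val ABonly = lines.filter(l => l.startsWith("A") || l.startsWith("B"))
>     ABonly.persist()  // only marks the RDD; nothing is cached yet
> 
>     // First action: computes (1) and (2), caches ABonly, counts A rows.
>     val processA = ABonly.filter(_.startsWith("A")).count()
>     // Second action: reads ABonly from the cache instead of recomputing.
>     val processB = ABonly.filter(_.startsWith("B")).count()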
> 
> On Tue, Jan 20, 2015 at 3:38 AM, Ashish <paliwalash...@gmail.com> wrote:
> > Sean,
> >
> > A related question: when should we persist the RDD, after step 2 or
> > after step 3 (nothing would happen before step 3, I assume)?
> >
> > On Mon, Jan 19, 2015 at 5:17 PM, Sean Owen <so...@cloudera.com> wrote:
> >> From the OP:
> >>
> >> (1) val lines = Import full dataset using sc.textFile
> >> (2) val ABonly = Filter out all rows from "lines" that are not of type A or B
> >> (3) val processA = Process only the A rows from ABonly
> >> (4) val processB = Process only the B rows from ABonly
> >>
> >> I assume that 3 and 4 are actions, or else nothing happens here at all.
> >>
> >> When 3 is invoked, it will compute 1, then 2, then 3. Step 4 will happen
> >> after 3, and may even cause 1 and 2 to be computed again if nothing is
> >> persisted.
> >>
> >> You can invoke 3 and 4 in parallel on the driver if you like. That's
> >> fine. But actions are blocking in the driver.
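> >>
> >> A sketch of what that could look like, using plain Scala Futures on
> >> the driver (counts stand in for the real "process" steps, and ABonly
> >> is assumed from the example above):
> >>
> >>     import scala.concurrent.{Await, Future}
> >>     import scala.concurrent.ExecutionContext.Implicits.global
> >>     import scala.concurrent.duration.Duration
> >>
> >>     // Each action blocks the thread that calls it, so give each its own.
> >>     val fa = Future { ABonly.filter(_.startsWith("A")).count() }
> >>     val fb = Future { ABonly.filter(_.startsWith("B")).count() }
> >>
> >>     // Spark can schedule the two jobs concurrently if resources allow.
> >>     val (countA, countB) = Await.result(fa.zip(fb), Duration.Inf)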
> >>
> >> On Mon, Jan 19, 2015 at 8:21 AM, davidkl <davidkl...@hotmail.com> wrote:
> >>> Hi Jon, I am looking for an answer to a similar question in the docs
> >>> now; so far no clue.
> >>>
> >>> I would need to know what Spark's behaviour is in a situation like the
> >>> example you provided, but also taking into account that there are
> >>> multiple partitions/workers.
> >>>
> >>> I could imagine that different Spark workers are not synchronized, in
> >>> the sense that they do not wait for each other before progressing to
> >>> the next step/stage on the partitions of data they are assigned,
> >>> whereas in streaming I believe they wait for the current batch to
> >>> complete before they start working on a new one.
> >>>
> >>> In the code I am working on, I need to make sure a particular step is
> >>> completed (in all workers, for all partitions) before the next
> >>> transformation is applied.
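> >>>
> >>> The only way I can think of so far is a sketch like this: persist the
> >>> intermediate RDD and force it with an action, so the driver blocks
> >>> until every partition is done (prev, step1, step2 are made-up names):
> >>>
> >>>     val step = prev.map(step1).persist()  // placeholder names
> >>>     step.count()  // action: blocks until all partitions are computed
> >>>     val next = step.map(step2)  // by here, step1 ran on all partitions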
> >>>
> >>> It would be great if someone could clarify this or point to where the
> >>> doc covers it! :-)
> >>
> >
> >
> >
> > --
> > thanks
> > ashish
> >
> > Blog: http://www.ashishpaliwal.com/blog
> > My Photo Galleries: http://www.pbase.com/ashishpaliwal
> 