A map followed by a filter will not produce two stages, but rather a single stage that pipelines the map and the filter.
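As an illustrative sketch (plain Python rather than Spark's Scala API, so it runs anywhere): chained lazy generators make a single pass over the data, which mirrors how Spark pipelines narrow transformations like map and filter inside one stage instead of looping twice.

```python
# Plain-Python analogy (no Spark required): lazy generators pipeline the
# map step and the filter step into one pass over the data.
data = [1, 2, 3, 4, 5]

# Each element flows through "map" and then "filter" before the next
# element is read: one loop, not two.
mapped = (x * 2 for x in data)           # "map" step (lazy)
filtered = (x for x in mapped if x > 4)  # "filter" step (lazy, pipelined)

result = list(filtered)  # the single traversal happens here
```

The same shape in Spark (`rdd.map(...).filter(...)`) is executed as one set of tasks, each applying both functions per element within its partition.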
> On Jan 20, 2015, at 10:26 AM, Kane Kim <kane.ist...@gmail.com> wrote:
>
> Related question: is execution of different stages optimized? I.e.,
> will a map followed by a filter require two loops, or will they be
> combined into a single one?
>
>> On Tue, Jan 20, 2015 at 4:33 AM, Bob Tiernay <btier...@hotmail.com> wrote:
>> I found the following to be a good discussion of the same topic:
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/The-concurrent-model-of-spark-job-stage-task-td13083.html
>>
>>> From: so...@cloudera.com
>>> Date: Tue, 20 Jan 2015 10:02:20 +0000
>>> Subject: Re: Does Spark automatically run different stages concurrently when possible?
>>> To: paliwalash...@gmail.com
>>> CC: davidkl...@hotmail.com; user@spark.apache.org
>>>
>>> You can persist the RDD in (2) right after it is created. That will not
>>> cause it to be persisted immediately, but rather the first time it is
>>> materialized. If you only persist after (3) has been computed, then the
>>> RDD will be re-computed (and persisted) when (4) is calculated.
>>>
>>>> On Tue, Jan 20, 2015 at 3:38 AM, Ashish <paliwalash...@gmail.com> wrote:
>>>> Sean,
>>>>
>>>> A related question: when should the RDD be persisted, after step 2 or
>>>> after step 3? (Nothing would happen before step 3, I assume.)
>>>>
>>>>> On Mon, Jan 19, 2015 at 5:17 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>> From the OP:
>>>>>
>>>>> (1) val lines = Import full dataset using sc.textFile
>>>>> (2) val ABonly = Filter out all rows from "lines" that are not of type A or B
>>>>> (3) val processA = Process only the A rows from ABonly
>>>>> (4) val processB = Process only the B rows from ABonly
>>>>>
>>>>> I assume that 3 and 4 are actions, or else nothing happens here at all.
>>>>>
>>>>> When 3 is invoked, it will compute 1, then 2, then 3. 4 will happen
>>>>> after 3, and may even cause 1 and 2 to happen again if nothing is
>>>>> persisted.
>>>>>
>>>>> You can invoke 3 and 4 in parallel on the driver if you like. That's
>>>>> fine.
>>>>> But actions are blocking in the driver.
>>>>>
>>>>>> On Mon, Jan 19, 2015 at 8:21 AM, davidkl <davidkl...@hotmail.com> wrote:
>>>>>> Hi Jon, I am looking for an answer to a similar question in the docs
>>>>>> now; so far, no clue.
>>>>>>
>>>>>> I would need to know what Spark's behaviour is in a situation like
>>>>>> the example you provided, but taking into account that there are also
>>>>>> multiple partitions/workers.
>>>>>>
>>>>>> I could imagine that different Spark workers are not synchronized in
>>>>>> terms of waiting for each other to progress to the next step/stage
>>>>>> for the partitions of data they are assigned, while I believe in
>>>>>> streaming they would wait for the current batch to complete before
>>>>>> they start working on a new one.
>>>>>>
>>>>>> In the code I am working on, I need to make sure a particular step is
>>>>>> completed (in all workers, for all partitions) before the next
>>>>>> transformation is applied.
>>>>>>
>>>>>> It would be great if someone could clarify these points, or point to
>>>>>> them in the docs! :-)
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Does-Spark-automatically-run-different-stages-concurrently-when-possible-tp21075p21227.html
>>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
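Sean's persist point above can be illustrated with a plain-Python analogy (not the Spark API; a memoized thunk stands in for `persist`): declaring the cache computes nothing, the first action materializes the result, and without any cache each action re-runs the whole lineage.

```python
# Plain-Python analogy for persist timing: work happens on first
# materialization, and without a cache each "action" re-runs the lineage.
lineage_runs = 0

def compute_lineage():
    """Stand-in for recomputing steps (1) and (2) from scratch."""
    global lineage_runs
    lineage_runs += 1
    return [x * 2 for x in [1, 2, 3]]

# No cache: two "actions" trigger two full recomputations of the lineage.
sum_no_cache = sum(compute_lineage())
max_no_cache = max(compute_lineage())
runs_without_cache = lineage_runs

# With a cache (a memoized thunk stands in for persist): declaring it
# computes nothing; the first action materializes once, the second reuses.
lineage_runs = 0
_cache = None

def cached_lineage():
    global _cache
    if _cache is None:  # first materialization only
        _cache = compute_lineage()
    return _cache

sum_cached = sum(cached_lineage())  # lineage runs exactly once here
max_cached = max(cached_lineage())  # reused, no recomputation
runs_with_cache = lineage_runs
```

This is why persisting `ABonly` right after (2) is the natural choice: the persist call itself is free, and the first action (3) then materializes it once for (4) to reuse.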
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>> --
>>>> thanks
>>>> ashish
>>>>
>>>> Blog: http://www.ashishpaliwal.com/blog
>>>> My Photo Galleries: http://www.pbase.com/ashishpaliwal
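The point that actions block the driver thread, yet can still be invoked in parallel, can be sketched with a thread pool (plain Python, not the Spark API; `process_a`/`process_b` are hypothetical stand-ins for actions (3) and (4)):

```python
# Sketch: each action blocks the thread that calls it, so to have two
# independent jobs in flight at once the driver submits each from its
# own thread.
from concurrent.futures import ThreadPoolExecutor

def process_a(rows):
    """Hypothetical stand-in for action (3): process only the 'A' rows."""
    return sum(x for x in rows if x % 2 == 0)

def process_b(rows):
    """Hypothetical stand-in for action (4): process only the 'B' rows."""
    return sum(x for x in rows if x % 2 != 0)

ab_only = [1, 2, 3, 4, 5]
with ThreadPoolExecutor(max_workers=2) as pool:
    job_a = pool.submit(process_a, ab_only)  # both jobs are submitted
    job_b = pool.submit(process_b, ab_only)  # before either is awaited
    res_a, res_b = job_a.result(), job_b.result()  # blocks per-thread here
```

In Spark the same pattern applies on the driver: calling two actions sequentially runs two jobs one after the other, while submitting them from separate threads lets the scheduler run both jobs concurrently (subject to available executors).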