Related question: is the execution of different stages optimized? I.e.,
will a map followed by a filter require two loops, or will they be
combined into a single one?
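(For what it's worth, my understanding is "combined": map and filter are narrow transformations, so they land in the same stage, and each task pulls every record through both functions in a single pass over its partition. A minimal standalone sketch of that pipelined evaluation, using Python generators as an analogy only; no Spark involved, and all names here are mine:)

```python
# Count how many times the "partition" is actually read, to show that
# chaining map and filter does not cause two traversals.
passes = {"count": 0}

def source():
    # Stand-in for one partition of input records.
    for x in [1, 2, 3, 4, 5]:
        passes["count"] += 1
        yield x

# Build the chain lazily, like RDD transformations: nothing runs yet.
mapped = (x * 2 for x in source())
filtered = (x for x in mapped if x > 4)

assert passes["count"] == 0       # still lazy, no loop has run
result = list(filtered)           # one traversal drives the whole chain
assert result == [6, 8, 10]
assert passes["count"] == 5       # each element was read exactly once
```

Spark's compute chain works the same way: each transformation wraps the parent partition's iterator, and a single pull-based loop in the task evaluates all of them together.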

On Tue, Jan 20, 2015 at 4:33 AM, Bob Tiernay <btier...@hotmail.com> wrote:
> I found the following to be a good discussion of the same topic:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/The-concurrent-model-of-spark-job-stage-task-td13083.html
>
>
>> From: so...@cloudera.com
>> Date: Tue, 20 Jan 2015 10:02:20 +0000
>> Subject: Re: Does Spark automatically run different stages concurrently
>> when possible?
>> To: paliwalash...@gmail.com
>> CC: davidkl...@hotmail.com; user@spark.apache.org
>
>>
>> You can persist the RDD in (2) right after it is created. It will not
>> cause it to be persisted immediately, but rather the first time it is
>> materialized. If you persist after (3) is calculated, then it will be
>> re-calculated (and persisted) after (4) is calculated.
>>
>> On Tue, Jan 20, 2015 at 3:38 AM, Ashish <paliwalash...@gmail.com> wrote:
>> > Sean,
>> >
>> > A related question: when should the RDD be persisted, after step 2
>> > or after step 3? (Nothing would happen before step 3, I assume.)
>> >
>> > On Mon, Jan 19, 2015 at 5:17 PM, Sean Owen <so...@cloudera.com> wrote:
>> >> From the OP:
>> >>
>> >> (1) val lines = Import full dataset using sc.textFile
>> >> (2) val ABonly = Filter out all rows from "lines" that are not of type
>> >> A or B
>> >> (3) val processA = Process only the A rows from ABonly
>> >> (4) val processB = Process only the B rows from ABonly
>> >>
>> >> I assume that 3 and 4 are actions, or else nothing happens here at all.
>> >>
>> >> When 3 is invoked, it will compute 1, then 2, then 3. 4 will happen
>> >> after 3, and may even cause 1 and 2 to happen again if nothing is
>> >> persisted.
>> >>
>> >> You can invoke 3 and 4 in parallel on the driver if you like. That's
>> >> fine. But actions are blocking in the driver.
>> >>
>> >>
>> >>
>> >> On Mon, Jan 19, 2015 at 8:21 AM, davidkl <davidkl...@hotmail.com>
>> >> wrote:
>> >>> Hi Jon, I am looking for an answer to a similar question in the
>> >>> docs now; so far, no clue.
>> >>>
>> >>> I would need to know what Spark's behaviour is in a situation like
>> >>> the example you provided, but also taking into account that there
>> >>> are multiple partitions/workers.
>> >>>
>> >>> I could imagine that different Spark workers are not synchronized,
>> >>> i.e. they do not wait for each other before progressing to the
>> >>> next step/stage on the partitions of data they are assigned, while
>> >>> I believe in streaming they would wait for the current batch to
>> >>> complete before starting work on a new one.
>> >>>
>> >>> In the code I am working on, I need to make sure a particular step
>> >>> is completed (in all workers, for all partitions) before the next
>> >>> transformation is applied.
>> >>>
>> >>> Would be great if someone could clarify or point to these issues in
>> >>> the doc!
>> >>> :-)
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> View this message in context:
>> >>> http://apache-spark-user-list.1001560.n3.nabble.com/Does-Spark-automatically-run-different-stages-concurrently-when-possible-tp21075p21227.html
>> >>> Sent from the Apache Spark User List mailing list archive at
>> >>> Nabble.com.
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >>> For additional commands, e-mail: user-h...@spark.apache.org
>> >>>
>> >>
>> >>
>> >
>> >
>> >
>> > --
>> > thanks
>> > ashish
>> >
>> > Blog: http://www.ashishpaliwal.com/blog
>> > My Photo Galleries: http://www.pbase.com/ashishpaliwal
>>
>>

