Re: best practice for paralleling model training

Jacek Laskowski Tue, 24 Jan 2017 14:18:58 -0800

Hi Shiyuan,

Re 1) Yes, but it has (almost) nothing to do with Spark. model1 =
pipeline1.fit(df) is a blocking call, so the driver only moves on to the
next line once that call has returned.


Re 2) Use a concurrency library to issue the two fit calls from separate
threads, e.g. Java's ExecutorService:
https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ExecutorService.html
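
Since the snippet in the question is PySpark, the same idea can be sketched
with Python's concurrent.futures instead of ExecutorService. This is only a
minimal sketch under assumptions: fit_in_parallel is a hypothetical helper
name (not a Spark API), and it assumes each estimator exposes a blocking
fit(df) method, as pipeline1 and pipeline2 do in the question.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_in_parallel(estimators, df, max_workers=None):
    """Call est.fit(df) for every estimator, each from its own thread.

    Hypothetical helper, not part of Spark. Returns the fitted models
    in the same order as `estimators`.
    """
    estimators = list(estimators)
    if not estimators:
        return []
    # Default to one thread per estimator so no fit waits for a free thread.
    workers = max_workers or len(estimators)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # submit() returns immediately, so the fit() calls are in flight together.
        futures = [pool.submit(est.fit, df) for est in estimators]
        # result() blocks until that fit is done and re-raises its exception, if any.
        return [f.result() for f in futures]


# With the names from the question (assumed to be defined already):
# model1, model2 = fit_in_parallel([pipeline1, pipeline2], df)
```

The threads only let the driver submit both fits at the same time; how much
of the work actually overlaps on the cluster still depends on the resources
available and on how the scheduler is configured. Calling df.cache() before
fitting may also help, so that df is not recomputed for each fit.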

Regards,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Jan 24, 2017 at 10:48 PM, Shiyuan <gshy2...@gmail.com> wrote:
> Hi Spark users,
> I am looking for a way to run #A and #B in the code below in parallel. Since
> DataFrames in Spark are immutable, #A and #B are completely independent
> operations.
>
> My questions are:
> 1) As of Spark 2.1, #B only starts when #A has completed. Is that right?
> 2) What's the best way to parallelize #A and #B, given an unlimited number of
> compute nodes?
>
> Any explanations or pointers are appreciated!
>
>
> df = spark.createDataFrame(...)
>
> model1 = pipeline1.fit(df)  #A
> model2 = pipeline2.fit(df)  #B
>
>
