Hi Shiyuan,

Re 1) Yes, but it has (almost) nothing to do with Spark: model1 = pipeline1.fit(df) is a blocking call, so the following line is executed only after it has finished.

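Since fit blocks the calling thread, the two fits can only overlap if they are issued from different threads. A minimal sketch of that idea in Python, using the standard library's concurrent.futures. The names fit_all and SlowPipeline are made up for illustration: SlowPipeline is only a stand-in whose fit() sleeps, so the sketch runs without a Spark cluster. With real pipelines you would pass pipeline1 and pipeline2 and the DataFrame instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fit_all(pipelines, df):
    """Call fit(df) on every pipeline from its own thread; return the models in input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(pipelines))) as pool:
        futures = [pool.submit(p.fit, df) for p in pipelines]
        # result() blocks until that fit is done and re-raises any exception it hit.
        return [f.result() for f in futures]


class SlowPipeline:
    """Hypothetical stand-in for a pipeline: fit() simply blocks for a while."""

    def __init__(self, name, seconds):
        self.name = name
        self.seconds = seconds

    def fit(self, df):
        time.sleep(self.seconds)  # mimics a long-running, blocking fit
        return f"{self.name} fitted on {df}"


if __name__ == "__main__":
    start = time.perf_counter()
    model1, model2 = fit_all(
        [SlowPipeline("pipeline1", 0.5), SlowPipeline("pipeline2", 0.5)], "df"
    )
    print(model1, "|", model2)
    # Expect roughly the duration of one fit, not the sum of both.
    print(f"elapsed: {time.perf_counter() - start:.2f}s")
```

With Spark this would be called as model1, model2 = fit_all([pipeline1, pipeline2], df). How much the two jobs actually overlap depends on the resources available to the application and on the scheduler configuration (spark.scheduler.mode), so treat this as a starting point rather than a guarantee of a speed-up.
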
Re 2) Use a concurrency library, e.g. Java's ExecutorService (https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ExecutorService.html), and submit each fit call from its own thread.

Regards,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

On Tue, Jan 24, 2017 at 10:48 PM, Shiyuan <gshy2...@gmail.com> wrote:
> Hi spark users,
> I am looking for a way to parallelize #A and #B in the code below. Since
> DataFrames in Spark are immutable, #A and #B are completely separate
> operations.
>
> My questions are:
> 1) As of Spark 2.1, #B only starts when #A is completed. Is that right?
> 2) What's the best way to parallelize #A and #B given an unlimited number of
> computing nodes?
>
> Any explanations or pointers are appreciated!
>
> df = spark.createDataFrame(...)
>
> model1 = pipeline1.fit(df) #A
> model2 = pipeline2.fit(df) #B