Hi Spark users,
I am looking for a way to parallelize #A and #B in the code below. Since
DataFrames in Spark are immutable, #A and #B are completely independent
operations.

My question is:
1) As of Spark 2.1, #B only starts after #A has completed. Is that right?
2) What is the best way to parallelize #A and #B, given an unlimited number of
computing nodes?

Any explanations or pointers are appreciated!


df = spark.createDataFrame(...)

model1 = pipeline1.fit(df)  # A
model2 = pipeline2.fit(df)  # B
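
For context, here is a minimal sketch of one way this could be done, assuming
pipeline1, pipeline2, and df are defined as above: each fit() call blocks the
driver thread that invokes it, but jobs submitted from separate driver threads
can be scheduled by Spark concurrently, so running the two fits in two threads
lets their jobs overlap on the cluster.

from concurrent.futures import ThreadPoolExecutor

# Optionally cache df so both pipelines reuse the same materialized data
# instead of recomputing it.
df.cache()

# Submit the two independent fits from separate driver threads; Spark can
# then schedule their jobs concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    future1 = pool.submit(pipeline1.fit, df)  # A
    future2 = pool.submit(pipeline2.fit, df)  # B
    model1 = future1.result()
    model2 = future2.result()

Whether the jobs actually run side by side also depends on available executor
resources and the scheduler configuration (e.g. the FAIR scheduler pools).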
