Hi Spark users, I am looking for a way to parallelize #A and #B in the code below. Since DataFrames in Spark are immutable, #A and #B are completely independent operations.
My questions are:
1) As of Spark 2.1, #B only starts after #A has completed. Is that right?
2) What is the best way to parallelize #A and #B, given an unlimited number of compute nodes?
Any explanations or pointers are appreciated!

df = spark.createDataFrame(...)
model1 = pipeline1.fit(df)  #A
model2 = pipeline2.fit(df)  #B
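
For question 2, would something like the following driver-side threading sketch be a reasonable approach? This is only a sketch and assumes the two fits can safely be submitted from separate threads on the driver (pipeline1, pipeline2, and df are the same objects as above):

from concurrent.futures import ThreadPoolExecutor

# Submit the two fits from separate driver threads so Spark can schedule
# their jobs concurrently instead of strictly one after the other.
with ThreadPoolExecutor(max_workers=2) as pool:
    future1 = pool.submit(pipeline1.fit, df)  #A
    future2 = pool.submit(pipeline2.fit, df)  #B
    model1 = future1.result()
    model2 = future2.result()

I believe setting spark.scheduler.mode to FAIR would also help the two sets of jobs share executors more evenly, but I am not sure whether this is the recommended pattern.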