Spark version: 1.6.0

Here is the background: I have a DataFrame (Large_Row_DataFrame), created from an array of Row objects, and another array of unique ids (U_ID) that I use to look up rows in the cached Large_Row_DataFrame and apply a customized function. For each unique id, the lookup does a collect on the cached DataFrame, so there is a whole series of collect actions that Spark has to run. Since I execute this in a loop over the unique ids (U_ID), all these collect actions run sequentially.
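For concreteness, the sequential version looks roughly like this; it is a minimal sketch, and largeRowDF, uIds, the "id" column, and customFn are placeholder names for the pieces described above:

    import org.apache.spark.sql.{DataFrame, Row}

    // Placeholder names: largeRowDF is Large_Row_DataFrame, uIds is the U_ID
    // array, and customFn stands in for the customized per-id function.
    def sequentialLookups(largeRowDF: DataFrame, uIds: Array[String])
                         (customFn: Array[Row] => Unit): Unit = {
      largeRowDF.cache() // keep the large DataFrame in memory across lookups
      uIds.foreach { id =>
        // Each iteration triggers its own collect job, and the driver waits
        // for that job to finish before submitting the next one.
        val rows = largeRowDF.filter(largeRowDF("id") === id).collect()
        customFn(rows)
      }
    }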
Solution that I implemented: To avoid waiting on each collect in turn, I split the unique ids into a few subsets of a specific size and ran one thread per subset. Each thread submits its collects sequentially, but only for its own subset, and I created as many threads as there are subsets. Surprisingly, the resulting run time is better than the earlier sequential approach.

Now the question: is multithreading a correct approach to this problem, or could there be a better way of doing it?
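For reference, the threaded variant looks roughly like the sketch below (same placeholder names as above). I used a fixed thread pool with Scala Futures, which is one way to wire it up; Spark's scheduler is thread-safe, so actions submitted from different driver threads can run as concurrent jobs:

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration
    import org.apache.spark.sql.{DataFrame, Row}

    // Same placeholder names as the sequential sketch. Each Future walks
    // its own subset of ids sequentially; the subsets run in parallel.
    def threadedLookups(largeRowDF: DataFrame, uIds: Array[String],
                        numThreads: Int)(customFn: Array[Row] => Unit): Unit = {
      val pool = Executors.newFixedThreadPool(numThreads)
      implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

      // Roughly numThreads subsets of a fixed size, one Future per subset.
      val subsetSize = math.max(1, uIds.length / numThreads)
      val futures = uIds.grouped(subsetSize).map { subset =>
        Future {
          subset.foreach { id =>
            val rows = largeRowDF.filter(largeRowDF("id") === id).collect()
            customFn(rows)
          }
        }
      }.toSeq

      Await.result(Future.sequence(futures), Duration.Inf)
      pool.shutdown()
    }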