Spark version: 1.6.0 
So, here is the background:

        I have a DataFrame (Large_Row_DataFrame) that I created from an array
of Row objects, and I also have an array of unique ids (U_ID) that I use to
look up rows in Large_Row_DataFrame (which is cached) and apply a customized
function.
        For each lookup, i.e. for each unique id, I do a collect on the cached
DataFrame Large_Row_DataFrame. This means there are a bunch of 'collect'
actions that Spark has to run. Since I'm executing this in a loop over the
unique ids (U_ID), all of those collect actions run sequentially.
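For concreteness, a minimal sketch of this sequential setup in Scala on
Spark 1.6. The "id" column, the sample schema/data, and the per-id
processing (the println standing in for my customized function) are
placeholders I am assuming for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val sc = new SparkContext(new SparkConf().setAppName("SequentialLookups"))
val sqlContext = new SQLContext(sc)

// Build the cached DataFrame from an array of Row objects.
val schema = StructType(Seq(StructField("id", IntegerType), StructField("value", StringType)))
val rows = Array(Row(1, "a"), Row(2, "b"), Row(3, "c"))
val largeRowDF = sqlContext.createDataFrame(sc.parallelize(rows), schema).cache()

val uIds = Array(1, 2, 3) // the array of unique ids (U_ID)

// One collect action per unique id; the actions run strictly one after another.
for (uid <- uIds) {
  val matched = largeRowDF.filter(largeRowDF("id") === uid).collect()
  println(s"id=$uid -> ${matched.length} rows") // placeholder for the customized function
}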

Solution that I implemented:

To avoid waiting on each collect sequentially, I split the unique ids into a
few subsets of a fixed size and ran one thread per subset. Each thread is a
Spark job that runs the collects in sequence, but only for its own subset,
and there are as many threads as subsets. Surprisingly, the resulting run
time is better than with the earlier sequential approach.
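A minimal sketch of that threaded variant, reusing largeRowDF and uIds from
the sketch above. The subset size and the thread-pool size are assumptions
to be tuned; Spark's scheduler is thread-safe, so collects submitted from
different threads run as concurrent jobs on the shared SparkContext:

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

val subsetSize = 100                           // assumed subset size
val pool = Executors.newFixedThreadPool(8)     // assumed number of threads
implicit val ec = ExecutionContext.fromExecutorService(pool)

// One Future per subset; within a subset the collects still run
// sequentially, but the subsets themselves run in parallel.
val futures = uIds.grouped(subsetSize).toSeq.map { subset =>
  Future {
    subset.foreach { uid =>
      val matched = largeRowDF.filter(largeRowDF("id") === uid).collect()
      println(s"id=$uid -> ${matched.length} rows") // placeholder for the customized function
    }
  }
}

Await.result(Future.sequence(futures), Duration.Inf)
pool.shutdown()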

Now the question:

        Is multithreading a correct approach to this problem, or could there
be a better way of doing it?


