Hi, Suppose I have a file stored locally on my master machine, and the same file is also present at the same path on all the worker machines, say /home/user_name/Desktop. When we partition the data using sc.parallelize, does Spark actually broadcast parts of the RDD to all the worker machines, or does each worker read the corresponding segment locally from its own memory?
How do I avoid this data movement? Would it help if I stored the file in HDFS? Thanks and regards, Disha