Your approach still requires that all data be sent through the master every time you want to process it. You probably want to use Hadoop HDFS as a distributed file system instead and exploit its data-locality features: Spark schedules tasks on the workers that already hold the relevant data blocks, so the files do not have to travel across the LAN for every job. That seems well suited to your scenario. However, I recommend using HDFS together with a Hadoop distribution that includes Spark. On Windows, only Hortonworks supports this. Alternatively, you can build your own distribution; that is feasible, but more effort.
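For illustration, here is a minimal PySpark sketch of what the HDFS-backed setup would look like. The host names, port, and path below are placeholders, not real addresses; substitute your own master URL and NameNode. (Cluster-dependent, so it only runs against an actual Spark/HDFS deployment.)

```python
from pyspark.sql import SparkSession

# Hypothetical cluster addresses -- replace with your own master and NameNode.
spark = (SparkSession.builder
         .appName("hdfs-locality-example")
         .master("spark://master-host:7077")
         .getOrCreate())

# Reading from an hdfs:// path instead of a Windows share
# (e.g. "file:////master/shared/data") lets the scheduler place each task
# on a worker that already stores the corresponding HDFS blocks, so the
# multi-hundred-gigabyte files are not pulled through the master per job.
lines = spark.read.text("hdfs://namenode-host:8020/data/input")
print(lines.count())

spark.stop()
```

With this layout each worker reads mostly local blocks, rather than streaming everything over the network from a shared folder on the master.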
> On 27 Nov 2015, at 23:29, Shuo Wang <shuo.x.w...@gmail.com> wrote:
>
> Hi,
>
> I am trying to build a small home spark cluster on windows. I have a
> question regarding how to share the data files for the master node and worker
> nodes to process. The data files are pretty large, a few 100G.
>
> Can I just use windows shared folder as the file path for my driver/master,
> and worker nodes, where my worker nodes exist on the same LAN as my
> driver/master, and the shared folder is on my master node?
>
> --
> 王硕
> Email: shuo.x.w...@gmail.com
> Whatever your journey, keep walking.