I hate to say this, but your friend is right. Spark slaves (executors) really do pull the data themselves. In fact, this is standard practice in the distributed world, e.g. Hadoop. It is not practical to pass a large amount of data through the master, nor would that give you a way to read the data in parallel.
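To picture the division of labor, here is a toy sketch in plain Python. It is not Spark's actual code, and the names (plan_splits, read_split, run_demo) are made up for illustration. A local temp file stands in for an object in S3, and threads stand in for executors. The "driver" only looks at the size of the data and describes the chunks; each "executor" then pulls just the byte range it was handed.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def plan_splits(total_size, split_size):
    """Driver side (hypothetical helper): describe chunks as (offset, length).

    Only the size of the data is needed here, never its contents.
    """
    return [(offset, min(split_size, total_size - offset))
            for offset in range(0, total_size, split_size)]


def read_split(path, split):
    """Executor side (hypothetical helper): pull only the assigned byte range."""
    offset, length = split
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def run_demo():
    payload = bytes(range(256)) * 4  # 1024 bytes of stand-in data
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        # The "driver" sees only metadata (the file size).
        splits = plan_splits(os.path.getsize(path), 300)
        # Each "executor" (a thread here) reads its own split in parallel.
        with ThreadPoolExecutor(max_workers=4) as pool:
            chunks = list(pool.map(lambda s: read_split(path, s), splits))
    finally:
        os.remove(path)
    return splits, chunks, payload


if __name__ == "__main__":
    splits, chunks, payload = run_demo()
    print(splits)
    print(b"".join(chunks) == payload)
```

The point of the sketch is that what travels from the driver to the workers is a small description of each chunk, not the data itself.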
You can use Spark's own way of splitting the data, or Hadoop-style input formats, which give you a set of splits, or self-describing formats like Parquet. Essentially, though, your API to the data has to have a concept of cutting the data into chunks, and Spark just uses that to decide which "slave" pulls what. (Mesos has nothing to do with this. It's just a cluster manager as far as Spark is concerned.)

On Thu, May 28, 2015 at 12:22 AM, Stephen Carman <scar...@coldlight.com> wrote:

> A colleague and I were having a discussion and we were disagreeing about
> something in Spark/Mesos that perhaps someone can shed some light on.
>
> We have a Mesos cluster that runs Spark via a sparkHome, rather than
> downloading an executable and such.
>
> My colleague says that if we have Parquet files in S3, the slaves should
> know what data is in their partition and only pull from S3 the
> partitions of Parquet data they need, but this seems inherently wrong to
> me, as I have no idea how it's possible for Spark or Mesos to know what
> partitions to pull on the slave. It makes much more sense to
> me for the partitioning to be done on the driver and then distributed to the
> slaves, so the slaves don't necessarily have to worry about these details.
> If this were the case, there is some data loading that is done on the
> driver, correct? Or does Spark/Mesos do some magic to pass a reference so
> the slaves know what to pull, per se?
>
> So I guess in summation, where do partitioning and data loading happen?
> On the driver or on the executor?
>
> Thanks,
> Steve
--
Best Regards,
Ayan Guha