I hate to say this, but your friend is right. Spark slaves (executors)
really do pull the data themselves. In fact, this is standard practice in the
distributed world, e.g. Hadoop. It is not practical to pass a large amount of
data through the master, nor would that allow the data to be read in parallel.

You can either use Spark's own way of splitting the data, or you can use
Hadoop-style input formats, which give a set of splits, or you can use
self-describing formats like Parquet. But essentially your API to the data has
to have a concept of cutting the data into chunks, and Spark just uses that to
decide which "slave" pulls what.
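
To make the idea concrete, here is a toy sketch in plain Python. It is not
Spark's actual API: the names Split, plan_splits and read_split, the in-memory
"storage" dict and the bucket path are all made up for illustration. The
"driver" part only looks at metadata (the file size) to produce split
descriptors, and each "executor" then pulls just the byte range its split
describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """Hypothetical split descriptor: a byte range within one file."""
    path: str
    start: int
    length: int


def plan_splits(path, file_size, split_size):
    """Driver side: needs only metadata (the file size); reads no data."""
    return [
        Split(path, offset, min(split_size, file_size - offset))
        for offset in range(0, file_size, split_size)
    ]


def read_split(storage, split):
    """Executor side: pull only the byte range described by the split."""
    data = storage[split.path]
    return data[split.start:split.start + split.length]


# Stand-in for a remote store such as S3 (assumption: a plain dict).
storage = {"s3://bucket/data.bin": bytes(range(10))}

# The driver plans the splits without touching the data itself.
splits = plan_splits("s3://bucket/data.bin", file_size=10, split_size=4)

# Each split would be handed to a different executor; here we just loop.
chunks = [read_split(storage, s) for s in splits]
```

In real Spark the split descriptors correspond roughly to partitions, and only
these small descriptors travel from the driver to the executors, never the
data itself.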

(Mesos has nothing to do with this. It's just a cluster manager as far as
Spark is concerned.)

On Thu, May 28, 2015 at 12:22 AM, Stephen Carman <scar...@coldlight.com>
wrote:

> A colleague and I were having a discussion and we were disagreeing about
> something in Spark/Mesos that perhaps someone can shed some light on.
>
> We have a mesos cluster that runs spark via a sparkHome, rather than
> downloading an executable and such.
>
> My colleague says that if, say, we have parquet files in S3, the slaves
> should know what data is in their partition and only pull from S3 the
> partitions of parquet data they need, but this seems inherently wrong to
> me, as I have no idea how it's possible for Spark or Mesos to know which
> partitions to pull on the slave. It makes much more sense to
> me for the partitioning to be done on the driver and then distributed to the
> slaves so the slaves don't necessarily have to worry about these details.
> If this were the case, there is some data loading that is done on the
> driver, correct? Or does Spark/Mesos do some magic to pass a reference so
> the slaves know what to pull, per se?
>
> So I guess in summation, where does partitioning and data loading happen?
> On the driver or on the executor?
>
> Thanks,
> Steve



-- 
Best Regards,
Ayan Guha
