Re: Questions about Spark and HDFS co-location

Sean Owen Fri, 09 Jan 2015 15:52:08 -0800

Spark uses Hadoop MapReduce InputFormat implementations to read data
from HDFS, so it has access to, and uses, the same block-locality
information that MapReduce does. Yes, tasks go to the data, and you
want to run Spark workers on the HDFS DataNodes. (Locality isn't the
only factor that determines where tasks are scheduled, but it
certainly matters.) Others here are better qualified than I am to
explain it in more detail.
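
As a rough illustration of "tasks go to the data", here is a toy
sketch in plain Python. It is not Spark's scheduler: the function
assign_tasks, the host names, and the slot counts are all hypothetical,
and only the labels NODE_LOCAL and ANY are borrowed from Spark's
locality levels. It prefers a host that holds a replica of the block
and falls back to any free host otherwise.

```python
from collections import Counter


def assign_tasks(block_hosts, executor_hosts, slots_per_host):
    """Toy locality-aware placement (hypothetical, not Spark's code).

    block_hosts: dict mapping block id -> list of hosts holding a replica
    executor_hosts: list of hosts where an executor is running
    slots_per_host: how many tasks each host may take in this round
    Returns a dict mapping block id -> (host, locality label).
    """
    free = Counter({h: slots_per_host for h in executor_hosts})
    placement = {}
    for block, replicas in block_hosts.items():
        # First choice: an executor host that stores a replica of the block.
        local = [h for h in replicas if free[h] > 0]
        if local:
            host, level = local[0], "NODE_LOCAL"
        else:
            # Fallback: any host with a free slot; the block is read remotely.
            host = next((h for h in executor_hosts if free[h] > 0), None)
            if host is None:
                raise RuntimeError("no free slots left for block %r" % block)
            level = "ANY"
        free[host] -= 1
        placement[block] = (host, level)
    return placement


# Assumed example cluster: executors on two of the three DataNodes.
blocks = {"b0": ["node1", "node2"], "b1": ["node1"], "b2": ["node3"]}
placement = assign_tasks(blocks, ["node1", "node2"], slots_per_host=2)
for block, (host, level) in placement.items():
    print(block, host, level)
```

In this made-up cluster, b2 lives only on node3, which runs no
executor, so its task has to read the block over the network. That is
the case co-location is meant to avoid.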


On Fri, Jan 9, 2015 at 10:13 PM, zfry <[email protected]> wrote:
> I am running Spark 1.1.1 built against CDH4 and have a few questions
> regarding Spark performance related to co-location with HDFS nodes.
>
> I want to know whether (and how efficiently) Spark takes advantage of being
> co-located with an HDFS node?
>
> What I mean by this is: if a file is being read by a Spark executor and that
> file (or most of its blocks) is located in an HDFS DataNode on the same
> machine as a Spark worker, will it read directly off of disk, or does that
> data have to travel through the network in some way? Is there a distinct
> advantage to putting HDFS and Spark on the same box if it is possible or,
> due to the way blocks are distributed about a cluster, are we so likely to
> be moving files over the network that co-location doesn’t really make that
> much of a difference?
>
> Also, do you know of any papers/books/other resources (other than trying to
> dig through the Spark code) which do a good job of explaining the Spark/HDFS
> data workflow (i.e., how data moves from disk -> HDFS -> Spark -> HDFS)?
>
> Thanks!
> Zach
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Questions-about-Spark-and-HDFS-co-location-tp21070.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>

