I was looking for related information and found:
http://spark-summit.org/wp-content/uploads/2013/10/Spark-Ops-Final.pptx

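See also http://hbase.apache.org/book.html#perf.hdfs.configs.localread for
how short-circuit reads are enabled. With short-circuit reads, an HDFS
client on the same machine as the DataNode reads block files directly
from local disk instead of going through the DataNode's socket.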
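
For reference, here is a rough sketch of the hdfs-site.xml settings that
section describes. The two property names are the standard HDFS ones; the
socket path is only a placeholder I picked for illustration and depends on
your installation, so check it against your distribution's docs:

```xml
<!-- hdfs-site.xml fragment (sketch); set on the DataNodes and on the clients -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <!-- Placeholder path (assumption): use a location valid on your nodes -->
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
```

Spark executors act as HDFS clients here, so the settings would need to be
visible in the Hadoop configuration the executors load, and the Hadoop
native library (libhadoop) has to be available on those nodes.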

Cheers

On Fri, Jan 9, 2015 at 3:50 PM, Sean Owen <[email protected]> wrote:

> Spark uses MapReduce InputFormat implementations to read data from
> disk, so in that sense it has access to, and uses, the same locality
> info that things like MR do. Yes, tasks go to the data, and you want
> to run Spark on top of the HDFS DataNodes. (Locality isn't always the
> only priority that determines where tasks are scheduled, but it
> certainly matters.) I'm not qualified enough to explain it in more
> detail, compared to others here.
>
> On Fri, Jan 9, 2015 at 10:13 PM, zfry <[email protected]> wrote:
> > I am running Spark 1.1.1 built against CDH4 and have a few questions
> > regarding Spark performance related to co-location with HDFS nodes.
> >
> > I want to know whether (and how efficiently) Spark takes advantage of
> > being co-located with an HDFS node.
> >
> > What I mean by this is: if a file is being read by a Spark executor and
> > that file (or most of its blocks) is located on an HDFS DataNode on the
> > same machine as a Spark worker, will it read directly off of disk, or
> > does that data have to travel through the network in some way? Is there
> > a distinct advantage to putting HDFS and Spark on the same box if it is
> > possible or, due to the way blocks are distributed about a cluster, are
> > we so likely to be moving files over the network that co-location
> > doesn’t really make that much of a difference?
> >
> > Also, do you know of any papers/books/other resources (other than trying
> > to dig through the Spark code) which do a good job of explaining the
> > Spark/HDFS data workflow (i.e., how data moves from disk -> HDFS ->
> > Spark -> HDFS)?
> >
> > Thanks!
> > Zach
> >
> > --
> > View this message in context:
> > http://apache-spark-user-list.1001560.n3.nabble.com/Questions-about-Spark-and-HDFS-co-location-tp21070.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
