I was looking for related information and found: http://spark-summit.org/wp-content/uploads/2013/10/Spark-Ops-Final.pptx
See also http://hbase.apache.org/book.html#perf.hdfs.configs.localread
for how short-circuit read is enabled.

Cheers

On Fri, Jan 9, 2015 at 3:50 PM, Sean Owen <[email protected]> wrote:

> Spark uses MapReduce InputFormat implementations to read data from
> disk, so in that sense it has access to, and uses, the same locality
> info that things like MR do. Yes, tasks go to the data, and you want
> to run Spark on top of the HDFS DataNodes. (Locality isn't always the
> only priority that determines where tasks are scheduled, but it
> certainly matters.) I'm not qualified enough to explain it in more
> detail, compared to others here.
>
> On Fri, Jan 9, 2015 at 10:13 PM, zfry <[email protected]> wrote:
> > I am running Spark 1.1.1 built against CDH4 and have a few questions
> > regarding Spark performance related to co-location with HDFS nodes.
> >
> > I want to know whether (and how efficiently) Spark takes advantage of
> > being co-located with an HDFS node?
> >
> > What I mean by this is: if a file is being read by a Spark executor
> > and that file (or most of its blocks) is located in an HDFS DataNode
> > on the same machine as a Spark worker, will it read directly off of
> > disk, or does that data have to travel through the network in some
> > way? Is there a distinct advantage to putting HDFS and Spark on the
> > same box if it is possible or, due to the way blocks are distributed
> > about a cluster, are we so likely to be moving files over the network
> > that co-location doesn’t really make that much of a difference?
> >
> > Also, do you know of any papers/books/other resources (other than
> > trying to dig through the Spark code) which do a good job of
> > explaining the Spark/HDFS data workflow (i.e. how data moves from
> > disk -> HDFS -> Spark -> HDFS)?
> >
> > Thanks!
> > Zach
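
For reference, the short-circuit read mentioned above is a client-side
HDFS setting. A minimal sketch of the hdfs-site.xml fragment follows.
The two property names are the standard HDFS ones; the socket path is
only an example value and an assumption here, so use whatever path your
DataNodes are configured with. This also assumes a Hadoop/CDH release
that supports domain-socket short-circuit reads and has the native
libhadoop library available, so check your distribution's documentation
before relying on it.

```xml
<!-- hdfs-site.xml, needed on both the DataNodes and the HDFS clients
     (here, the Spark executors' Hadoop configuration). -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <!-- Example path only (assumption); it must match the DataNode setting. -->
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
```

With this enabled, a client reading a block that lives on the same
machine can read the block file directly from local disk instead of
streaming it through the DataNode over a TCP socket.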
