Hi Jacek,

I found this page of your book here:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-data-locality.html
which says:

"It is therefore important to have Spark running on Hadoop YARN cluster
<https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/>
if the data comes from HDFS. In Spark on YARN
<https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/>
Spark tries to place tasks alongside HDFS blocks."

So my reasoning was that since Spark takes care of data locality when
workers load data from HDFS, I can't see why running on YARN is more
important.

Hope this makes my question clearer.

On Tue, Aug 2, 2016 at 3:54 PM, Jacek Laskowski <ja...@japila.pl> wrote:

> On Mon, Aug 1, 2016 at 5:56 PM, Jestin Ma <jestinwith.a...@gmail.com> wrote:
> > Hi Nikolay, I'm looking at data locality improvements for Spark, and I have
> > conflicting sources on using YARN for Spark.
> >
> > Reynold said that Spark workers automatically take care of data locality
> > here:
> > https://www.quora.com/Does-Apache-Spark-take-care-of-data-locality-when-Spark-workers-load-data-from-HDFS
> >
> > However, I've read elsewhere
> > (https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/)
> > that Spark on YARN increases data locality because YARN tries to place tasks
> > next to HDFS blocks.
> >
> > Can anyone verify/support one side or the other?
>
> Hi Jestin,
>
> I'm the author of the latter. I can't seem to find how Reynold
> "conflicts" with what I wrote in the notes? Could you elaborate?
>
> I certainly may be wrong.
>
> Regards,
> Jacek Laskowski
> ----
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski