Re: Tuning level of Parallelism: Increase or decrease?

Jestin Ma Tue, 02 Aug 2016 16:12:06 -0700

Hi Jacek,
I found this page of your book here:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-data-locality.html

which says: "It is therefore important to have Spark running on Hadoop YARN
cluster <https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/>
if the data comes from HDFS. In Spark on YARN
<https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/>
Spark tries to place tasks alongside HDFS blocks."

My reasoning is this: if Spark itself already takes care of data locality
when workers load data from HDFS, I can't see why running on YARN in
particular would be important for locality.

Hope this makes my question clearer.

On Tue, Aug 2, 2016 at 3:54 PM, Jacek Laskowski <ja...@japila.pl> wrote:

> On Mon, Aug 1, 2016 at 5:56 PM, Jestin Ma <jestinwith.a...@gmail.com>
> wrote:
> > Hi Nikolay, I'm looking at data locality improvements for Spark, and I have
> > conflicting sources on using YARN for Spark.
> >
> > Reynold said that Spark workers automatically take care of data locality
> > here:
> >
> > https://www.quora.com/Does-Apache-Spark-take-care-of-data-locality-when-Spark-workers-load-data-from-HDFS
> >
> > However, I've read elsewhere
> > (https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/)
> > that Spark on YARN increases data locality because YARN tries to place tasks
> > next to HDFS blocks.
> >
> > Can anyone verify/support one side or the other?
>
> Hi Jestin,
>
> I'm the author of the latter. I can't see how Reynold's answer
> "conflicts" with what I wrote in the notes. Could you elaborate?
>
> I certainly may be wrong.
>
> Regards,
> Jacek Laskowski
> ----
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
