This currently only works for YARN. The standalone default is to place an executor on every node for every job. The total number of executors is specified by the user.

-Sandy
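(As a rough illustration of "specified by the user": on YARN the executor count and resources are typically given at submit time. The resource values, class name, and jar below are placeholders.)

  spark-submit --master yarn-cluster \
    --num-executors 8 \
    --executor-cores 2 \
    --executor-memory 4g \
    --class com.example.MyApp myapp.jar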
On Fri, Jul 18, 2014 at 2:00 AM, Haopu Wang <hw...@qilinsoft.com> wrote:

> Sandy,
>
> Do you mean the "preferred location" is working for standalone clusters
> also? Because I checked the code of SparkContext and saw comments as below:
>
>   // This is used only by YARN for now, but should be relevant to other
>   // cluster types (Mesos, etc) too. This is typically generated from
>   // InputFormatInfo.computePreferredLocations. It contains a map from
>   // hostname to a list of input format splits on the host.
>   private[spark] var preferredNodeLocationData: Map[String, Set[SplitInfo]] = Map()
>
> BTW, even with the preferred hosts, how does Spark decide how many total
> executors to use for this application?
>
> Thanks again!
>
> ------------------------------
> From: Sandy Ryza [mailto:sandy.r...@cloudera.com]
> Sent: Friday, July 18, 2014 3:44 PM
> To: user@spark.apache.org
> Subject: Re: data locality
>
> Hi Haopu,
>
> Spark will ask HDFS for file block locations and try to assign tasks based
> on these.
>
> There is a snag. Spark schedules its tasks inside of "executor" processes
> that stick around for the lifetime of a Spark application. Spark requests
> executors before it runs any jobs, i.e. before it has any information about
> where the input data for the jobs is located. If the executors occupy
> significantly fewer nodes than exist in the cluster, it can be difficult
> for Spark to achieve data locality. The workaround for this is an API that
> allows passing in a set of preferred locations when instantiating a Spark
> context. This API is currently broken in Spark 1.0, and will likely be
> changed to something a little simpler in a future release.
>
>   val locData = InputFormatInfo.computePreferredLocations(
>     Seq(new InputFormatInfo(conf, classOf[TextInputFormat], new Path("myfile.txt"))))
>
>   val sc = new SparkContext(conf, locData)
>
> -Sandy
>
> On Fri, Jul 18, 2014 at 12:35 AM, Haopu Wang <hw...@qilinsoft.com> wrote:
>
> I have a standalone Spark cluster and an HDFS cluster which share some
> nodes.
>
> When reading an HDFS file, how does Spark assign tasks to nodes? Will it
> ask HDFS for the location of each file block in order to pick the right
> worker node?
>
> How about a Spark cluster on YARN?
>
> Thank you very much!
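A minimal Scala sketch of the first point in Sandy's reply above, that Spark obtains block locations from HDFS and attaches them to tasks as preferred locations. The HDFS path and application name are placeholders, and the context setup assumes the master/deploy settings are supplied via spark-submit:

  import org.apache.spark.{SparkConf, SparkContext}

  // Placeholder app name; master and deploy mode come from spark-submit.
  val conf = new SparkConf().setAppName("locality-check")
  val sc = new SparkContext(conf)

  // textFile builds a HadoopRDD whose partitions map to HDFS input splits.
  val rdd = sc.textFile("hdfs:///path/to/myfile.txt")

  // preferredLocations returns the hosts HDFS reports for each block; the
  // scheduler tries to run each partition's task on one of these hosts.
  rdd.partitions.foreach { p =>
    println(p.index + " -> " + rdd.preferredLocations(p).mkString(", "))
  }

If executors are only running on a subset of the cluster's nodes, these preferred hosts may hold no executor at all, which is the gap the preferredNodeLocationData workaround discussed above tries to close.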