Hi,

In standalone mode, how can we check that data locality is working as expected when tasks are assigned?
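For reference, here is a minimal sketch of one way to watch the locality level chosen for each task from the driver, using the DeveloperApi SparkListener hook; the app name and the HDFS path are placeholders, and TaskInfo.taskLocality should correspond to the "Locality Level" column in the web UI:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskStart}

    object LocalityCheck {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("locality-check")  // placeholder app name
        val sc = new SparkContext(conf)

        // Print the host and locality level (PROCESS_LOCAL / NODE_LOCAL /
        // RACK_LOCAL / ANY) for every task as it starts.
        sc.addSparkListener(new SparkListener {
          override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = {
            val info = taskStart.taskInfo
            println(s"task ${info.taskId} -> host=${info.host}, locality=${info.taskLocality}")
          }
        })

        // Placeholder input path; replace with a real HDFS file.
        val count = sc.textFile("hdfs:///tmp/myfile.txt").count()
        println(s"count = $count")
        sc.stop()
      }
    }

The same information typically also appears in the driver log lines emitted by TaskSetManager when tasks start, and on the stage pages of the web UI.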
Thanks!

On 23 Jul, 2014, at 12:49 am, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> On standalone there is still special handling for assigning tasks within
> executors. There just isn't special handling for where to place executors,
> because standalone generally places an executor on every node.
>
> On Mon, Jul 21, 2014 at 7:42 PM, Haopu Wang <hw...@qilinsoft.com> wrote:
>
> Sandy,
>
> I just tried the standalone cluster and haven't had a chance to try YARN yet.
> So if I understand correctly, there is *no* special handling of task
> assignment according to the HDFS blocks' locations when Spark is running as a
> *standalone* cluster.
>
> Please correct me if I'm wrong. Thank you for your patience!
>
> From: Sandy Ryza [mailto:sandy.r...@cloudera.com]
> Sent: July 22, 2014 9:47
> To: user@spark.apache.org
> Subject: Re: data locality
>
> This currently only works for YARN. The standalone default is to place an
> executor on every node for every job.
>
> The total number of executors is specified by the user.
>
> -Sandy
>
> On Fri, Jul 18, 2014 at 2:00 AM, Haopu Wang <hw...@qilinsoft.com> wrote:
>
> Sandy,
>
> Do you mean the "preferred location" mechanism works for a standalone cluster
> as well? I checked the code of SparkContext and saw the comments below:
>
> // This is used only by YARN for now, but should be relevant to other
> // cluster types (Mesos, etc) too. This is typically generated from
> // InputFormatInfo.computePreferredLocations. It contains a map from
> // hostname to a list of input format splits on the host.
> private[spark] var preferredNodeLocationData: Map[String, Set[SplitInfo]] = Map()
>
> BTW, even with the preferred hosts, how does Spark decide how many total
> executors to use for this application?
>
> Thanks again!
>
> From: Sandy Ryza [mailto:sandy.r...@cloudera.com]
> Sent: Friday, July 18, 2014 3:44 PM
> To: user@spark.apache.org
> Subject: Re: data locality
>
> Hi Haopu,
>
> Spark will ask HDFS for file block locations and try to assign tasks based on
> these.
>
> There is a snag. Spark schedules its tasks inside of "executor" processes
> that stick around for the lifetime of a Spark application. Spark requests
> executors before it runs any jobs, i.e. before it has any information about
> where the input data for the jobs is located. If the executors occupy
> significantly fewer nodes than exist in the cluster, it can be difficult for
> Spark to achieve data locality. The workaround for this is an API that
> allows passing in a set of preferred locations when instantiating a Spark
> context. This API is currently broken in Spark 1.0, and will likely be
> changed to something a little simpler in a future release.
>
> val locData = InputFormatInfo.computePreferredLocations(
>   Seq(new InputFormatInfo(conf, classOf[TextInputFormat], new Path("myfile.txt"))))
>
> val sc = new SparkContext(conf, locData)
>
> -Sandy
>
> On Fri, Jul 18, 2014 at 12:35 AM, Haopu Wang <hw...@qilinsoft.com> wrote:
>
> I have a standalone Spark cluster and an HDFS cluster which share some of the
> nodes.
>
> When reading an HDFS file, how does Spark assign tasks to nodes? Will it ask
> HDFS for the location of each file block in order to pick the right worker
> node?
>
> How about a Spark cluster on YARN?
>
> Thank you very much!
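As a minimal sketch only (the path is a placeholder, and sc is the SparkContext already available in spark-shell or a driver), the block locations Spark records for each input partition can be listed with RDD.preferredLocations; for an HDFS-backed RDD these should reflect the HDFS block hosts:

    // List the preferred hosts Spark computed for each partition of an input
    // file; for a text file on HDFS these come from the block locations.
    val rdd = sc.textFile("hdfs:///tmp/myfile.txt")
    rdd.partitions.foreach { p =>
      val hosts = rdd.preferredLocations(p)
      println(s"partition ${p.index}: preferred hosts = ${hosts.mkString(", ")}")
    }

Comparing these hosts with the hosts reported for the corresponding tasks (for example via the listener sketch above, or the web UI) is one way to confirm whether NODE_LOCAL assignment is actually happening.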