This currently only works for YARN. The standalone default is to place an executor on every node for every job. The total number of executors is specified by the user.

-Sandy
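(As a rough illustration of "specified by the user": on YARN the executor count and resources are typically given at submit time. The resource values, class name, and jar below are placeholders.)

  spark-submit --master yarn-cluster \
    --num-executors 8 \
    --executor-cores 2 \
    --executor-memory 4g \
    --class com.example.MyApp myapp.jar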
On Fri, Jul 18, 2014 at 2:00 AM, Haopu Wang <hw...@qilinsoft.com> wrote:

> Sandy,
>
> Do you mean the "preferred location" is working for standalone clusters
> also? Because I checked the code of SparkContext and saw comments as below:
>
>   // This is used only by YARN for now, but should be relevant to other
>   // cluster types (Mesos, etc) too. This is typically generated from
>   // InputFormatInfo.computePreferredLocations. It contains a map from
>   // hostname to a list of input format splits on the host.
>   private[spark] var preferredNodeLocationData: Map[String, Set[SplitInfo]] = Map()
>
> BTW, even with the preferred hosts, how does Spark decide how many total
> executors to use for this application?
>
> Thanks again!
>
> ------------------------------
> From: Sandy Ryza [mailto:sandy.r...@cloudera.com]
> Sent: Friday, July 18, 2014 3:44 PM
> To: user@spark.apache.org
> Subject: Re: data locality
>
> Hi Haopu,
>
> Spark will ask HDFS for file block locations and try to assign tasks based
> on these.
>
> There is a snag. Spark schedules its tasks inside of "executor" processes
> that stick around for the lifetime of a Spark application. Spark requests
> executors before it runs any jobs, i.e. before it has any information about
> where the input data for the jobs is located. If the executors occupy
> significantly fewer nodes than exist in the cluster, it can be difficult
> for Spark to achieve data locality. The workaround for this is an API that
> allows passing in a set of preferred locations when instantiating a Spark
> context. This API is currently broken in Spark 1.0, and will likely be
> changed to something a little simpler in a future release.
>
>   val locData = InputFormatInfo.computePreferredLocations(
>     Seq(new InputFormatInfo(conf, classOf[TextInputFormat], new Path("myfile.txt"))))
>
>   val sc = new SparkContext(conf, locData)
>
> -Sandy
>
> On Fri, Jul 18, 2014 at 12:35 AM, Haopu Wang <hw...@qilinsoft.com> wrote:
>
> I have a standalone Spark cluster and an HDFS cluster which share some
> nodes.
>
> When reading an HDFS file, how does Spark assign tasks to nodes? Will it
> ask HDFS for the location of each file block in order to pick the right
> worker node?
>
> How about a Spark cluster on YARN?
>
> Thank you very much!
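A minimal Scala sketch of the first point in Sandy's reply above, that Spark obtains block locations from HDFS and attaches them to tasks as preferred locations. The HDFS path and application name are placeholders, and the context setup assumes the master/deploy settings are supplied via spark-submit:

  import org.apache.spark.{SparkConf, SparkContext}

  // Placeholder app name; master and deploy mode come from spark-submit.
  val conf = new SparkConf().setAppName("locality-check")
  val sc = new SparkContext(conf)

  // textFile builds a HadoopRDD whose partitions map to HDFS input splits.
  val rdd = sc.textFile("hdfs:///path/to/myfile.txt")

  // preferredLocations returns the hosts HDFS reports for each block; the
  // scheduler tries to run each partition's task on one of these hosts.
  rdd.partitions.foreach { p =>
    println(p.index + " -> " + rdd.preferredLocations(p).mkString(", "))
  }

If executors are only running on a subset of the cluster's nodes, these preferred hosts may hold no executor at all, which is the gap the preferredNodeLocationData workaround discussed above tries to close.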