Hi,

In standalone mode, how can we check that data locality is working as expected when tasks are assigned?
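For reference, here is a minimal sketch of one way to watch the locality level chosen for each task from the driver, using the DeveloperApi SparkListener hook; the app name and the HDFS path are placeholders, and TaskInfo.taskLocality should correspond to the "Locality Level" column in the web UI:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskStart}

    object LocalityCheck {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("locality-check")  // placeholder app name
        val sc = new SparkContext(conf)

        // Print the host and locality level (PROCESS_LOCAL / NODE_LOCAL /
        // RACK_LOCAL / ANY) for every task as it starts.
        sc.addSparkListener(new SparkListener {
          override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = {
            val info = taskStart.taskInfo
            println(s"task ${info.taskId} -> host=${info.host}, locality=${info.taskLocality}")
          }
        })

        // Placeholder input path; replace with a real HDFS file.
        val count = sc.textFile("hdfs:///tmp/myfile.txt").count()
        println(s"count = $count")
        sc.stop()
      }
    }

The same information typically also appears in the driver log lines emitted by TaskSetManager when tasks start, and on the stage pages of the web UI.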
Thanks!

On 23 Jul, 2014, at 12:49 am, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> On standalone there is still special handling for assigning tasks within
> executors. There just isn't special handling for where to place executors,
> because standalone generally places an executor on every node.
>
> On Mon, Jul 21, 2014 at 7:42 PM, Haopu Wang <hw...@qilinsoft.com> wrote:
>
> Sandy,
>
> I just tried the standalone cluster and haven't had a chance to try YARN yet.
> So if I understand correctly, there is *no* special handling of task
> assignment according to the HDFS blocks' locations when Spark is running as a
> *standalone* cluster.
>
> Please correct me if I'm wrong. Thank you for your patience!
>
> From: Sandy Ryza [mailto:sandy.r...@cloudera.com]
> Sent: July 22, 2014 9:47
> To: user@spark.apache.org
> Subject: Re: data locality
>
> This currently only works for YARN. The standalone default is to place an
> executor on every node for every job.
>
> The total number of executors is specified by the user.
>
> -Sandy
>
> On Fri, Jul 18, 2014 at 2:00 AM, Haopu Wang <hw...@qilinsoft.com> wrote:
>
> Sandy,
>
> Do you mean the "preferred location" mechanism works for a standalone cluster
> as well? I checked the code of SparkContext and saw the comments below:
>
> // This is used only by YARN for now, but should be relevant to other
> // cluster types (Mesos, etc) too. This is typically generated from
> // InputFormatInfo.computePreferredLocations. It contains a map from
> // hostname to a list of input format splits on the host.
> private[spark] var preferredNodeLocationData: Map[String, Set[SplitInfo]] = Map()
>
> BTW, even with the preferred hosts, how does Spark decide how many total
> executors to use for this application?
>
> Thanks again!
>
> From: Sandy Ryza [mailto:sandy.r...@cloudera.com]
> Sent: Friday, July 18, 2014 3:44 PM
> To: user@spark.apache.org
> Subject: Re: data locality
>
> Hi Haopu,
>
> Spark will ask HDFS for file block locations and try to assign tasks based on
> these.
>
> There is a snag. Spark schedules its tasks inside of "executor" processes
> that stick around for the lifetime of a Spark application. Spark requests
> executors before it runs any jobs, i.e. before it has any information about
> where the input data for the jobs is located. If the executors occupy
> significantly fewer nodes than exist in the cluster, it can be difficult for
> Spark to achieve data locality. The workaround for this is an API that
> allows passing in a set of preferred locations when instantiating a Spark
> context. This API is currently broken in Spark 1.0, and will likely be
> changed to something a little simpler in a future release.
>
> val locData = InputFormatInfo.computePreferredLocations(
>   Seq(new InputFormatInfo(conf, classOf[TextInputFormat], new Path("myfile.txt"))))
>
> val sc = new SparkContext(conf, locData)
>
> -Sandy
>
> On Fri, Jul 18, 2014 at 12:35 AM, Haopu Wang <hw...@qilinsoft.com> wrote:
>
> I have a standalone Spark cluster and an HDFS cluster which share some of the
> nodes.
>
> When reading an HDFS file, how does Spark assign tasks to nodes? Will it ask
> HDFS for the location of each file block in order to pick the right worker
> node?
>
> How about a Spark cluster on YARN?
>
> Thank you very much!
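As a minimal sketch only (the path is a placeholder, and sc is the SparkContext already available in spark-shell or a driver), the block locations Spark records for each input partition can be listed with RDD.preferredLocations; for an HDFS-backed RDD these should reflect the HDFS block hosts:

    // List the preferred hosts Spark computed for each partition of an input
    // file; for a text file on HDFS these come from the block locations.
    val rdd = sc.textFile("hdfs:///tmp/myfile.txt")
    rdd.partitions.foreach { p =>
      val hosts = rdd.preferredLocations(p)
      println(s"partition ${p.index}: preferred hosts = ${hosts.mkString(", ")}")
    }

Comparing these hosts with the hosts reported for the corresponding tasks (for example via the listener sketch above, or the web UI) is one way to confirm whether NODE_LOCAL assignment is actually happening.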