Hi Sun Rui, thanks for answering!
> Although the Spark task scheduler is aware of rack-level data locality, it
> seems that only YARN implements the support for it.

This explains why the script that I configured via topology.script.file.name
in core-site.xml is not called by the Spark container. But when a Spark
program reads from HDFS, the script is called in my HDFS namenode container.

> However, node-level locality can still work for Standalone.

I have a couple of physical hosts, each running a Spark container and an
HDFS container. How does Spark standalone know that the Spark and HDFS
containers are on the same physical host?

> Data Locality involves both task data locality and executor data locality.
> Executor data locality is only supported on YARN with executor dynamic
> allocation enabled. For standalone, by default, a Spark application will
> acquire all available cores in the cluster, generally meaning there is at
> least one executor on each node, in which case task data locality can work
> because a task can be dispatched to an executor on any of the preferred
> nodes of the task for execution.
>
> For your case, have you set spark.cores.max to limit the cores to acquire,
> which means executors are available on a subset of the cluster nodes?

I set "--total-executor-cores 1" in order to use only a small subset of the
cluster. I have appended the relevant bits of my configuration below the
quoted mails, in case it helps.

On 28.12.2016 02:58, Sun Rui wrote:
> Although the Spark task scheduler is aware of rack-level data locality, it
> seems that only YARN implements the support for it. However, node-level
> locality can still work for Standalone.
>
> It is not necessary to copy the Hadoop config files into the Spark conf
> directory. Set HADOOP_CONF_DIR to point to the conf directory of your
> Hadoop.
>
> Data Locality involves both task data locality and executor data locality.
> Executor data locality is only supported on YARN with executor dynamic
> allocation enabled. For standalone, by default, a Spark application will
> acquire all available cores in the cluster, generally meaning there is at
> least one executor on each node, in which case task data locality can work
> because a task can be dispatched to an executor on any of the preferred
> nodes of the task for execution.
>
> For your case, have you set spark.cores.max to limit the cores to acquire,
> which means executors are available on a subset of the cluster nodes?
>
>> On Dec 27, 2016, at 01:39, Karamba <phantom...@web.de> wrote:
>>
>> Hi,
>>
>> I am running a couple of docker hosts, each with an HDFS node and a
>> Spark worker in a Spark standalone cluster.
>> In order to get data locality awareness, I would like to configure racks
>> for each host, so that a Spark worker container knows from which HDFS
>> node container it should load its data. Does this make sense?
>>
>> I configured the HDFS container nodes via core-site.xml in
>> $HADOOP_HOME/etc, and this works: hdfs dfsadmin -printTopology shows my
>> setup.
>>
>> I configured Spark the same way: I placed core-site.xml and
>> hdfs-site.xml in SPARK_CONF_DIR ... BUT this has no effect.
>>
>> A Spark job submitted via spark-submit to the Spark master that loads
>> data from HDFS only ever shows data locality ANY.
>>
>> It would be great if anybody could help me get the right configuration!
>>
>> Thanks and best regards,
>> on
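For reference, this is roughly how the rack mapping is wired up on the HDFS
side in my setup; the script path, hostnames, IPs and rack names below are
just placeholders from my test environment:

    <!-- core-site.xml (HDFS side) -->
    <property>
      <name>topology.script.file.name</name>
      <value>/etc/hadoop/topology.sh</value>
    </property>

    #!/bin/bash
    # /etc/hadoop/topology.sh
    # Hadoop invokes this script with one or more IPs/hostnames as
    # arguments and expects one rack path per argument, one per line.
    for host in "$@"; do
      case "$host" in
        hdfs-node1|10.0.0.11) echo "/rack1" ;;
        hdfs-node2|10.0.0.12) echo "/rack2" ;;
        *)                    echo "/default-rack" ;;
      esac
    done

This matches what hdfs dfsadmin -printTopology shows me, so the HDFS side
looks fine.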
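Regarding my question above about containers on the same host: as far as I
understand, Spark standalone can only establish node-level locality when the
executor's hostname matches a hostname in the block locations that the HDFS
namenode reports, so the Spark worker container and the datanode container
on the same physical host would have to present the same hostname. A sketch
of what I mean, with made-up image and container names:

    # Run both containers on the host network so that both report the
    # physical host's name to the Spark master and the HDFS namenode.
    docker run -d --net=host --name hdfs-datanode my-hdfs-image
    docker run -d --net=host --name spark-worker  my-spark-image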
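And this is how I currently submit the job; the master URL and the paths are
specific to my setup:

    # Point Spark at the existing Hadoop configuration instead of copying
    # core-site.xml and hdfs-site.xml into SPARK_CONF_DIR.
    export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop

    # Restrict the application to a single core, so executors end up on
    # only a subset of the cluster nodes.
    spark-submit \
      --master spark://spark-master:7077 \
      --total-executor-cores 1 \
      my_job.py

If I read your explanation correctly, with this setting tasks can only be
NODE_LOCAL when the single executor happens to land on a node that holds a
replica of the data; otherwise the locality drops to ANY.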