Hi Sun Rui, thanks for answering!
> Although the Spark task scheduler is aware of rack-level data locality, it
> seems that only YARN implements the support for it.

This explains why the script that I configured via topology.script.file.name
in core-site.xml is not called by the Spark container. But when a Spark
program reads from HDFS, the script is called in my HDFS namenode container.

> However, node-level locality can still work for Standalone.

I have a couple of physical hosts, each running a Spark container and an
HDFS container. How does Spark standalone know that the Spark and HDFS
containers are on the same physical host?

> Data Locality involves both task data locality and executor data locality.
> Executor data locality is only supported on YARN with executor dynamic
> allocation enabled. For standalone, by default, a Spark application will
> acquire all available cores in the cluster, generally meaning there is at
> least one executor on each node, in which case task data locality can work
> because a task can be dispatched to an executor on any of the preferred
> nodes of the task for execution.
>
> For your case, have you set spark.cores.max to limit the cores to acquire,
> which means executors are available on a subset of the cluster nodes?

I set "--total-executor-cores 1" in order to use only a small subset of the
cluster. I have appended the relevant bits of my configuration below the
quoted mails, in case it helps.

On 28.12.2016 02:58, Sun Rui wrote:
> Although the Spark task scheduler is aware of rack-level data locality, it
> seems that only YARN implements the support for it. However, node-level
> locality can still work for Standalone.
>
> It is not necessary to copy the Hadoop config files into the Spark conf
> directory. Set HADOOP_CONF_DIR to point to the conf directory of your
> Hadoop.
>
> Data Locality involves both task data locality and executor data locality.
> Executor data locality is only supported on YARN with executor dynamic
> allocation enabled. For standalone, by default, a Spark application will
> acquire all available cores in the cluster, generally meaning there is at
> least one executor on each node, in which case task data locality can work
> because a task can be dispatched to an executor on any of the preferred
> nodes of the task for execution.
>
> For your case, have you set spark.cores.max to limit the cores to acquire,
> which means executors are available on a subset of the cluster nodes?
>
>> On Dec 27, 2016, at 01:39, Karamba <phantom...@web.de> wrote:
>>
>> Hi,
>>
>> I am running a couple of docker hosts, each with an HDFS node and a
>> Spark worker in a Spark standalone cluster.
>> In order to get data locality awareness, I would like to configure racks
>> for each host, so that a Spark worker container knows from which HDFS
>> node container it should load its data. Does this make sense?
>>
>> I configured the HDFS container nodes via core-site.xml in
>> $HADOOP_HOME/etc, and this works: hdfs dfsadmin -printTopology shows my
>> setup.
>>
>> I configured Spark the same way: I placed core-site.xml and
>> hdfs-site.xml in SPARK_CONF_DIR ... BUT this has no effect.
>>
>> A Spark job submitted via spark-submit to the Spark master that loads
>> data from HDFS only ever shows data locality ANY.
>>
>> It would be great if anybody could help me get the right configuration!
>>
>> Thanks and best regards,
>> on
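For reference, this is roughly how the rack mapping is wired up on the HDFS
side in my setup; the script path, hostnames, IPs and rack names below are
just placeholders from my test environment:

    <!-- core-site.xml (HDFS side) -->
    <property>
      <name>topology.script.file.name</name>
      <value>/etc/hadoop/topology.sh</value>
    </property>

    #!/bin/bash
    # /etc/hadoop/topology.sh
    # Hadoop invokes this script with one or more IPs/hostnames as
    # arguments and expects one rack path per argument, one per line.
    for host in "$@"; do
      case "$host" in
        hdfs-node1|10.0.0.11) echo "/rack1" ;;
        hdfs-node2|10.0.0.12) echo "/rack2" ;;
        *)                    echo "/default-rack" ;;
      esac
    done

This matches what hdfs dfsadmin -printTopology shows me, so the HDFS side
looks fine.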
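Regarding my question above about containers on the same host: as far as I
understand, Spark standalone can only establish node-level locality when the
executor's hostname matches a hostname in the block locations that the HDFS
namenode reports, so the Spark worker container and the datanode container
on the same physical host would have to present the same hostname. A sketch
of what I mean, with made-up image and container names:

    # Run both containers on the host network so that both report the
    # physical host's name to the Spark master and the HDFS namenode.
    docker run -d --net=host --name hdfs-datanode my-hdfs-image
    docker run -d --net=host --name spark-worker  my-spark-image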
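And this is how I currently submit the job; the master URL and the paths are
specific to my setup:

    # Point Spark at the existing Hadoop configuration instead of copying
    # core-site.xml and hdfs-site.xml into SPARK_CONF_DIR.
    export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop

    # Restrict the application to a single core, so executors end up on
    # only a subset of the cluster nodes.
    spark-submit \
      --master spark://spark-master:7077 \
      --total-executor-cores 1 \
      my_job.py

If I read your explanation correctly, with this setting tasks can only be
NODE_LOCAL when the single executor happens to land on a node that holds a
replica of the data; otherwise the locality drops to ANY.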