We are developing an RDD (and later a DataSource on top of it) to access distributed data in our Spark cluster, and we want to achieve co-location of the tasks working on the data with their source data partitions.

Overriding RDD.getPreferredLocations should be the way to achieve that, so that each RDD partition can indicate on which server it should be processed. Unfortunately, there seems to be no clearly defined convention for how Spark identifies the server on which an Executor is running: with the Mesos cluster manager, Spark tracks Executors by fully qualified host name; with the standalone cluster manager, Spark 1.5 used the IP address, and in Spark 1.6 this seems to have changed to a host name.
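
For illustration, this is roughly the kind of RDD we are writing; MyPartition, fetchRecords and the constructor arguments are placeholders for our actual data source, not real Spark or library APIs:

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // Hypothetical partition type that remembers which server stores its data.
    case class MyPartition(index: Int, dataHost: String) extends Partition

    class MyDistributedRDD(sc: SparkContext, partitionHosts: Seq[String])
      extends RDD[String](sc, Nil) {

      override protected def getPartitions: Array[Partition] =
        partitionHosts.zipWithIndex
          .map { case (host, i) => MyPartition(i, host) }
          .toArray

      // Tell the scheduler where each partition's data lives.
      override protected def getPreferredLocations(split: Partition): Seq[String] =
        Seq(split.asInstanceOf[MyPartition].dataHost)

      override def compute(split: Partition, context: TaskContext): Iterator[String] = {
        val p = split.asInstanceOf[MyPartition]
        fetchRecords(p.dataHost, p.index) // placeholder for reading from our store
      }

      // Stub standing in for the actual remote read.
      private def fetchRecords(host: String, partitionIndex: Int): Iterator[String] =
        Iterator.empty
    }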

The code in the Spark task scheduler that matches preferred locations to Executors does a map lookup, so it requires an exact textual match between the string given as the preferred location and the host name the Executor registered with. If the formats don't match, task locality handling does not work at all, yet there seems to be no "standard" format for the location. So how can one write a custom RDD overriding getPreferredLocations that works without depending on the specifics of a concrete Spark setup?
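
The only workaround we can think of is to advertise every plausible spelling of the location and hope that one of them matches whatever string the cluster manager registered for the Executor, roughly along these lines (candidateLocations is just an illustrative helper of ours, not a Spark API, and we have no guarantee it covers all cluster managers):

    import java.net.InetAddress

    object LocationHints {
      // Return all spellings of a host we can derive locally.
      def candidateLocations(host: String): Seq[String] = {
        val addr = InetAddress.getByName(host)
        Seq(
          host,                      // name as stored in our partition metadata
          addr.getHostAddress,       // dotted-quad IP (standalone mode, Spark 1.5)
          addr.getCanonicalHostName  // fully qualified host name (Mesos, Spark 1.6)
        ).distinct
      }
    }

getPreferredLocations in the RDD sketch above would then return LocationHints.candidateLocations(dataHost) instead of the single stored name, but that is guesswork rather than a contract.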

 

There seems to be no way for user code to access the internal scheduler information that tracks Executors by host. Even the host name reported to a SparkListener in the ExecutorAdded event does not appear to be reliably the same value that the scheduler uses internally for its lookup.
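
For reference, this is how we observe the Executor hosts via the listener API (registered with SparkContext.addSparkListener; the class name is ours):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded}
    import scala.collection.mutable

    // Collects the host names reported for newly added Executors. As noted
    // above, executorInfo.executorHost is not necessarily the exact string
    // the scheduler uses for its locality lookup.
    class ExecutorHostListener extends SparkListener {
      val hosts = mutable.Set.empty[String]

      override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit = {
        hosts += event.executorInfo.executorHost
      }
    }

    // Registration, assuming an existing SparkContext named sc:
    // sc.addSparkListener(new ExecutorHostListener)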

 
Best regards,

Oliver Köth

IBM Deutschland Research & Development GmbH

