Here's a class which lets you provide a function on a row-by-row basis to declare location:
https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/org/apache/spark/cloudera/ParallelizedWithLocalityRDD.scala

It needs to live in o.a.spark because something it depends on is scoped to the Spark packages only. I used it for a PoC of a distcp replacement: each row was a filename, so the preferred location of each row was the server holding the first block of that file:

https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/com/cloudera/spark/cloud/applications/CloudCp.scala#L137

It would be convenient if either the bits of the API I needed were public or the extra RDD code just went in somewhere. It's nothing complicated; rough sketches of this idea and of the FilePartition override suggested below are at the end of this mail.

On Thu, 4 Jun 2020 at 09:31, ZHANG Wei <wezh...@outlook.com> wrote:

> AFAICT, `FileScanRDD` invokes the `FilePartition::preferredLocations()`
> method, which orders candidate hosts by data size, to get the partition's
> preferred locations. If there are other vectors to sort by, I'm wondering
> if here[1] could be a place to add them. Or inheriting class `FilePartition`
> with an overridden `preferredLocations()` might also work.
>
> --
> Cheers,
> -z
>
> [1] https://github.com/apache/spark/blob/a4195d28ae94793b793641f121e21982bf3880d1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L43
>
> On Thu, 4 Jun 2020 06:40:43 +0000
> Nasrulla Khan Haris <nasrulla.k...@microsoft.com.INVALID> wrote:
>
> > Hi Spark developers,
> >
> > I have created a new format extending FileFormat. I see
> > getPreferredLocations is available if a new custom RDD is created. Since
> > FileFormat is based on FileScanRDD, which uses the readFile function to
> > read each partitioned file, is there a way to add the desired
> > preferredLocations?
> >
> > Appreciate your responses.
> >
> > Thanks,
> > NKH
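
For the archives, a minimal sketch of the row-by-row locality idea. This is not the actual ParallelizedWithLocalityRDD from the repo above (which needs Spark-internal APIs); the names LocalityHintedRDD, LocalityHintedPartition and locationOf are made up for illustration, and it puts one element per partition the way the distcp PoC did with filenames.

import scala.reflect.ClassTag

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// One element per partition, as in the distcp PoC where each element was a filename.
class LocalityHintedPartition[T](val index: Int, val value: T) extends Partition

// Hypothetical stand-in for ParallelizedWithLocalityRDD: the caller supplies
// locationOf, e.g. filename => Seq(hostOfFirstBlock(filename)).
class LocalityHintedRDD[T: ClassTag](
    sc: SparkContext,
    data: Seq[T],
    locationOf: T => Seq[String])
  extends RDD[T](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    data.zipWithIndex
      .map { case (v, i) => new LocalityHintedPartition(i, v) }
      .toArray

  // The whole point: the scheduler asks each partition where it would like to run.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    locationOf(split.asInstanceOf[LocalityHintedPartition[T]].value)

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    Iterator.single(split.asInstanceOf[LocalityHintedPartition[T]].value)
}

Usage would look something like new LocalityHintedRDD(sc, filenames, f => Seq(firstBlockHost(f))), where firstBlockHost is whatever up-front lookup you do against FileSystem.getFileBlockLocations.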
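
And a sketch of the FilePartition override ZHANG Wei mentions. Untested; the name HintedFilePartition is invented, and it assumes FilePartition in your Spark version still takes (index, files) and isn't final. Getting the file source scan to plan with these partitions instead of the stock ones is the separate (and harder) part of the question.

import org.apache.spark.sql.execution.datasources.{FilePartition, PartitionedFile}

// Hypothetical subclass: keep the normal file grouping but answer the scheduler
// with explicit host hints instead of the size-ordered block locations.
class HintedFilePartition(
    index: Int,
    files: Array[PartitionedFile],
    hints: Array[String])
  extends FilePartition(index, files) {

  override def preferredLocations(): Array[String] = hints
}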