Here's a class which lets you provide a function on a row-by-row basis to declare location:
https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/org/apache/spark/cloudera/ParallelizedWithLocalityRDD.scala

It needs to live in o.a.spark because something it depends on is scoped to the Spark packages only. I used it for a PoC of a distcp replacement: each row was a filename, so the preferred location of each row was the server holding the first block of that file:

https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/com/cloudera/spark/cloud/applications/CloudCp.scala#L137

It would be convenient if either the bits of the API I needed were public or the extra RDD code just went in somewhere. It's nothing complicated; rough sketches of this idea and of the FilePartition override suggested below are at the end of this mail.

On Thu, 4 Jun 2020 at 09:31, ZHANG Wei <wezh...@outlook.com> wrote:

> AFAICT, `FileScanRDD` invokes the `FilePartition::preferredLocations()`
> method, which orders candidate hosts by data size, to get the partition's
> preferred locations. If there are other vectors to sort by, I'm wondering
> if here[1] could be a place to add them. Or inheriting class `FilePartition`
> with an overridden `preferredLocations()` might also work.
>
> --
> Cheers,
> -z
>
> [1] https://github.com/apache/spark/blob/a4195d28ae94793b793641f121e21982bf3880d1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L43
>
> On Thu, 4 Jun 2020 06:40:43 +0000
> Nasrulla Khan Haris <nasrulla.k...@microsoft.com.INVALID> wrote:
>
> > Hi Spark developers,
> >
> > I have created a new format extending FileFormat. I see
> > getPreferredLocations is available if a new custom RDD is created. Since
> > FileFormat is based on FileScanRDD, which uses the readFile function to
> > read each partitioned file, is there a way to add the desired
> > preferredLocations?
> >
> > Appreciate your responses.
> >
> > Thanks,
> > NKH
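
For the archives, a minimal sketch of the row-by-row locality idea. This is not the actual ParallelizedWithLocalityRDD from the repo above (which needs Spark-internal APIs); the names LocalityHintedRDD, LocalityHintedPartition and locationOf are made up for illustration, and it puts one element per partition the way the distcp PoC did with filenames.

import scala.reflect.ClassTag

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// One element per partition, as in the distcp PoC where each element was a filename.
class LocalityHintedPartition[T](val index: Int, val value: T) extends Partition

// Hypothetical stand-in for ParallelizedWithLocalityRDD: the caller supplies
// locationOf, e.g. filename => Seq(hostOfFirstBlock(filename)).
class LocalityHintedRDD[T: ClassTag](
    sc: SparkContext,
    data: Seq[T],
    locationOf: T => Seq[String])
  extends RDD[T](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    data.zipWithIndex
      .map { case (v, i) => new LocalityHintedPartition(i, v) }
      .toArray

  // The whole point: the scheduler asks each partition where it would like to run.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    locationOf(split.asInstanceOf[LocalityHintedPartition[T]].value)

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    Iterator.single(split.asInstanceOf[LocalityHintedPartition[T]].value)
}

Usage would look something like new LocalityHintedRDD(sc, filenames, f => Seq(firstBlockHost(f))), where firstBlockHost is whatever up-front lookup you do against FileSystem.getFileBlockLocations.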
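
And a sketch of the FilePartition override ZHANG Wei mentions. Untested; the name HintedFilePartition is invented, and it assumes FilePartition in your Spark version still takes (index, files) and isn't final. Getting the file source scan to plan with these partitions instead of the stock ones is the separate (and harder) part of the question.

import org.apache.spark.sql.execution.datasources.{FilePartition, PartitionedFile}

// Hypothetical subclass: keep the normal file grouping but answer the scheduler
// with explicit host hints instead of the size-ordered block locations.
class HintedFilePartition(
    index: Int,
    files: Array[PartitionedFile],
    hints: Array[String])
  extends FilePartition(index, files) {

  override def preferredLocations(): Array[String] = hints
}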