Re: [Spark 3.0] DataSourceV2 FileScan - Hive style partition pruning

Gengliang Mon, 30 Dec 2019 13:17:16 -0800

Hi Guy,

Thanks for reporting the issue. I am working on it and there will be a PR
this week.


Gengliang

On Mon, Dec 30, 2019 at 6:41 AM Guy Khazma <guy.kha...@ibm.com> wrote:

> Hi,
>
> It seems that Hive-style partition pruning is not working for file-based
> DataSourceV2 sources such as Parquet and ORC.
> This causes serious performance degradation for non-Hive tables.
>
> The reason is that the FileScan abstract class
> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala>
> is not aware of the partition and data filters.
> The method for getting the selectedPartitions calls the FileIndex listFiles
> method with an empty sequence for both; see
> <https://github.com/apache/spark/blob/5af77410bbb970059d9365b193987e0e44827c20/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala#L74>.
>
> In the v1 data source, the FileSourceScanExec class
> <https://github.com/apache/spark/blob/5af77410bbb970059d9365b193987e0e44827c20/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L160>
> gets the partition and data filters and uses them to filter out
> unnecessary partitions by passing them to the listFiles function; see
> <https://github.com/apache/spark/blob/5af77410bbb970059d9365b193987e0e44827c20/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L210>.
>
> Are there any ongoing plans to add support for that?
>
> Thanks,
> Guy
>
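
For readers following the thread, the difference described above can be
illustrated with a small toy model. This is only a sketch in plain Python,
not Spark code: ToyFileIndex, PartitionDirectory (as defined here) and
list_files are hypothetical stand-ins that mimic the partition-filter half
of FileIndex.listFiles, and data filters are left out entirely. It assumes
a table partitioned by a single column named dt.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-ins for illustration only; these are not Spark classes.
PartitionValues = Dict[str, str]
PartitionFilter = Callable[[PartitionValues], bool]


@dataclass
class PartitionDirectory:
    # Values parsed from a Hive-style path such as .../dt=2019-12-30/
    values: PartitionValues
    files: List[str]


class ToyFileIndex:
    def __init__(self, partitions: Sequence[PartitionDirectory]) -> None:
        self.partitions = list(partitions)

    def list_files(
        self, partition_filters: Sequence[PartitionFilter]
    ) -> List[PartitionDirectory]:
        # all() over an empty sequence is True, so passing no filters
        # keeps every partition: nothing is pruned.
        return [
            p
            for p in self.partitions
            if all(f(p.values) for f in partition_filters)
        ]


index = ToyFileIndex(
    [
        PartitionDirectory({"dt": "2019-12-29"}, ["dt=2019-12-29/part-0.parquet"]),
        PartitionDirectory(
            {"dt": "2019-12-30"},
            ["dt=2019-12-30/part-0.parquet", "dt=2019-12-30/part-1.parquet"],
        ),
    ]
)

# Behaviour reported for the v2 FileScan: empty filters, every partition listed.
unpruned = index.list_files([])

# Behaviour described for v1: the partition filter is passed down, so only
# matching directories are listed.
pruned = index.list_files([lambda v: v["dt"] == "2019-12-30"])
```

In this model the empty-filter call returns both partition directories,
while the filtered call returns only the dt=2019-12-30 directory, which is
the pruning the original report asks for in the v2 path.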
