Hi Gautam,

Iceberg does support nested schema pruning but Spark doesn’t request this for 
DS V2 in 2.4. Internally, we had to modify Spark 2.4 to make this work 
end-to-end.
One of the options is to extend DataSourceV2Strategy with logic similar to what 
we have in ParquetSchemaPruning in 2.4.0. I think we can share that part if 
needed.

I am planning to check whether Spark master already has this functionality.
If that’s not implemented and nobody is working on it yet, I can fix it.

- Anton


> On 30 Aug 2019, at 15:42, Gautam <gautamkows...@gmail.com> wrote:
> 
> Hello Devs,
>                     I was measuring perf on structs between V1 and V2 
> datasources. Found that although Iceberg Reader supports 
> `SupportsPushDownRequiredColumns` it doesn't seem to prune nested column 
> projections. I want to be able to prune on nested fields. How does V2 
> datasource have provision to be able to let Iceberg decide this? The 
> `SupportsPushDownRequiredColumns` mix-in gives the entire struct field even 
> if a sub-field is requested.
> 
> Here's an illustration .. 
> 
> scala> spark.sql("select location.lat from iceberg_people_struct").show()
> +-------+
> |    lat|
> +-------+
> |   null|
> |101.123|
> |175.926|
> +-------+
> 
> 
> The pruning gets the entire struct instead of just `location.lat`  ..
> 
> public void pruneColumns(StructType newRequestedSchema) 
> 
> 19/08/30 16:25:38 WARN Reader: => Prune columns : {
>   "type" : "struct",
>   "fields" : [ {
>     "name" : "location",
>     "type" : {
>       "type" : "struct",
>       "fields" : [ {
>         "name" : "lat",
>         "type" : "double",
>         "nullable" : true,
>         "metadata" : { }
>       }, {
>         "name" : "lon",
>         "type" : "double",
>         "nullable" : true,
>         "metadata" : { }
>       } ]
>     },
>     "nullable" : true,
>     "metadata" : { }
>   } ]
> }
> 
> Is there information I can use in the IcebergSource (or add some) that can be 
> used to prune the exact sub-field here?  What's a good way to approach this? 
> For dense/wide struct fields this affects performance significantly.
>  
> 
> Sample gist: https://gist.github.com/prodeezy/001cf155ff0675be7d307e9f842e1dac
> 
> 
> thanks and regards,
> -Gautam.

Reply via email to