It would be great to get this into Spark master. I think it would make the DSv2 path more valuable before the 3.0 release!
On Fri, Aug 30, 2019 at 9:58 PM Gautam Kowshik <gautamkows...@gmail.com> wrote:

> Super! That’d be great. Lemme know if I can help in any way.
>
> Sent from my iPhone
>
> > On Aug 30, 2019, at 6:30 PM, Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:
> >
> > Hi Gautam,
> >
> > Iceberg does support nested schema pruning but Spark doesn’t request this for DS V2 in 2.4. Internally, we had to modify Spark 2.4 to make this work end-to-end.
> > One of the options is to extend DataSourceV2Strategy with logic similar to what we have in ParquetSchemaPruning in 2.4.0. I think we can share that part if needed.
> >
> > I am planning to check whether Spark master already has this functionality.
> > If that’s not implemented and nobody is working on it yet, I can fix it.
> >
> > - Anton
> >
> >
> >> On 30 Aug 2019, at 15:42, Gautam <gautamkows...@gmail.com> wrote:
> >>
> >> Hello Devs,
> >> I was measuring perf on structs between V1 and V2 datasources. Found that although Iceberg Reader supports `SupportsPushDownRequiredColumns` it doesn't seem to prune nested column projections. I want to be able to prune on nested fields. How does V2 datasource have provision to be able to let Iceberg decide this? The `SupportsPushDownRequiredColumns` mix-in gives the entire struct field even if a sub-field is requested.
> >>
> >> Here's an illustration ..
> >>
> >> scala> spark.sql("select location.lat from iceberg_people_struct").show()
> >> +-------+
> >> |    lat|
> >> +-------+
> >> |   null|
> >> |101.123|
> >> |175.926|
> >> +-------+
> >>
> >>
> >> The pruning gets the entire struct instead of just `location.lat` ..
> >>
> >> public void pruneColumns(StructType newRequestedSchema)
> >>
> >> 19/08/30 16:25:38 WARN Reader: => Prune columns : {
> >>   "type" : "struct",
> >>   "fields" : [ {
> >>     "name" : "location",
> >>     "type" : {
> >>       "type" : "struct",
> >>       "fields" : [ {
> >>         "name" : "lat",
> >>         "type" : "double",
> >>         "nullable" : true,
> >>         "metadata" : { }
> >>       }, {
> >>         "name" : "lon",
> >>         "type" : "double",
> >>         "nullable" : true,
> >>         "metadata" : { }
> >>       } ]
> >>     },
> >>     "nullable" : true,
> >>     "metadata" : { }
> >>   } ]
> >> }
> >>
> >> Is there information I can use in the IcebergSource (or add some) that can be used to prune the exact sub-field here? What's a good way to approach this? For dense/wide struct fields this affects performance significantly.
> >>
> >>
> >> Sample gist: https://gist.github.com/prodeezy/001cf155ff0675be7d307e9f842e1dac
> >>
> >>
> >> thanks and regards,
> >> -Gautam.

--
Ryan Blue
Software Engineer
Netflix
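
For reference, the sub-field pruning asked about in the thread can be sketched independently of Spark. The Java below is only a minimal illustration under stated assumptions, not Spark's or Iceberg's implementation: the class `NestedPruningSketch`, the `prune` method, and the map-based schema (a stand-in for `StructType`) are all hypothetical. Given the `location` struct from the log above and the requested path `location.lat`, it keeps only `lat`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a struct is modelled as a field-name -> type map, where a
// type is either a primitive name (String) or a nested struct (Map).
// This is NOT Spark's StructType; it only illustrates the pruning idea.
final class NestedPruningSketch {

    // Keeps only the fields named by the dotted paths (e.g. "location.lat").
    // Assumes field names themselves contain no dots.
    @SuppressWarnings("unchecked")
    static Map<String, Object> prune(Map<String, Object> schema, List<String> requestedPaths) {
        Map<String, Object> pruned = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : schema.entrySet()) {
            String name = field.getKey();
            Object type = field.getValue();

            boolean wholeFieldRequested = false;
            List<String> subPaths = new ArrayList<>();
            for (String path : requestedPaths) {
                if (path.equals(name)) {
                    wholeFieldRequested = true;
                } else if (path.startsWith(name + ".")) {
                    // Strip the "name." prefix and pass the remainder down.
                    subPaths.add(path.substring(name.length() + 1));
                }
            }

            if (wholeFieldRequested) {
                // The whole column was asked for: keep it unchanged.
                pruned.put(name, type);
            } else if (!subPaths.isEmpty() && type instanceof Map) {
                // Only some sub-fields were asked for: recurse into the struct.
                Map<String, Object> nested = prune((Map<String, Object>) type, subPaths);
                if (!nested.isEmpty()) {
                    pruned.put(name, nested);
                }
            }
            // Otherwise the field was not requested and is dropped.
        }
        return pruned;
    }

    public static void main(String[] args) {
        // Schema taken from the "Prune columns" log: location { lat, lon }.
        Map<String, Object> location = new LinkedHashMap<>();
        location.put("lat", "double");
        location.put("lon", "double");
        Map<String, Object> schema = new LinkedHashMap<>();
        schema.put("location", location);

        // Corresponds to: select location.lat from iceberg_people_struct
        System.out.println(prune(schema, List.of("location.lat")));
        // prints {location={lat=double}}
    }
}
```

A source can only apply this kind of narrowing if the engine hands it the already-pruned schema, or the requested paths, in the first place; as noted in the thread, Spark 2.4 does not do that for DS V2.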