.. Just to be clear my concern is around Iceberg not skipping files. Iceberg does skip rowGroups when scanning files as *iceberg.parquet.ParquetReader* uses the parquet stats under it while skipping, albeit none of these stats come from the manifests.
On Tue, Feb 26, 2019 at 7:24 PM Gautam <[email protected]> wrote: > Hello Devs, > I am looking into leveraging Iceberg to speed up split > generation and to minimize file scans. My understanding was that Iceberg > keeps key statistics as listed under Metrics.java [1] viz. column > lower/upper bounds, nullValues, distinct value counts, etc. and that table > scanning leverages these to skip partitions, files & row-groups (in the > Parquet context). > > What I found is files aren't skipped when a predicate applies only to a > subset of the table's files. Within a partition it will scan all files as > manifests only keep record counts but the rest of the metrics (lower, > upper, distinct value counts, null values) are null / empty. This is coz > AvroFileAppender only keeps `recordCounts` as metrics [2].. And currently > that is the only appender supported for writing manifest files. > > > *Example :* > > In following example iceTable was generated by iteratively adding two > files so it has two separate parquet files under it .. > > scala> iceTable.newScan().planFiles.asScala.foreach(fl => println(fl)) > > BaseFileScanTask{file=file:/usr/local/spark/test/iceberg-people/data/00000-1-8d8c9ecf-e1fa-4bdb-bdb4-1e9b5f4b71dc.parquet, > partition_data=PartitionData{}, residual=true} > BaseFileScanTask{file=file:/usr/local/spark/test/iceberg-people/data/00000-0-82ae5672-20bf-4e76-bf76-130623606a72.parquet, > partition_data=PartitionData{}, residual=true} > > > *Only one file contains row with age = null .. * > > scala> iceDf.show() > 19/02/26 13:30:46 WARN scheduler.TaskSetManager: Stage 3 contains a task > of very large size (113 KB). The maximum recommended task size is 100 KB. > +----+-------+--------------------+ > | age| name| friends| > +----+-------+--------------------+ > | 60| Kannan| [Justin -> 19]| > | 75| Sharon|[Michael -> 30, J...| > |null|Michael| null| > | 30| Andy|[Josh -> 10, Bisw...| > | 19| Justin|[Kannan -> 75, Sa...| > +----+-------+--------------------+ > > > > *Running filter on isNull(age) scans both files .. * > > val isNullExp = Expressions.isNull("age") > val isNullScan = iceTable.newScan().filter(isNullExp) > > scala> isNullScan.planFiles.asScala.foreach(fl => println(fl)) > > BaseFileScanTask{file=file:/usr/local/spark/test/iceberg-people/data/00000-1-8d8c9ecf-e1fa-4bdb-bdb4-1e9b5f4b71dc.parquet, > partition_data=PartitionData{}, residual=is_null(ref(name="age"))} > BaseFileScanTask{file=file:/usr/local/spark/test/iceberg-people/data/00000-0-82ae5672-20bf-4e76-bf76-130623606a72.parquet, > partition_data=PartitionData{}, residual=is_null(ref(name="age"))} > > > > I would expect only one file to be scanned as Iceberg should track > nullValueCounts as per Metrics.java [1] .. The same issue holds for integer > comparison filters scanning too many files. > > When I looked through the code, there is provision for using Parquet file > footer stats to populate Manifest Metrics [3] but this is never used as > Iceberg currently only allows AvroFileAppender for creating manifest files. > > What's the plan around using Parquet footer stats in manifests which can > be very useful during split generation? I saw some discussions around this > in the Iceberg Spec document [4] but couldn't glean if any of those are > actually implemented yet. > > I can work on a proposal PR for adding these in but wanted to know the > current thoughts around this. > > > *Gist for above example *: > https://gist.github.com/prodeezy/fe1b447c78c0bc9dc3be66272341d1a7 > > > Looking forward to your feedback, > > Cheers, > -Gautam. > > > > > > [1] - > https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/com/netflix/iceberg/Metrics.java > [2] - > https://github.com/apache/incubator-iceberg/blob/master/core/src/main/java/com/netflix/iceberg/avro/AvroFileAppender.java#L56 > [3] - > https://github.com/apache/incubator-iceberg/blob/1bec13a954c29f8cd09719a0362c0b2829635c77/parquet/src/main/java/com/netflix/iceberg/parquet/ParquetWriter.java#L118 > [4] - > https://docs.google.com/document/d/1Q-zL5lSCle6NEEdyfiYsXYzX_Q8Qf0ctMyGBKslOswA/edit# > >
