Scan statistics

Péter Váry Mon, 15 May 2023 07:49:43 -0700

Hi Team,

We have a Flink job where we would like to use the Iceberg file statistics
(lowerBounds, upperBounds) during the planning phase.


Currently, a Scan can be configured to include the statistics using
includeColumnStats [1]. This is an on/off switch, however, and there is no
way to configure it at a finer granularity.

Sadly, our table has plenty of columns, and requesting statistics for every
column results in GenericDataFile objects whose retained heap is ~100k
each. We have a few thousand data files, and requesting statistics for all
of them would add a serious extra memory load to our job.

I was considering adding a new method to the Scan class like this:
---------
ThisT includeColumnStats(Collection<String> columns);
---------
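
A minimal sketch of how such column-level filtering might behave, assuming
the statistics are held as maps keyed by field ID (as lowerBounds and
upperBounds are). The ColumnStatsFilter class and its methods are
hypothetical and only illustrate the idea; they are not part of the Iceberg
API, and the name-to-ID map stands in for a schema lookup:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the Iceberg API: keeps only the
// statistics entries that belong to the requested columns.
public class ColumnStatsFilter {

  // Resolve the requested column names to field IDs.
  // Assumption: names that are not in the map are silently ignored.
  public static Set<Integer> resolveIds(
      Map<String, Integer> nameToId, Collection<String> columns) {
    Set<Integer> ids = new HashSet<>();
    for (String column : columns) {
      Integer id = nameToId.get(column);
      if (id != null) {
        ids.add(id);
      }
    }
    return ids;
  }

  // Copy only the entries whose field ID was requested, so the
  // statistics of all other columns can be garbage collected.
  public static <V> Map<Integer, V> retain(Map<Integer, V> stats, Set<Integer> ids) {
    if (stats == null) {
      return null;
    }
    Map<Integer, V> filtered = new HashMap<>();
    for (Map.Entry<Integer, V> entry : stats.entrySet()) {
      if (ids.contains(entry.getKey())) {
        filtered.put(entry.getKey(), entry.getValue());
      }
    }
    return filtered;
  }

  public static void main(String[] args) {
    // Stand-in for the table schema: column name -> field ID.
    Map<String, Integer> nameToId = new HashMap<>();
    nameToId.put("id", 1);
    nameToId.put("event_time", 2);
    nameToId.put("payload", 3);

    // Stand-in for the lowerBounds map of a single data file.
    Map<Integer, ByteBuffer> lowerBounds = new HashMap<>();
    lowerBounds.put(1, ByteBuffer.allocate(8));
    lowerBounds.put(2, ByteBuffer.allocate(8));
    lowerBounds.put(3, ByteBuffer.allocate(64));

    Set<Integer> ids = resolveIds(nameToId, Arrays.asList("event_time"));
    Map<Integer, ByteBuffer> filtered = retain(lowerBounds, ids);
    System.out.println("retained field IDs: " + filtered.keySet());
  }
}
```

With a filter like this applied to each stats map, the per-file memory cost
would scale with the number of requested columns rather than with the full
width of the table.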

Would the community consider this as a valuable addition to the Scan API?

Thanks,
Peter

[1]
https://github.com/apache/iceberg/blob/f536c840350bd5628d7c514d2a4719404c9b8ed1/api/src/main/java/org/apache/iceberg/Scan.java#L71-L78
