I think the current recommendation would be to filter the iterator rather than pulling every object, stats included, into memory. Is there a requirement that all of the DataFiles be held in memory before filtering?
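A minimal sketch of what filtering the iterator could look like, in case it helps. It assumes the planning code only needs the paths of the files whose bounds may match a value. FileStats, mayContain and matchingPaths are hypothetical stand-ins written for this example, not Iceberg classes; in a real job the iterator would come from the scan's planned files, and the bounds would be ByteBuffers keyed by field id rather than plain longs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

class StatsFilterSketch {
    // Hypothetical stand-in for a data file entry carrying per-column bounds,
    // keyed by field id. This is NOT an Iceberg class.
    static final class FileStats {
        final String path;
        final Map<Integer, Long> lowerBounds;
        final Map<Integer, Long> upperBounds;

        FileStats(String path, Map<Integer, Long> lowerBounds, Map<Integer, Long> upperBounds) {
            this.path = path;
            this.lowerBounds = lowerBounds;
            this.upperBounds = upperBounds;
        }
    }

    // True when the file's [lower, upper] range for fieldId may contain value.
    // A missing bound is treated as "may match", so the file is kept.
    static boolean mayContain(FileStats f, int fieldId, long value) {
        Long lo = f.lowerBounds.get(fieldId);
        Long hi = f.upperBounds.get(fieldId);
        return (lo == null || lo <= value) && (hi == null || value <= hi);
    }

    // Walks the iterator once and retains only the paths of matching files.
    // Each stats-heavy object can be collected as soon as it has been checked.
    static List<String> matchingPaths(Iterator<FileStats> files, int fieldId, long value) {
        List<String> paths = new ArrayList<>();
        while (files.hasNext()) {
            FileStats f = files.next();
            if (mayContain(f, fieldId, value)) {
                paths.add(f.path);
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        List<FileStats> files = Arrays.asList(
            new FileStats("a.parquet", Collections.singletonMap(1, 0L), Collections.singletonMap(1, 10L)),
            new FileStats("b.parquet", Collections.singletonMap(1, 20L), Collections.singletonMap(1, 30L)),
            new FileStats("c.parquet", Collections.singletonMap(1, 5L), Collections.singletonMap(1, 25L)));
        // Prints [a.parquet, c.parquet]
        System.out.println(matchingPaths(files.iterator(), 1, 7L));
    }
}
```

The point is that only one stats-carrying object is reachable at a time, so the retained heap is bounded by what you choose to keep (here just the path), not by the number of files times the number of columns.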
On Mon, May 15, 2023 at 9:49 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:

> Hi Team,
>
> We have a Flink job where we would like to use the Iceberg File statistics
> (lowerBounds, upperBounds) during the planning phase.
>
> Currently it is possible to parameterize the Scan to include the
> statistics using the includeColumnStats [1]. This is an on/off switch, but
> currently there is no way to configure this on a finer granularity.
>
> Sadly our table has plenty of columns and requesting statistics for every
> column will result in GenericDataFiles objects where the retained heap is
> ~100k each. We have a few thousand data files and requesting statistics for
> them would add serious extra memory load to our job.
>
> I was considering adding a new method to the Scan class like this:
> ---------
> ThisT includeColumnStats(Collection<String> columns);
> ---------
>
> Would the community consider this as a valuable addition to the Scan API?
>
> Thanks,
> Peter
>
> [1]
> https://github.com/apache/iceberg/blob/f536c840350bd5628d7c514d2a4719404c9b8ed1/api/src/main/java/org/apache/iceberg/Scan.java#L71-L78
>