I think the current recommendation would be to filter the iterator rather
than pulling every object, along with its stats, into memory. Is there a
requirement that all of the DataFiles be held in memory before
filtering?
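
To illustrate what I mean by filtering the iterator, here is a minimal
sketch. It does not use the Iceberg API: FileEntry, SlimEntry and plan()
are hypothetical stand-ins I made up for this mail, and the bounds are
simplified to Long values keyed by column name. The idea is to consume
the planned files one at a time and keep only the path and the bounds of
the one column the planner needs.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Illustrative only: FileEntry and SlimEntry are hypothetical stand-ins,
// not Iceberg classes.
class LazyStatsFilter {

    // Stand-in for a data file that carries bounds for every column.
    static final class FileEntry {
        final String path;
        final Map<String, Long> lowerBounds;
        final Map<String, Long> upperBounds;

        FileEntry(String path, Map<String, Long> lowerBounds, Map<String, Long> upperBounds) {
            this.path = path;
            this.lowerBounds = lowerBounds;
            this.upperBounds = upperBounds;
        }
    }

    // What the planner retains: the path plus the bounds of a single column.
    static final class SlimEntry {
        final String path;
        final Long lower; // null when the file has no stats for the column
        final Long upper; // null when the file has no stats for the column

        SlimEntry(String path, Long lower, Long upper) {
            this.path = path;
            this.lower = lower;
            this.upper = upper;
        }
    }

    // Walks the iterator one element at a time and never collects the full
    // FileEntry objects. Whether they can be garbage collected right away
    // depends on the source not holding on to them.
    static List<SlimEntry> plan(Iterator<FileEntry> files, String column, long min, long max) {
        List<SlimEntry> kept = new ArrayList<>();
        while (files.hasNext()) {
            FileEntry file = files.next();
            Long lower = file.lowerBounds.get(column);
            Long upper = file.upperBounds.get(column);
            // Missing stats cannot rule a file out, so such files are kept.
            boolean unknown = lower == null || upper == null;
            if (unknown || (lower <= max && upper >= min)) {
                kept.add(new SlimEntry(file.path, lower, upper));
            }
        }
        return kept;
    }
}
```

With this shape the full per-column maps are only reachable from my code
while a single entry is being inspected, so the retained heap grows with
the number of matching files times the size of a SlimEntry, not with the
number of columns in the table.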

On Mon, May 15, 2023 at 9:49 AM Péter Váry <peter.vary.apa...@gmail.com>
wrote:

> Hi Team,
>
> We have a Flink job where we would like to use the Iceberg File statistics
> (lowerBounds, upperBounds) during the planning phase.
>
> Currently it is possible to parameterize the Scan to include the
> statistics using includeColumnStats [1]. This is an on/off switch, and
> there is currently no way to configure it at a finer granularity.
>
> Sadly our table has plenty of columns, and requesting statistics for every
> column results in GenericDataFile objects with a retained heap of
> ~100k each. We have a few thousand data files, and requesting statistics for
> them would add serious extra memory load to our job.
>
> I was considering adding a new method to the Scan class like this:
> ---------
> ThisT includeColumnStats(Collection<String> columns);
> ---------
>
> Would the community consider this as a valuable addition to the Scan API?
>
> Thanks,
> Peter
>
> [1]
> https://github.com/apache/iceberg/blob/f536c840350bd5628d7c514d2a4719404c9b8ed1/api/src/main/java/org/apache/iceberg/Scan.java#L71-L78
>
