Re: Scan statistics

Ryan Blue Mon, 15 May 2023 09:08:03 -0700

Yes, I agree with Russell. You'd want to push the filter into planning
rather than returning stats. That's why we strip out stats when the file
metadata is copied. It also would be expensive to copy some, but not all of
the file stats. It's better not to store the stats you don't need.
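
For illustration, here is a minimal sketch of that idea: apply the bounds check while iterating and keep only what you need afterwards. FileEntry, retain, and matchingPaths are hypothetical stand-ins invented for this example, not Iceberg's DataFile or Scan API, and the bounds are simplified to long values keyed by column id.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

class StatsPruningSketch {

  // Hypothetical stand-in for a planned data file; NOT Iceberg's DataFile.
  static final class FileEntry {
    final String path;
    final Map<Integer, Long> lowerBounds; // column id -> lower bound
    final Map<Integer, Long> upperBounds; // column id -> upper bound

    FileEntry(String path, Map<Integer, Long> lowerBounds, Map<Integer, Long> upperBounds) {
      this.path = path;
      this.lowerBounds = lowerBounds;
      this.upperBounds = upperBounds;
    }
  }

  // Copy only the stats of the requested column ids, dropping everything else.
  static Map<Integer, Long> retain(Map<Integer, Long> stats, Set<Integer> columnIds) {
    Map<Integer, Long> kept = new HashMap<>();
    for (Integer id : columnIds) {
      Long value = stats.get(id);
      if (value != null) {
        kept.put(id, value);
      }
    }
    return kept;
  }

  // Consume the iterator one entry at a time and keep only the paths of files
  // whose [lower, upper] range for columnId may contain the given value.
  static List<String> matchingPaths(Iterator<FileEntry> files, int columnId, long value) {
    List<String> paths = new ArrayList<>();
    while (files.hasNext()) {
      FileEntry file = files.next();
      Long lower = file.lowerBounds.get(columnId);
      Long upper = file.upperBounds.get(columnId);
      // A missing bound means the file cannot be ruled out.
      boolean mayMatch = (lower == null || lower <= value) && (upper == null || upper >= value);
      if (mayMatch) {
        paths.add(file.path);
      }
    }
    return paths;
  }
}
```

Because each entry is inspected and then dropped, only the surviving paths stay referenced; the full stats maps never need to be held for all files at once.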


What about using the ManifestGroup interface to get finer-grained control
of the planning?

Ryan

On Mon, May 15, 2023 at 8:05 AM Russell Spitzer <[email protected]>
wrote:

> I think currently the recommendation would be to filter the iterator
> rather than pulling the whole object with stats into memory. Is there a
> requirement that all of the DataFiles be pulled into memory before
> filtering?
>
> On Mon, May 15, 2023 at 9:49 AM Péter Váry <[email protected]>
> wrote:
>
>> Hi Team,
>>
>> We have a Flink job where we would like to use the Iceberg File
>> statistics (lowerBounds, upperBounds) during the planning phase.
>>
>> Currently, the Scan can be configured to include the statistics using
>> includeColumnStats [1]. This is an on/off switch, and there is no way to
>> configure it at a finer granularity.
>>
>> Sadly, our table has plenty of columns, and requesting statistics for every
>> column results in GenericDataFile objects whose retained heap is
>> ~100k each. We have a few thousand data files, and requesting statistics for
>> them would add serious extra memory load to our job.
>>
>> I was considering adding a new method to the Scan class like this:
>> ---------
>> ThisT includeColumnStats(Collection<String> columns);
>> ---------
>>
>> Would the community consider this as a valuable addition to the Scan API?
>>
>> Thanks,
>> Peter
>>
>> [1]
>> https://github.com/apache/iceberg/blob/f536c840350bd5628d7c514d2a4719404c9b8ed1/api/src/main/java/org/apache/iceberg/Scan.java#L71-L78
>>
>

-- 
Ryan Blue
Tabular
