+1 for this feature of column stats projection.

I will add some additional inputs.

1)  In the previous discussion, there are comments on only enabling column
stats that are needed. That is definitely a recommended best practice. But
there are some practical challenges. By default, Iceberg enables column
stats for up to 100 columns. It also may be hard to know all the queries
and what column stats are needed up front. And it is expensive to
retrofit/backfill column stats after the fact, as it requires data file
scan and rewrite.

2) We also have the Trino coordinator run into OOM due to the column stats
taking up a lot of memory space. If the Trino engine can select the column
stats based on join and filter expressions, we can greatly reduce the
memory footprint per split.

Thanks,
Steven


On Thu, Oct 5, 2023 at 3:32 AM Péter Váry <peter.vary.apa...@gmail.com>
wrote:

> Hi Team,
>
> TL;DR: I would like to introduce the possibility to parametrize the
> Iceberg table scans to include the metadata metrics only for
> specific columns.
>
> We discussed this previously on the mailing list [1], but we did not
> finalize the direction there.
>
> *To recap*
> Currently there are two ways to affect which column metrics are returned
> from the metadata by the scan tasks:
>
>    - Decide which column stats are collected/stored -
>    write.metadata.metrics.* write configuration could be used
>    - Decide if the scan should include the column stats -
>    includeColumnStats - scan builder configuration
>
> In our experience this granularity is not enough if the table has many
> columns and multiple readers. Some readers might want to have a specific
> column statistics only, while other readers might need metrics for another
> column.
>
> *Current issue*
> As for the concrete case, I have a PR [1] under review to emit watermarks
> when reading an Iceberg table by the Flink Source. It would be good to
> provide a simple "watermarkFieldName" for the users, which would be used
> for generating the watermark.
> The issue is that for the simple implementation we would need the column
> metadata metrics to extract the watermark for the *FileScanTasks*. If we
> include all of the column statistics, then it could easily cause memory
> issues on the JobManager side. This is why we are reluctant to make this as
> a simple switch on the user side, and might opt for a more heavy extractor
> interface where the user has the responsibility to extract the watermark
> themselves.
> OTOH, if it would be possible to return column metrics only for the given
> column, the memory pressure would be more predictable, and we could have a
> more user-friendly solution.
>
> *Suggested solution*
> I would like to introduce a new method to the Scan class, like:
>
> *ThisT includeColumnStats(Collection<String> columns);*
>
>
> Using this configuration the users of the Scan task could decide which
> column metrics are retained during the planning process, and which column
> metrics are thrown away in the results.
>
> Would the community consider this as a valuable addition to the Scan API?
>
> Thanks,
> Peter
>
> - [1] https://lists.apache.org/thread/9rl8yq8ps3mfg91g1qvzvgd0tnkjvxgg -
> Scan statistics thread
> - [2] https://github.com/apache/iceberg/pull/8553 - Flink: Emit
> watermarks from the IcebergSource
>

Reply via email to