Thanks, Yufei, for taking a look.

Yes, I think adding the min/max values to partition-level statistics would
also work. In fact, it has already been proposed in [1]. However, my concern
is that calculating partition-level min/max values would be an expensive
operation because of row-level delete support ([2]): instead of processing
metadata files, the engine must process data files to get accurate
partition-level min/max values. If we do want partition-level min/max
values, it would be better to have an incremental utility method that reads
in existing results and computes incrementally -- that is, one that skips
recomputing the min/max stats of partitions that haven't changed since the
last snapshot.
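
For illustration, here is a rough sketch of what such an incremental utility
could look like -- everything below (IncrementalPartitionStats,
PartitionMinMax, StatsSource) is hypothetical, not an existing Iceberg API:

  import java.util.HashMap;
  import java.util.Map;
  import java.util.Set;

  public class IncrementalPartitionStats {

    // Hypothetical holder of per-column lower/upper bounds for one partition.
    static final class PartitionMinMax {
    }

    // Hypothetical callback that scans a partition's data files -- the
    // expensive part, required for accuracy under row-level deletes.
    interface StatsSource {
      PartitionMinMax recomputeMinMax(String partition);
    }

    public static Map<String, PartitionMinMax> computeIncrementally(
        Map<String, PartitionMinMax> previousStats, // results from the last snapshot
        Set<String> changedPartitions,              // partitions touched since then
        StatsSource source) {
      // Start from the previous snapshot's results ...
      Map<String, PartitionMinMax> result = new HashMap<>(previousStats);
      // ... and recompute only the partitions that actually changed.
      for (String partition : changedPartitions) {
        result.put(partition, source.recomputeMinMax(partition));
      }
      return result;
    }
  }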

The alternative solution is more lightweight -- store min/max stats per
manifest file (only for manifest_file entries whose content is "data" rather
than "deletes") in the manifest list, so writers can maintain them in place
when they commit to the table. The downside is that the size of the
*manifest list file* will increase; I'm not sure whether that change is too
large to bring into the spec.
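
To make the writer-side change concrete, here is a minimal sketch of folding
each data file's existing per-column bounds into a single manifest-level
bound at commit time. The ManifestBoundsAccumulator type is hypothetical,
and the byte-wise comparison is a simplification -- real comparisons would
have to be type-aware:

  import java.nio.ByteBuffer;
  import java.util.HashMap;
  import java.util.Map;

  public class ManifestBoundsAccumulator {
    // Per-column (field id -> serialized bound) accumulators for one manifest.
    private final Map<Integer, ByteBuffer> lowerBounds = new HashMap<>();
    private final Map<Integer, ByteBuffer> upperBounds = new HashMap<>();

    // Called once per data file entry; fileLower/fileUpper are the per-column
    // bounds Iceberg already tracks for each data file.
    public void add(Map<Integer, ByteBuffer> fileLower,
                    Map<Integer, ByteBuffer> fileUpper) {
      fileLower.forEach((id, v) ->
          lowerBounds.merge(id, v, ManifestBoundsAccumulator::min));
      fileUpper.forEach((id, v) ->
          upperBounds.merge(id, v, ManifestBoundsAccumulator::max));
    }

    // Simplification: byte-wise comparison stands in for type-aware comparison.
    private static ByteBuffer min(ByteBuffer a, ByteBuffer b) {
      return a.compareTo(b) <= 0 ? a : b;
    }

    private static ByteBuffer max(ByteBuffer a, ByteBuffer b) {
      return a.compareTo(b) >= 0 ? a : b;
    }
  }

The manifest list could then carry these merged bounds alongside the existing
per-manifest partition field summaries.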

Thanks,
Xingyuan

[1] https://github.com/apache/iceberg/issues/11083
[2]
https://github.com/apache/iceberg/blob/95497abe5579cf492f24ac8c470c7853d59332e9/core/src/main/java/org/apache/iceberg/PartitionStatsUtil.java#L49

On Thu, Sep 26, 2024 at 2:57 PM Yufei Gu <flyrain...@gmail.com> wrote:

> Hi Xingyuan,
> I've been reviewing the partition statistics file, and it seems that
> adding partition-level min/max values would be a natural fit within the
> Partition Statistics File [1], which is one file per snapshot. We could
> introduce a few new fields to accommodate these values.
>
> While this addition could increase the size of the partition statistics
> file, given that it’s stored in Parquet format, I believe the impact on the
> reader will be minimal. On the writer's side, collecting these metrics may
> require some additional work, but since they are generated asynchronously,
> it should be manageable.
>
> One minor limitation is that the async generation may not always reflect
> the latest snapshot, but overall, I think this approach should work well.
>
> [1] https://iceberg.apache.org/spec/#partition-statistics-file
>
> Yufei
>
>
> On Thu, Sep 26, 2024 at 9:59 AM Xingyuan Lin <linxingyuan1...@gmail.com>
> wrote:
>
>> Hi team,
>>
>> Just bumping this up. What do you think of this? Does the alternative
>> solution make sense or is it too much of a spec change?
>>
>> The goal is to improve the engine CBO's efficiency and effectiveness.
>> Today, it's a fairly expensive operation for an engine CBO to get table stats:
>> https://github.com/trinodb/trino/blob/117c74e54131b47adf8c74ee1b6d98bafd1fe859/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsReader.java#L158-L172.
>> In one of our production environments (Trino 423 + Iceberg 1.3.0), turning
>> on the CBO made some queries' planning time exceptionally long -- even
>> timing out in some cases (exceeding 180 seconds) -- and put memory pressure
>> on the Trino coordinator. The Trino community has made many improvements to
>> the status quo, but a more radical solution would be to reduce the amount
>> of computation needed to get table stats during the CBO process.
>>
>> Thanks,
>> Xingyuan
>>
>> On Mon, Sep 23, 2024 at 8:32 PM Xingyuan Lin <linxingyuan1...@gmail.com>
>> wrote:
>>
>>> Hi Iceberg dev team,
>>>
>>> Cost-based optimizers need these stats:
>>>
>>>    - record count
>>>    - null count
>>>    - number of distinct values (NDVs)
>>>    - min/max values
>>>    - column sizes
>>>
>>> Today, to get these stats, an engine must process manifest files, which
>>> can be an expensive operation when the table is large (it has many data
>>> files, and hence many large manifest files).
>>>
>>> There has been a proposal to add partition-level stats to improve this:
>>> https://github.com/apache/iceberg/issues/8450. However, the current
>>> solution doesn't include min/max values yet. Furthermore, because of
>>> row-level delete support, calculating partition-level min/max values is
>>> non-trivial: to make them accurate, it seems an engine must process the
>>> data files themselves.
>>>
>>> I think an alternative solution is to add these stats at the manifest
>>> file level, so that an engine only needs to process a single manifest
>>> list file to get them. Furthermore, since estimated stats are good enough
>>> for CBOs, engines can ignore manifest files that track delete files.
>>>
>>> Another benefit of this alternative is that updating manifest-file-level
>>> stats is a lightweight, additive operation, and writers can do it when
>>> they ingest data.
>>>
>>> What do you think?
>>>
>>> Thanks,
>>> Xingyuan
>>>
>>
