I agree, this approach makes sense and could align well with the
multi-layer manifest file proposal. Each layer's manifest file could
potentially hold aggregated metrics, which would streamline the process.
However, so far, there have only been offline discussions, and no formal
proposal has been drafted yet. That said, it seems like a strong candidate
for inclusion in the v4 spec.

Yufei


On Thu, Sep 26, 2024 at 4:01 PM Xingyuan Lin <linxingyuan1...@gmail.com>
wrote:

> Thanks Yufei for taking a look.
>
> Yes, I think adding the min/max values to partition-level statistics would
> also work. In fact, it has already been proposed in [1]. However, my
> concern was that calculating partition-level min/max values would be an
> expensive operation because of row-level delete support ([2]): instead of
> processing metadata files, the engine must process data files to get
> accurate partition-level min/max values. If we do want partition-level
> min/max values, it's better to have an incremental utility method that
> reads in existing results and computes only what changed --
> "incrementally" meaning we skip recomputing the min/max stats of
> partitions that haven't changed since the last snapshot.
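>
> A rough sketch of that incremental path (every helper name here --
> readStats, changedPartitions, recomputeMinMaxFromDataFiles, writeStats --
> is hypothetical, not an existing Iceberg API):
>
>   // start from the stats already computed for the previous snapshot
>   Map<StructLike, PartitionStats> stats = readStats(previousStatsFile);
>   // recompute only the partitions touched between the two snapshots;
>   // this is the expensive part, because row-level deletes force a
>   // data-file scan to get accurate min/max values
>   for (StructLike partition : changedPartitions(table, prevSnapshotId, currSnapshotId)) {
>     stats.put(partition, recomputeMinMaxFromDataFiles(table, partition));
>   }
>   writeStats(currSnapshotId, stats);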
>
> The alternative solution is more lightweight -- storing min/max stats per
> manifest file (only for manifest_file entries whose content is "data"
> rather than "deletes"), which writers can do in place when they commit to
> the table. However, the downside is that the size of the *manifest list
> file* will increase. Not sure if this change is too much to bring in.
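>
> For illustration only, the new manifest_file fields could look like the
> following hypothetical additions to org.apache.iceberg.ManifestFile
> (names are placeholders, not part of any spec or API):
>
>   // aggregated by the writer over the manifest's data files at commit
>   // time, keyed by column id like the existing per-file metrics
>   Map<Integer, ByteBuffer> lowerBounds();  // column id -> serialized min
>   Map<Integer, ByteBuffer> upperBounds();  // column id -> serialized max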
>
> Thanks,
> Xingyuan
>
> [1] https://github.com/apache/iceberg/issues/11083
> [2]
> https://github.com/apache/iceberg/blob/95497abe5579cf492f24ac8c470c7853d59332e9/core/src/main/java/org/apache/iceberg/PartitionStatsUtil.java#L49
>
> On Thu, Sep 26, 2024 at 2:57 PM Yufei Gu <flyrain...@gmail.com> wrote:
>
>> Hi Xingyuan,
>> I've been reviewing the partition statistics file, and it seems that
>> adding partition-level min/max values would be a natural fit within the
>> Partition Statistics File [1], which is one file per snapshot. We could
>> introduce a few new fields to accommodate these values.
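>>
>> For example (illustrative only -- these field names are not in the
>> spec), each per-partition row could gain bounds maps keyed by column id:
>>
>>   // hypothetical extra columns in the partition statistics file
>>   Map<Integer, ByteBuffer> lowerBounds;  // column id -> serialized min within the partition
>>   Map<Integer, ByteBuffer> upperBounds;  // column id -> serialized max within the partition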
>>
>> While this addition could increase the size of the partition statistics
>> file, given that it’s stored in Parquet format, I believe the impact on the
>> reader will be minimal. On the writer's side, collecting these metrics may
>> require some additional work, but since they are generated asynchronously,
>> it should be manageable.
>>
>> One minor limitation is that the async generation may not always reflect
>> the latest snapshot, but overall, I think this approach should work well.
>>
>> [1] https://iceberg.apache.org/spec/#partition-statistics-file
>>
>> Yufei
>>
>>
>> On Thu, Sep 26, 2024 at 9:59 AM Xingyuan Lin <linxingyuan1...@gmail.com>
>> wrote:
>>
>>> Hi team,
>>>
>>> Just bumping this up. What do you think of this? Does the alternative
>>> solution make sense or is it too much of a spec change?
>>>
>>> The goal is to improve the efficiency and effectiveness of engine CBOs.
>>> Today, it's a fairly expensive operation for an engine's CBO to get table
>>> stats:
>>> https://github.com/trinodb/trino/blob/117c74e54131b47adf8c74ee1b6d98bafd1fe859/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsReader.java#L158-L172.
>>> In one of our production environments (Trino 423 + Iceberg 1.3.0), turning
>>> on CBO made some queries' planning times exceptionally long, even timing
>>> out in some cases (exceeding 180 seconds), and put memory pressure on the
>>> Trino coordinator. There have been many efforts in the Trino community to
>>> improve the status quo, but a more radical solution would be to reduce the
>>> amount of computation needed to get table stats during the CBO process.
>>>
>>> Thanks,
>>> Xingyuan
>>>
>>> On Mon, Sep 23, 2024 at 8:32 PM Xingyuan Lin <linxingyuan1...@gmail.com>
>>> wrote:
>>>
>>>> Hi Iceberg dev team,
>>>>
>>>> Cost-based optimizers need these stats:
>>>>
>>>>    - record count
>>>>    - null count
>>>>    - number of distinct values (NDVs)
>>>>    - min/max values
>>>>    - column sizes
>>>>
>>>> Today, to get these stats, an engine must process manifest files, which
>>>> can be an expensive operation when the table is large (it has many data
>>>> files and, consequently, many large manifest files).
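>>>>
>>>> To make the cost concrete, this is roughly the read path today with the
>>>> Iceberg Java API (simplified: mergeIntoTableStats is a hypothetical
>>>> accumulator, and a real engine also has to account for delete files):
>>>>
>>>>   // one manifest-list read, then one read per manifest file --
>>>>   // O(#manifests) file reads, each listing many data files
>>>>   for (ManifestFile manifest : table.currentSnapshot().dataManifests(table.io())) {
>>>>     try (ManifestReader<DataFile> reader = ManifestFiles.read(manifest, table.io())) {
>>>>       for (DataFile file : reader) {
>>>>         // per-file metrics: recordCount(), nullValueCounts(),
>>>>         // columnSizes(), lowerBounds(), upperBounds()
>>>>         mergeIntoTableStats(file);
>>>>       }
>>>>     }
>>>>   }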
>>>>
>>>> There has been a proposal to add partition-level stats to improve this:
>>>> https://github.com/apache/iceberg/issues/8450. However, the current
>>>> solution doesn't yet include min/max values. Furthermore, because of
>>>> row-level delete support, calculating partition-level min/max values is a
>>>> non-trivial operation: to make them accurate, an engine must process data
>>>> files, not just metadata.
>>>>
>>>> I think an alternative solution is to add these stats at the manifest-file
>>>> level, so that an engine only needs to process a single manifest list file
>>>> to get them. Furthermore, since estimated stats are good enough for CBOs,
>>>> engines can ignore manifest files that track delete files.
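>>>>
>>>> With that change, the read path could shrink to a single manifest-list
>>>> read. A sketch, where lowerBounds()/upperBounds() are the proposed
>>>> hypothetical fields (not existing ManifestFile methods) and mergeBounds
>>>> is a placeholder accumulator:
>>>>
>>>>   // one manifest-list read, no per-manifest I/O
>>>>   for (ManifestFile manifest : table.currentSnapshot().allManifests(table.io())) {
>>>>     if (manifest.content() == ManifestContent.DATA) {  // skip delete manifests
>>>>       mergeBounds(manifest.lowerBounds(), manifest.upperBounds());  // hypothetical
>>>>     }
>>>>   }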
>>>>
>>>> Another benefit of this alternative solution is that updating
>>>> manifest-file-level stats is a lightweight, additive operation, and
>>>> writers can do it when they ingest data.
>>>>
>>>> What do you think?
>>>>
>>>> Thanks,
>>>> Xingyuan
>>>>
>>>
