Good to know about the multi-layer proposal. Could you share a link to it
if one exists? I will also draft a short proposal on the manifest-level
stats topic in a Google doc so that folks can review and comment.

Thank you Yufei for your time and input.

On Thu, Sep 26, 2024 at 4:18 PM Yufei Gu <flyrain...@gmail.com> wrote:

> I agree, this approach makes sense and could align well with the
> multi-layer manifest file proposal. Each layer's manifest file could
> potentially hold aggregated metrics, which would streamline the process.
> However, so far, there have only been offline discussions, and no formal
> proposal has been drafted yet. That said, it seems like a strong candidate
> for inclusion in the v4 spec.
>
> Yufei
>
>
> On Thu, Sep 26, 2024 at 4:01 PM Xingyuan Lin <linxingyuan1...@gmail.com>
> wrote:
>
>> Thanks Yufei for taking a look.
>>
>> Yes, I think adding the min/max values to partition-level statistics would
>> also work. In fact, it has already been proposed in [1]. However, my concern
>> is that calculating partition-level min/max values would be an expensive
>> operation because of row-level delete support ([2]). Instead of processing
>> only metadata files, the engine must process data files to get accurate
>> partition-level min/max values. If we do want partition-level min/max
>> values, it would be better to have an incremental utility method that reads
>> in existing results and computes incrementally -- "incrementally" meaning
>> that the min/max stats of partitions unchanged since the last snapshot are
>> not recomputed.
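To make the incremental idea concrete, here is a minimal Python sketch. The data structures and the `recompute_fn` callback are hypothetical illustrations, not the actual Iceberg `PartitionStatsUtil` API:

```python
# Sketch of incremental partition-level min/max maintenance.
# All structures here are illustrative, not Iceberg APIs.

def update_partition_minmax(prev_stats, changed_partitions, recompute_fn):
    """Carry forward stats for unchanged partitions; recompute only
    the partitions touched since the last snapshot."""
    new_stats = {}
    for partition, minmax in prev_stats.items():
        if partition not in changed_partitions:
            new_stats[partition] = minmax  # reuse, no data-file scan
    for partition in changed_partitions:
        # Expensive path: must read data files (and apply row-level
        # deletes) to get accurate min/max for this partition.
        new_stats[partition] = recompute_fn(partition)
    return new_stats
```

The point of the sketch is that the expensive data-file scan is confined to the changed partitions; everything else is a dictionary copy.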
>>
>> The alternative solution is more lightweight -- storing min/max stats per
>> manifest file (only for manifest_file entries whose content is "data", not
>> "deletes"), and writers can do this in place when they commit to the table.
>> However, the downside is that the size of the *manifest list file* will
>> increase. I'm not sure whether that change is too much to bring in.
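On the reader side, the aggregation this would enable is a single fold over the manifest list. A small sketch, assuming a hypothetical manifest-list schema extension with per-manifest `lower_bound`/`upper_bound` fields (the field names are illustrative, not part of the current spec):

```python
# Sketch: deriving a table-level min/max estimate from per-manifest
# stats that a hypothetical manifest-list extension would carry.

def table_minmax(manifest_entries):
    """Fold per-manifest min/max into a table-level estimate,
    ignoring manifests that track delete files."""
    lo, hi = None, None
    for entry in manifest_entries:
        if entry["content"] != "data":
            continue  # estimated stats are enough for a CBO
        lo = entry["lower_bound"] if lo is None else min(lo, entry["lower_bound"])
        hi = entry["upper_bound"] if hi is None else max(hi, entry["upper_bound"])
    return lo, hi
```

This is one pass over the manifest list only; no manifest or data files are opened.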
>>
>> Thanks,
>> Xingyuan
>>
>> [1] https://github.com/apache/iceberg/issues/11083
>> [2]
>> https://github.com/apache/iceberg/blob/95497abe5579cf492f24ac8c470c7853d59332e9/core/src/main/java/org/apache/iceberg/PartitionStatsUtil.java#L49
>>
>> On Thu, Sep 26, 2024 at 2:57 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>
>>> Hi Xingyuan,
>>> I've been reviewing the partition statistics file, and it seems that
>>> adding partition-level min/max values would be a natural fit within
>>> Partition Statistics File[1], which is one file per snapshot. We could
>>> introduce a few new fields to accommodate these values.
>>>
>>> While this addition could increase the size of the partition statistics
>>> file, given that it’s stored in Parquet format, I believe the impact on the
>>> reader will be minimal. On the writer's side, collecting these metrics may
>>> require some additional work, but since they are generated asynchronously,
>>> it should be manageable.
>>>
>>> One minor limitation is that the async generation may not always reflect
>>> the latest snapshot, but overall, I think this approach should work well.
>>>
>>> [1] https://iceberg.apache.org/spec/#partition-statistics-file
>>>
>>> Yufei
>>>
>>>
>>> On Thu, Sep 26, 2024 at 9:59 AM Xingyuan Lin <linxingyuan1...@gmail.com>
>>> wrote:
>>>
>>>> Hi team,
>>>>
>>>> Just bumping this up. What do you think? Does the alternative
>>>> solution make sense, or is it too much of a spec change?
>>>>
>>>> The goal is to improve the efficiency and effectiveness of the engine's
>>>> CBO. Today, it's a fairly expensive operation for an engine CBO to get
>>>> table stats:
>>>> https://github.com/trinodb/trino/blob/117c74e54131b47adf8c74ee1b6d98bafd1fe859/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsReader.java#L158-L172.
>>>> In one of our production environments (Trino 423 + Iceberg 1.3.0),
>>>> turning on CBO made some queries' planning time exceptionally long, even
>>>> timing out in some cases (exceeding 180 seconds), and put memory pressure
>>>> on the Trino coordinator. There have been many improvements in the Trino
>>>> community to improve the status quo, but a more radical solution would be
>>>> to reduce the amount of computation needed to get table stats during the
>>>> CBO process.
>>>>
>>>> Thanks,
>>>> Xingyuan
>>>>
>>>> On Mon, Sep 23, 2024 at 8:32 PM Xingyuan Lin <linxingyuan1...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Iceberg dev team,
>>>>>
>>>>> Cost-based optimizers need these stats:
>>>>>
>>>>>    - record count
>>>>>    - null count
>>>>>    - number of distinct values (NDVs)
>>>>>    - min/max values
>>>>>    - column sizes
>>>>>
>>>>> Today, to get these stats, an engine must process manifest files,
>>>>> which can be an expensive operation when the table is large (it has many
>>>>> data files, and hence a large volume of manifest files).
>>>>>
>>>>> There has been a proposal to add partition-level stats to improve
>>>>> this: https://github.com/apache/iceberg/issues/8450. However, the
>>>>> current proposal doesn't include min/max values yet. Furthermore,
>>>>> because of row-level delete support, calculating partition-level min/max
>>>>> values is a non-trivial operation: to make them accurate, an engine must
>>>>> process data files.
>>>>>
>>>>> I think an alternative solution is to add these stats at the manifest
>>>>> file level, so that an engine only needs to process a single manifest
>>>>> list file to get them. Furthermore, since estimated stats are good
>>>>> enough for CBOs, engines can ignore manifest files that track delete
>>>>> files.
>>>>>
>>>>> Another benefit of this alternative solution is that updating
>>>>> manifest-file-level stats is a lightweight, additive operation that
>>>>> writers can perform when they ingest data.
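The "additive" property above is what keeps the writer-side cost low: min/max merging is associative, so a committing writer only needs the stats of the manifests it adds. A tiny sketch, with illustrative tuples rather than actual Iceberg structures:

```python
# Sketch: the additive update a writer could perform at commit time,
# assuming each new manifest already knows the min/max of the data
# files it tracks. Structures are illustrative, not Iceberg APIs.

def merge_stats(existing, incoming):
    """Merge two (min, max) pairs; no rescan of older data needed."""
    if existing is None:
        return incoming
    lo1, hi1 = existing
    lo2, hi2 = incoming
    return (min(lo1, lo2), max(hi1, hi2))
```

Because the merge never shrinks the range, deletes would not be reflected until a rewrite, which is consistent with treating these as estimates for the CBO.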
>>>>>
>>>>> What do you think?
>>>>>
>>>>> Thanks,
>>>>> Xingyuan
>>>>>
>>>>
