Good to know about the multi-layer proposal. Could you share a link to it if one exists? I will also draft a short proposal on the manifest-level stats topic in a Google Doc so that folks can review and comment.
Thank you Yufei for your time and input.

On Thu, Sep 26, 2024 at 4:18 PM Yufei Gu <flyrain...@gmail.com> wrote:

> I agree, this approach makes sense and could align well with the
> multi-layer manifest file proposal. Each layer's manifest file could
> potentially hold aggregated metrics, which would streamline the process.
> However, so far, there have only been offline discussions, and no formal
> proposal has been drafted yet. That said, it seems like a strong candidate
> for inclusion in the v4 spec.
>
> Yufei
>
> On Thu, Sep 26, 2024 at 4:01 PM Xingyuan Lin <linxingyuan1...@gmail.com> wrote:
>
>> Thanks Yufei for taking a look.
>>
>> Yes, I think adding the min/max values to partition-level statistics
>> would also work; in fact, it has already been proposed in [1]. However,
>> my concern is that calculating partition-level min/max values would be an
>> expensive operation because of row-level delete support ([2]): instead of
>> processing metadata files, the engine must process data files to get
>> accurate partition-level min/max values. If we do want partition-level
>> min/max values, it would be better to have an incremental utility method
>> that reads in existing results and computes incrementally -- that is, it
>> skips recomputing the min/max stats of partitions that have not changed
>> since the last snapshot.
>>
>> The alternative solution is more lightweight -- store min/max stats per
>> manifest file (only for manifest_file entries whose content is "data"
>> rather than "deletes"), which writers can do in place when they commit to
>> a table. However, the downside is that the size of the *manifest list
>> file* will increase. I'm not sure whether this change is too large to
>> bring in.
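The writer-side, in-place aggregation described above could be sketched roughly as follows. All class and field names here are hypothetical illustrations of the proposal, not the actual Iceberg spec or API:

```python
# Hypothetical sketch: at commit time, the writer aggregates each data
# manifest's per-column min/max from its data files' existing stats and
# records the result on the manifest-list entry. Illustrative names only.
from dataclasses import dataclass, field

@dataclass
class DataFile:
    lower_bounds: dict  # column id -> min value (already tracked per file)
    upper_bounds: dict  # column id -> max value

@dataclass
class ManifestEntry:
    content: str        # "data" or "deletes"
    data_files: list
    lower_bounds: dict = field(default_factory=dict)  # proposed new field
    upper_bounds: dict = field(default_factory=dict)  # proposed new field

def attach_manifest_stats(manifest: ManifestEntry) -> None:
    """Additive, in-place aggregation done once when the manifest is written."""
    if manifest.content != "data":  # skip manifests that track delete files
        return
    for f in manifest.data_files:
        for col, lo in f.lower_bounds.items():
            cur = manifest.lower_bounds.get(col)
            manifest.lower_bounds[col] = lo if cur is None else min(cur, lo)
        for col, hi in f.upper_bounds.items():
            cur = manifest.upper_bounds.get(col)
            manifest.upper_bounds[col] = hi if cur is None else max(cur, hi)
```

Since the writer already has the per-file bounds in hand when it creates the manifest, this adds little work to the commit path.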
>>
>> Thanks,
>> Xingyuan
>>
>> [1] https://github.com/apache/iceberg/issues/11083
>> [2] https://github.com/apache/iceberg/blob/95497abe5579cf492f24ac8c470c7853d59332e9/core/src/main/java/org/apache/iceberg/PartitionStatsUtil.java#L49
>>
>> On Thu, Sep 26, 2024 at 2:57 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>
>>> Hi Xingyuan,
>>>
>>> I've been reviewing the partition statistics file, and it seems that
>>> adding partition-level min/max values would be a natural fit within the
>>> Partition Statistics File [1], which is one file per snapshot. We could
>>> introduce a few new fields to accommodate these values.
>>>
>>> While this addition could increase the size of the partition statistics
>>> file, given that it's stored in Parquet format, I believe the impact on
>>> the reader will be minimal. On the writer's side, collecting these
>>> metrics may require some additional work, but since they are generated
>>> asynchronously, it should be manageable.
>>>
>>> One minor limitation is that the async generation may not always
>>> reflect the latest snapshot, but overall, I think this approach should
>>> work well.
>>>
>>> [1] https://iceberg.apache.org/spec/#partition-statistics-file
>>>
>>> Yufei
>>>
>>> On Thu, Sep 26, 2024 at 9:59 AM Xingyuan Lin <linxingyuan1...@gmail.com> wrote:
>>>
>>>> Hi team,
>>>>
>>>> Just bumping this up. What do you think of this? Does the alternative
>>>> solution make sense, or is it too much of a spec change?
>>>>
>>>> The goal is to improve the engine CBO's efficiency and effectiveness.
>>>> Today, it's a fairly expensive operation for an engine CBO to get
>>>> table stats:
>>>> https://github.com/trinodb/trino/blob/117c74e54131b47adf8c74ee1b6d98bafd1fe859/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsReader.java#L158-L172
>>>> In one of our production environments (Trino 423 + Iceberg 1.3.0),
>>>> turning on CBO made some queries' planning time exceptionally long --
>>>> even timing out in some cases (exceeding 180 seconds) -- and put
>>>> memory pressure on the Trino coordinator. There have been many
>>>> improvements in the Trino community to improve the status quo, but a
>>>> more radical solution would be to reduce the amount of computation
>>>> needed to get table stats during the CBO process.
>>>>
>>>> Thanks,
>>>> Xingyuan
>>>>
>>>> On Mon, Sep 23, 2024 at 8:32 PM Xingyuan Lin <linxingyuan1...@gmail.com> wrote:
>>>>
>>>>> Hi Iceberg dev team,
>>>>>
>>>>> Cost-based optimizers need these stats:
>>>>>
>>>>> - record count
>>>>> - null count
>>>>> - number of distinct values (NDVs)
>>>>> - min/max values
>>>>> - column sizes
>>>>>
>>>>> Today, to get these stats, an engine must process manifest files,
>>>>> which can be an expensive operation when the table is large (it has
>>>>> many data files, and hence large manifest files).
>>>>>
>>>>> There has been a proposal to add partition-level stats to improve
>>>>> this: https://github.com/apache/iceberg/issues/8450. However, the
>>>>> current solution doesn't yet include min/max values. Furthermore,
>>>>> because of row-level delete support, calculating partition-level
>>>>> min/max values is a non-trivial operation: to make them accurate, an
>>>>> engine must process data files.
>>>>>
>>>>> I think an alternative solution is to add these stats at the
>>>>> manifest file level, so that an engine only needs to process a single
>>>>> manifest list file to get them. Furthermore, since estimated stats
>>>>> are good enough for CBOs, engines can ignore manifest files that
>>>>> track delete files.
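If the manifest list carried per-manifest bounds as proposed, the reader side could derive table-level estimates from that single file along these lines. This is a sketch with hypothetical entry fields, not the actual Iceberg or Trino API:

```python
# Hypothetical reader-side sketch: a CBO estimates table-level min/max by
# scanning only the manifest list, skipping manifests that track delete
# files (estimates are good enough for cost-based optimization).
def table_min_max(manifest_list: list) -> dict:
    """manifest_list: entries shaped like
    {"content": "data", "lower_bounds": {col: v}, "upper_bounds": {col: v}}."""
    table_lo, table_hi = {}, {}
    for entry in manifest_list:
        if entry["content"] != "data":  # ignore delete manifests entirely
            continue
        for col, lo in entry["lower_bounds"].items():
            table_lo[col] = lo if col not in table_lo else min(table_lo[col], lo)
        for col, hi in entry["upper_bounds"].items():
            table_hi[col] = hi if col not in table_hi else max(table_hi[col], hi)
    return {"lower_bounds": table_lo, "upper_bounds": table_hi}
```

The key cost property is that this touches one file per snapshot (the manifest list) rather than every manifest file.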
>>>>>
>>>>> Another benefit of this alternative solution is that updating
>>>>> manifest-file-level stats is a lightweight, additive operation, and
>>>>> writers can do it when they ingest data.
>>>>>
>>>>> What do you think?
>>>>>
>>>>> Thanks,
>>>>> Xingyuan
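The incremental partition-stats utility suggested earlier in the thread (skip partitions unchanged since the last snapshot) might look roughly like this. Everything here is illustrative, not the actual PartitionStatsUtil API; `compute_partition_min_max` stands in for the expensive routine that must scan data files:

```python
# Sketch of incremental partition min/max maintenance: carry forward the
# previous snapshot's per-partition results and recompute only partitions
# touched (new data or delete files) since that snapshot.
def incremental_partition_stats(prev_stats, changed_partitions,
                                compute_partition_min_max):
    """prev_stats: {partition_key: (min, max)} from the last snapshot.
    changed_partitions: partitions with new data/delete files since then.
    compute_partition_min_max: hypothetical expensive data-file scan."""
    stats = dict(prev_stats)  # untouched partitions are reused as-is
    for part in changed_partitions:
        stats[part] = compute_partition_min_max(part)  # expensive path
    return stats
```

The savings come from the common case where a commit touches only a few partitions, so the expensive data-file scan runs for those alone.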