Thanks Yufei for taking a look. Yes, I think adding the min/max values to partition-level statistics would also work. In fact, it has been proposed in [1]. However, my concern was that calculating partition-level min/max values would be an expensive operation because of row-level delete support ([2]): instead of processing metadata files, the engine must process data files to get accurate partition-level min/max values. If we do want partition-level min/max values, it would be better to have an incremental utility method that reads in existing results and computes incrementally -- "incrementally" meaning that it does not recompute the min/max stats of partitions that have not changed since the last snapshot.
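To make the idea concrete, here is a rough sketch of such an incremental utility. All names and types are hypothetical, not Iceberg's actual API; the point is only that unchanged partitions carry their stats forward, so only changed partitions pay the data-file-scan cost:

```python
from dataclasses import dataclass

# Hypothetical sketch of an incremental partition-stats utility.
# Only partitions that changed since the last snapshot are re-scanned.

@dataclass
class PartitionStats:
    min_value: int
    max_value: int

def refresh_partition_stats(previous, changed_partitions, compute_from_data_files):
    # Carry forward stats for partitions untouched since the last snapshot.
    refreshed = dict(previous)
    for partition in changed_partitions:
        # With row-level deletes, metadata alone is not enough here:
        # accurate min/max requires scanning the partition's data files.
        refreshed[partition] = compute_from_data_files(partition)
    return refreshed
```

In this sketch, `compute_from_data_files` stands in for the expensive full scan; the win is simply that it runs per changed partition rather than for the whole table.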
The alternative solution is more lightweight -- storing min/max stats per manifest file (only for manifest files whose content is "data" rather than "deletes"), and writers can do this in place when they commit to the table. However, the downside is that the size of the *manifest list file* will increase. Not sure if this change is too much to bring in.

Thanks,
Xingyuan

[1] https://github.com/apache/iceberg/issues/11083
[2] https://github.com/apache/iceberg/blob/95497abe5579cf492f24ac8c470c7853d59332e9/core/src/main/java/org/apache/iceberg/PartitionStatsUtil.java#L49

On Thu, Sep 26, 2024 at 2:57 PM Yufei Gu <flyrain...@gmail.com> wrote:

> Hi Xingyuan,
>
> I've been reviewing the partition statistics file, and it seems that
> adding partition-level min/max values would be a natural fit within the
> Partition Statistics File [1], which is one file per snapshot. We could
> introduce a few new fields to accommodate these values.
>
> While this addition could increase the size of the partition statistics
> file, given that it's stored in Parquet format, I believe the impact on
> the reader will be minimal. On the writer's side, collecting these
> metrics may require some additional work, but since they are generated
> asynchronously, it should be manageable.
>
> One minor limitation is that the async generation may not always reflect
> the latest snapshot, but overall, I think this approach should work well.
>
> [1] https://iceberg.apache.org/spec/#partition-statistics-file
>
> Yufei
>
> On Thu, Sep 26, 2024 at 9:59 AM Xingyuan Lin <linxingyuan1...@gmail.com>
> wrote:
>
>> Hi team,
>>
>> Just bumping this up. What do you think of this? Does the alternative
>> solution make sense, or is it too much of a spec change?
>>
>> The goal is to improve the engine CBO's efficiency and effectiveness.
>> Today, it's a fairly expensive operation for an engine's CBO to get
>> table stats:
>> https://github.com/trinodb/trino/blob/117c74e54131b47adf8c74ee1b6d98bafd1fe859/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsReader.java#L158-L172
>> In one of our production environments (Trino 423 + Iceberg 1.3.0), when
>> we turn on CBO, we see some queries' planning time getting exceptionally
>> long, even timing out in some cases (exceeding 180 seconds), as well as
>> memory pressure on the Trino coordinator. There have been many
>> improvements in the Trino community to improve the status quo, but a
>> more radical solution would be to reduce the amount of computation
>> needed to get table stats during the CBO process.
>>
>> Thanks,
>> Xingyuan
>>
>> On Mon, Sep 23, 2024 at 8:32 PM Xingyuan Lin <linxingyuan1...@gmail.com>
>> wrote:
>>
>>> Hi Iceberg dev team,
>>>
>>> Cost-based optimizers need these stats:
>>>
>>> - record count
>>> - null count
>>> - number of distinct values (NDVs)
>>> - min/max values
>>> - column sizes
>>>
>>> Today, to get these stats, an engine must process manifest files,
>>> which can be an expensive operation when the table is large (it has
>>> many data files, and hence many large manifest files).
>>>
>>> There has been a proposal to add partition-level stats to improve
>>> this: https://github.com/apache/iceberg/issues/8450. However, the
>>> current solution doesn't include min/max values, etc. yet.
>>> Furthermore, because of row-level delete support, calculating
>>> partition-level min/max values seems non-trivial: to make them
>>> accurate, it seems that an engine must process data files.
>>>
>>> I think an alternative solution is to add these stats at the manifest
>>> file level, so that an engine only needs to process a single manifest
>>> list file to get them. Furthermore, since estimated stats are good
>>> enough for CBOs, engines can ignore manifest files that track delete
>>> files.
>>>
>>> Another benefit of this alternative solution is that updating
>>> manifest-file-level stats is a lightweight additive operation, and
>>> writers can do it when they ingest data.
>>>
>>> What do you think?
>>>
>>> Thanks,
>>> Xingyuan
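
For readers of this thread, the manifest-list-level idea discussed above could look roughly like the following sketch. The fields are hypothetical -- the current manifest-list schema does not carry per-manifest min/max values -- and the point is that an engine would read one manifest list, skip manifests tracking delete files, and aggregate, accepting that the result is an estimate (good enough for CBO purposes):

```python
from dataclasses import dataclass

# Hypothetical sketch: derive table-level min/max for one column by
# scanning only manifest-list entries, skipping delete manifests.

@dataclass
class ManifestFileEntry:
    content: str      # "data" or "deletes"
    min_value: int    # hypothetical per-manifest lower bound for a column
    max_value: int    # hypothetical per-manifest upper bound for a column

def estimate_table_min_max(manifest_list):
    # Ignore manifests that track delete files; bounds may therefore be
    # wider than the true post-delete values, which is acceptable for
    # cost-based-optimizer estimates.
    data_manifests = [m for m in manifest_list if m.content == "data"]
    if not data_manifests:
        return None
    return (min(m.min_value for m in data_manifests),
            max(m.max_value for m in data_manifests))
```

Because only the (single, comparatively small) manifest list is read, this avoids opening the manifest files themselves, which is the expensive step the thread is trying to eliminate.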