Hi Iceberg dev team,

Cost-based optimizers need these stats:

   - record count
   - null count
   - number of distinct values (NDVs)
   - min/max values
   - column sizes

Today, to get these stats, an engine must process manifest files, which can
be an expensive operation when the table is large (has a lot of data files,
hence resulting in large-scale manifest files).

There has been a proposal to add partition-level stats to improve this:
https://github.com/apache/iceberg/issues/8450. However, the current
solution doesn't include min/max values, etc. yet. Furthermore, because of
row-level delete support, it seems a non-trivial operation to calculate
partition-level min/max values. To make them accurate, it seems that an
engine must process data files.

I think an alternative solution is to add these stats at manifest file
level, so that an engine only needs to process a single manifest list file
to get them. Furthermore, since estimated stats are good enough for CBOs,
the engines can ignore manifest files that track delete files.

Another benefit of this alternative solution is that it's a lightweight
additive operation to update manifest-file-level stats, and the writers can
do this when they ingest data.

What do you think?

Thanks,
Xingyuan

Reply via email to