Thanks for proposing this.

My main concern is that this seems aimed less at standardizing this
metadata than at passing existing Hive structures through a different
channel. I commented on the PR, but I'll carry it over here for this
discussion.

Iceberg already supports tracking lower bounds, upper bounds, value
counts, null value counts, and NaN counts per column in table metadata. The
existing stats use column ID rather than column name so that stats are
compatible with schema evolution. These are kept at the file level and we
are adding partition stats files to handle aggregates at the partition
level. We also use snapshot summaries for similar purposes at the table
level. The proposed Hive structure doesn't seem like it adds much, and, if
anything, would be a feature regression because it doesn't use column IDs
and has extra metadata that would not be accurate (like `isPrimaryKey`).
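
To make the column ID point concrete, here is a minimal sketch of what
happens to a stat after a column rename. It is not Iceberg code: the class
StatsKeyDemo, both lookup helpers, and the plain-map "schema" are
hypothetical stand-ins, and the column names and counts are made up for
illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names are Iceberg APIs.
public class StatsKeyDemo {

    // ID-keyed lookup: resolve the current column name to its field ID
    // through the current schema, then read the stat stored under that ID.
    static Long nullCountById(Map<String, Integer> nameToId,
                              Map<Integer, Long> statsById,
                              String columnName) {
        Integer id = nameToId.get(columnName);
        return id == null ? null : statsById.get(id);
    }

    // Name-keyed lookup: the stat is stored under whatever the column
    // was called when the stats were written.
    static Long nullCountByName(Map<String, Long> statsByName,
                                String columnName) {
        return statsByName.get(columnName);
    }

    public static void main(String[] args) {
        // Stats written while field ID 3 was still named "ts".
        Map<Integer, Long> statsById = new HashMap<>();
        statsById.put(3, 17L);
        Map<String, Long> statsByName = new HashMap<>();
        statsByName.put("ts", 17L);

        // Current schema after renaming "ts" to "event_time".
        // The field ID is unchanged by the rename.
        Map<String, Integer> nameToId = new HashMap<>();
        nameToId.put("event_time", 3);

        // ID-keyed stats are still found under the new name.
        System.out.println(nullCountById(nameToId, statsById, "event_time"));
        // Name-keyed stats are lost for the renamed column.
        System.out.println(nullCountByName(statsByName, "event_time"));
    }
}
```

The first lookup prints 17 and the second prints null. The name-keyed form
also has the opposite failure: if "ts" were dropped and a new, unrelated
column named "ts" were added later, the old stat would be attributed to the
new column.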

The Puffin format also has an NDV sketch that is already in use (thanks,
Piotr!) so that part of the proposal seems to duplicate existing
functionality as well.

The KLL sketch seems useful to me, but I would separate that from the Hive
blob.

Ryan

On Tue, Feb 4, 2025 at 8:16 AM Denys Kuzmenko <dkuzme...@apache.org> wrote:

> Hi Gabor,
>
> Thanks for your feedback!
>
> > In that use case however, we'd lose the stats we got previously from HMS
>
> For Iceberg tables, Hive computes the same stats object that was
> previously persisted to HMS and stores it in a Puffin file. So there
> shouldn't be any changes for Impala other than the stats source.
>
> > We could gather all the column stats needed by different engines,
> standardize them into the Iceberg repo
>
> That is the option I mentioned above, for which I provided the Hive
> schema currently used to store column statistics.
> I can create a google doc to continue the discussion in that direction.
>
> > Aren't partition stats just a more granular form of column stats?
>
> In Iceberg 1.7, Ajantha added a helper method to compute basic
> partition stats for a given snapshot:
> Collection<PartitionStats> computeStats(Table table, Snapshot snapshot)
>
> Hopefully, we'll get reader and writer support in 1.8:
> https://github.com/apache/iceberg/pull/11216
>
> A similar functionality is needed for column stats.
> For a partitioned table, we need to create one ColumnStatistics object
> per partition and store each as a separate blob in a Puffin file.
>
> During query planning, we'll compute aggregated stats over the pruned
> partition list and use those.
>
> Regards,
> Denys
>
