Thanks, Denys, for starting this discussion!

Thanks, Ryan. I agree it would be better to have engine-agnostic data structures in the blobs we maintain in the Iceberg project, at least for the "standard" blob types. Note, however, that the Puffin format is intentionally open-ended: an application can add its own sketch/blob types without having them standardized first. This is helpful for experimenting with new sketch types, and it also removes the need to standardize blob types that we don't expect to be universally useful.
Best,
Piotr

On Tue, 4 Feb 2025 at 18:49, rdb...@gmail.com <rdb...@gmail.com> wrote:

> Thanks for proposing this.
>
> My main concern is that this doesn't seem to be aimed at standardizing this metadata, but rather a way to pass existing Hive structures in a different way. I commented on the PR, but I'll carry it over here for this discussion.
>
> Iceberg already supports tracking column lower bounds, upper bounds, column counts, null value counts, and NaN counts per column in table metadata. The existing stats use column ID rather than column name so that stats are compatible with schema evolution. These are kept at the file level, and we are adding partition stats files to handle aggregates at the partition level. We also use snapshot summaries for similar purposes at the table level. The proposed Hive structure doesn't seem like it adds much and, if anything, would be a feature regression because it doesn't use column IDs and has extra metadata that would not be accurate (like `isPrimaryKey`).
>
> The Puffin format also has an NDV sketch that is already in use (thanks, Piotr!), so that seems to duplicate existing functionality as well.
>
> The KLL sketch seems useful to me, but I would separate that from the Hive blob.
>
> Ryan
>
> On Tue, Feb 4, 2025 at 8:16 AM Denys Kuzmenko <dkuzme...@apache.org> wrote:
>
>> Hi Gabor,
>>
>> Thanks for your feedback!
>>
>> > In that use case however, we'd lose the stats we got previously from HMS
>>
>> For Iceberg tables, Hive computes and stores in a Puffin file the same stats object that was previously persisted to HMS. So there shouldn't be any changes for Impala other than switching the stats source.
>>
>> > We could gather all the column stats needed by different engines, standardize them into the Iceberg repo
>>
>> That is an option I mentioned above, and I provided the Hive schema currently used to store column statistics.
>> I can create a Google doc to continue the discussion in that direction.
>>
>> > Aren't partition stats just a more granular form of column stats?
>>
>> In Iceberg 1.7, Ajantha added a helper method to compute the basic partition stats for a given snapshot:
>>
>> Collection<PartitionStats> computeStats(Table table, Snapshot snapshot)
>>
>> Hopefully, we'll get reader and writer support in 1.8: https://github.com/apache/iceberg/pull/11216
>>
>> Similar functionality is needed for column stats. In the case of a partitioned table, we need to create one ColumnStatistics object per partition and store each as a separate blob in a Puffin file.
>>
>> During query planning, we'll compute and use aggregated stats based on the pruned partition list.
>>
>> Regards,
>> Denys
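[Editor's note] To make the planning-time aggregation step above concrete, here is a minimal sketch in plain Java. The `PartitionColumnStats` and `AggregatedStats` records are hypothetical stand-ins for the deserialized contents of the per-partition Puffin blobs, not the actual Iceberg API:

```java
import java.util.List;
import java.util.Set;

// Hypothetical per-partition column stats for one column, standing in
// for the contents of a per-partition Puffin blob.
record PartitionColumnStats(String partition, long nullCount, long ndv) {}

// Aggregated stats the planner would consume.
record AggregatedStats(long nullCount, long ndvUpperBound) {}

class StatsAggregator {
  // Merge stats only for the partitions that survive pruning.
  // Null counts add exactly; NDVs are not additive, so summing
  // per-partition NDVs yields only an upper bound (an engine would
  // instead merge the underlying sketches for a real estimate).
  static AggregatedStats aggregate(List<PartitionColumnStats> all, Set<String> pruned) {
    long nulls = 0;
    long ndv = 0;
    for (PartitionColumnStats s : all) {
      if (pruned.contains(s.partition())) {
        nulls += s.nullCount();
        ndv += s.ndv();
      }
    }
    return new AggregatedStats(nulls, ndv);
  }
}
```

This also illustrates why storing mergeable sketches (e.g. Theta/HLL for NDV) per partition is more useful than storing scalar values: scalars can only be combined into bounds, while sketches can be unioned into a proper estimate over any pruned partition subset.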