Hi Gabor,

Thanks for your feedback!

> In that use case however, we'd lose the stats we got previously from HMS

For Iceberg tables, Hive computes the same stats object that was previously 
persisted to HMS and stores it in a Puffin file. So nothing should change for 
Impala other than the stats source.

> We could gather all the column stats needed by different engines, standardize 
> them into the Iceberg repo

That is the option I mentioned above, where I also provided the Hive schema 
currently used to store column statistics. 
I can create a Google doc to continue the discussion in that direction.

> Aren't partition status just a more granular way of column stats. 

In Iceberg 1.7, Ajantha added a helper method that computes basic partition 
stats for a given snapshot:

    Collection<PartitionStats> computeStats(Table table, Snapshot snapshot)

Hopefully, we'll get reader and writer support in 1.8: 
https://github.com/apache/iceberg/pull/11216

Similar functionality is needed for column stats. 
For a partitioned table, we need to create one ColumnStatistics object per 
partition and store each one as a separate blob in a Puffin file.

During query planning, we'll compute and use stats aggregated over the 
pruned partition list.
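
To make the aggregation step concrete, here is a rough sketch in plain Java. 
It is only an illustration: PartitionColumnStats, AggregatedColumnStats and 
their fields are hypothetical types I made up for this example, not Iceberg or 
Hive classes, and it assumes a single long-typed column. NDV cannot be summed 
exactly without mergeable sketches, so the example only tracks a lower and an 
upper bound.

```java
import java.util.List;
import java.util.Set;

class StatsAggregation {

    // Hypothetical per-partition stats for one long-typed column
    // (one such object would correspond to one blob in the Puffin file).
    record PartitionColumnStats(String partition, long numRows, long numNulls,
                                long min, long max, long ndv) {}

    // Hypothetical aggregated result used by the planner.
    record AggregatedColumnStats(long numRows, long numNulls, long min, long max,
                                 long ndvLowerBound, long ndvUpperBound) {}

    // Aggregates stats over the partitions that survived pruning.
    // If no partition matches, min/max keep their sentinel values.
    static AggregatedColumnStats aggregate(List<PartitionColumnStats> all,
                                           Set<String> survivingPartitions) {
        long rows = 0, nulls = 0, ndvLow = 0, ndvHigh = 0;
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
        for (PartitionColumnStats s : all) {
            if (!survivingPartitions.contains(s.partition())) {
                continue; // partition was pruned away
            }
            rows += s.numRows();
            nulls += s.numNulls();
            min = Math.min(min, s.min());
            max = Math.max(max, s.max());
            // Distinct values may overlap across partitions, so the true NDV
            // lies between the largest per-partition NDV and their sum.
            ndvLow = Math.max(ndvLow, s.ndv());
            ndvHigh += s.ndv();
        }
        return new AggregatedColumnStats(rows, nulls, min, max, ndvLow, ndvHigh);
    }

    public static void main(String[] args) {
        List<PartitionColumnStats> stats = List.of(
            new PartitionColumnStats("p=1", 100, 5, 1, 50, 40),
            new PartitionColumnStats("p=2", 200, 0, 30, 90, 60),
            new PartitionColumnStats("p=3", 50, 10, -5, 10, 8));
        System.out.println(aggregate(stats, Set.of("p=1", "p=2")));
    }
}
```

In practice the NDV part would more likely merge per-partition sketches 
instead of keeping bounds, but that depends on what we decide to store in the 
blob.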

Regards,
Denys
