Isn't this addressed by the partition statistics feature, or do you want to have one row for the entire table?
On Fri, Aug 2, 2024, 10:47 AM huaxin gao <huaxin.ga...@gmail.com> wrote:

> I would like to initiate a discussion on implementing a table-level
> statistics file to store column statistics, specifically min, max, and
> null counts. The original discussion can be found in this Slack thread:
> https://apache-iceberg.slack.com/archives/C03LG1D563F/p1676395480005779
>
> In Spark 3.4, I introduced ColumnStatistics
> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java#L33>
> within the Statistics interface
> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Statistics.java#L38>,
> enabling Iceberg to implement and report these metrics to Spark. As a
> result, Spark can use the column statistics for cost-based optimization
> (CBO), including join reordering.
>
> Here is how the process works:
>
> - Use the ANALYZE TABLE command to compute column statistics and store
> them appropriately. The SQL syntax is:
>
>   ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS col1, col2…
>
> This command triggers a ComputeTableStatsSparkAction, which computes a
> data sketch for NDV (number of distinct values) and saves it in the
> StatisticsFile. Here is the ComputeTableStatsAction PR:
> https://github.com/apache/iceberg/pull/10288
>
> Additionally, the ANALYZE TABLE command needs to calculate table-level
> min, max, and null counts for columns such as col1 and col2 using the
> manifest files, and these table-level statistics must then be saved.
> Currently, we do not have a designated location to store these
> table-level column statistics. The StatisticsFile
> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/StatisticsFile.java>,
> being tightly coupled with the Puffin file, does not seem to be a
> suitable place for these column stats.
>
> - If Spark's CBO is on and SparkSQLProperties.REPORT_COLUMN_STATS
> <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java#L95>
> is enabled, the NDV will be retrieved from the StatisticsFile, and the
> min/max/null counts will be retrieved from the table-level column
> statistics file. These statistics are used to construct the Statistics
> object in estimateStatistics
> <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L186>.
> Spark then uses the statistics returned from estimateStatistics
> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala#L86>
> for cost-based optimization.
>
> Please share your thoughts on this.
>
> Thanks,
>
> Huaxin
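For concreteness, here is a rough sketch of how the ANALYZE TABLE path could fold the per-file lower/upper bounds and null-value counts from the manifests into table-level min/max/null counts. This is only an illustration under assumptions, not part of the proposal or of Iceberg's API: the class and method names (TableColumnStatsSketch, computeColumnStats, ColStats) are made up, it only handles top-level primitive columns, and it ignores delete files and bound truncation, so the results would be approximate.

```java
import java.nio.ByteBuffer;
import java.util.Comparator;
import java.util.Map;

import org.apache.iceberg.DataFile;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.types.Comparators;
import org.apache.iceberg.types.Conversions;
import org.apache.iceberg.types.Types;

public class TableColumnStatsSketch {  // illustrative name, not an Iceberg class

  /** Illustrative holder for one column's table-level stats. */
  public static class ColStats {
    Object min;
    Object max;
    long nullCount;
  }

  // Walk the data files referenced by the manifests and fold their per-file
  // bounds and null-value counts into one table-level min/max/nullCount.
  public static ColStats computeColumnStats(Table table, String column) throws Exception {
    Types.NestedField field = table.schema().findField(column);
    int fieldId = field.fieldId();
    Comparator<Object> cmp = Comparators.forType(field.type().asPrimitiveType());

    ColStats stats = new ColStats();
    try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
      for (FileScanTask task : tasks) {
        DataFile file = task.file();

        Map<Integer, Long> nullCounts = file.nullValueCounts();
        if (nullCounts != null && nullCounts.containsKey(fieldId)) {
          stats.nullCount += nullCounts.get(fieldId);
        }

        Map<Integer, ByteBuffer> lowers = file.lowerBounds();
        if (lowers != null && lowers.containsKey(fieldId)) {
          Object lower = Conversions.fromByteBuffer(field.type(), lowers.get(fieldId));
          if (stats.min == null || cmp.compare(lower, stats.min) < 0) {
            stats.min = lower;
          }
        }

        Map<Integer, ByteBuffer> uppers = file.upperBounds();
        if (uppers != null && uppers.containsKey(fieldId)) {
          Object upper = Conversions.fromByteBuffer(field.type(), uppers.get(fieldId));
          if (stats.max == null || cmp.compare(upper, stats.max) > 0) {
            stats.max = upper;
          }
        }
      }
    }
    return stats;
  }
}
```

However the numbers are computed, the open question in the proposal is where the resulting table-level values get persisted, since the StatisticsFile/Puffin pairing is already used for the NDV sketches.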
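And on the read side, a minimal sketch of the shape of the objects that estimateStatistics could hand back to Spark when CBO and REPORT_COLUMN_STATS are enabled, using the DSv2 Statistics and ColumnStatistics interfaces linked above. The helper names (columnStats, tableStats) and all values are placeholders; only the two Spark interfaces are real.

```java
import java.util.Map;
import java.util.Optional;
import java.util.OptionalLong;

import org.apache.spark.sql.connector.expressions.NamedReference;
import org.apache.spark.sql.connector.read.Statistics;
import org.apache.spark.sql.connector.read.colstats.ColumnStatistics;

class ColumnStatsReportingSketch {  // illustrative name

  // Wrap one column's NDV / min / max / null count in Spark's ColumnStatistics
  // so the CBO can see them.
  static ColumnStatistics columnStats(long ndv, Object min, Object max, long nullCount) {
    return new ColumnStatistics() {
      @Override
      public OptionalLong distinctCount() {
        return OptionalLong.of(ndv);  // NDV, e.g. from the Puffin/StatisticsFile sketch
      }

      @Override
      public Optional<Object> min() {
        return Optional.ofNullable(min);  // table-level min from the proposed stats file
      }

      @Override
      public Optional<Object> max() {
        return Optional.ofNullable(max);  // table-level max from the proposed stats file
      }

      @Override
      public OptionalLong nullCount() {
        return OptionalLong.of(nullCount);
      }
    };
  }

  // The kind of Statistics object an estimateStatistics() implementation could build.
  static Statistics tableStats(long rowCount, long sizeInBytes,
                               Map<NamedReference, ColumnStatistics> colStats) {
    return new Statistics() {
      @Override
      public OptionalLong sizeInBytes() {
        return OptionalLong.of(sizeInBytes);
      }

      @Override
      public OptionalLong numRows() {
        return OptionalLong.of(rowCount);
      }

      @Override
      public Map<NamedReference, ColumnStatistics> columnStats() {
        return colStats;
      }
    };
  }
}
```

The columnStats() map would be keyed by column reference (e.g. Expressions.column("col1")), which is how Spark matches the reported statistics to attributes during cost-based optimization.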