I would like to initiate a discussion on implementing a table-level statistics file to store column statistics, specifically min, max, and null counts. The original discussion can be found in this Slack thread: https://apache-iceberg.slack.com/archives/C03LG1D563F/p1676395480005779.
In Spark 3.4, I introduced ColumnStatistics <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java#L33> within the Statistics interface <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Statistics.java#L38>, enabling Iceberg to implement these interfaces and report the metrics to Spark. Spark can then use the column statistics for cost-based optimization (CBO), including join reordering.

Here is how the process operates (rough sketches of each step follow after the list):

- Use the ANALYZE TABLE command to compute column statistics and store them appropriately. The SQL syntax is:

    ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS col1, col2, ...

  This command triggers a ComputeTableStatsSparkAction, which computes a data sketch for NDV (number of distinct values) and saves it in the StatisticsFile. Here is the ComputeTableStatsAction PR <https://github.com/apache/iceberg/pull/10288>. In addition, the ANALYZE TABLE command needs to compute table-level min, max, and null counts for columns such as col1 and col2 from the manifest files, and then save these table-level statistics. Currently, we do not have a designated location to store them: the StatisticsFile <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/StatisticsFile.java>, being tightly coupled with the Puffin file, does not seem to be a suitable repository for these column stats.

- If Spark's CBO is on and SparkSQLProperties.REPORT_COLUMN_STATS <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java#L95> is enabled, the NDV will be retrieved from the StatisticsFile, and the min/max/null counts will be retrieved from the proposed table-level column statistics file. These statistics will be used to construct the Statistics object in estimateStatistics <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L186>. Spark then uses the statistics returned from estimateStatistics <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala#L86> for cost-based optimization.
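To make the first step concrete, here is a minimal sketch of invoking the new action, assuming the builder API from the PR (computeTableStats / columns / execute), which may still change in review; the table identifier "db.tbl" is a placeholder:

    import org.apache.iceberg.Table;
    import org.apache.iceberg.actions.ComputeTableStats;
    import org.apache.iceberg.spark.Spark3Util;
    import org.apache.iceberg.spark.actions.SparkActions;
    import org.apache.spark.sql.SparkSession;

    public class ComputeStatsExample {
      public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().getOrCreate();
        // "db.tbl" is a placeholder table identifier
        Table table = Spark3Util.loadIcebergTable(spark, "db.tbl");

        // Computes an NDV sketch for the given columns and registers the
        // resulting Puffin-backed StatisticsFile on the table
        ComputeTableStats.Result result =
            SparkActions.get(spark)
                .computeTableStats(table)
                .columns("col1", "col2")
                .execute();

        System.out.println("Stats file: " + result.statisticsFile().path());
      }
    }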
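For the table-level min/max/null counts, the values can in principle be folded from the per-file stats that the manifests already carry. A minimal sketch with a hypothetical aggregator class (note it ignores row-level deletes, and lower/upper bounds may be truncated, so the result is conservative bounds rather than exact min/max):

    import java.nio.ByteBuffer;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.FileScanTask;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.io.CloseableIterable;
    import org.apache.iceberg.types.Comparators;
    import org.apache.iceberg.types.Conversions;
    import org.apache.iceberg.types.Type;

    // Hypothetical aggregator: folds per-file stats from the manifests
    // into table-level column statistics keyed by field id
    public class TableColumnStatsAggregator {
      private final Map<Integer, Long> nullCounts = new HashMap<>();
      private final Map<Integer, Object> mins = new HashMap<>();
      private final Map<Integer, Object> maxes = new HashMap<>();

      public void aggregate(Table table) throws Exception {
        try (CloseableIterable<FileScanTask> tasks =
            table.newScan().includeColumnStats().planFiles()) {
          for (FileScanTask task : tasks) {
            DataFile file = task.file();

            if (file.nullValueCounts() != null) {
              file.nullValueCounts().forEach((id, n) -> nullCounts.merge(id, n, Long::sum));
            }

            // lower/upper bounds give conservative min/max: bounds may be
            // truncated (e.g. long strings) and deletes are not applied
            fold(table, file.lowerBounds(), mins, true);
            fold(table, file.upperBounds(), maxes, false);
          }
        }
      }

      private void fold(Table table, Map<Integer, ByteBuffer> bounds,
          Map<Integer, Object> acc, boolean keepSmaller) {
        if (bounds == null) {
          return;
        }
        for (Map.Entry<Integer, ByteBuffer> e : bounds.entrySet()) {
          Type type = table.schema().findType(e.getKey());
          if (type == null || !type.isPrimitiveType()) {
            continue;
          }
          Object value = Conversions.fromByteBuffer(type, e.getValue());
          Comparator<Object> cmp = Comparators.forType(type.asPrimitiveType());
          acc.merge(e.getKey(), value,
              (a, b) -> (cmp.compare(a, b) <= 0) == keepSmaller ? a : b);
        }
      }
    }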
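On the read side, estimateStatistics could then assemble both sources into Spark's DSv2 Statistics. A minimal sketch against the Spark 3.4+ interfaces; the Sketch* class names are hypothetical, and loading the values from the stats files is left out:

    import java.util.Map;
    import java.util.Optional;
    import java.util.OptionalLong;

    import org.apache.spark.sql.connector.expressions.NamedReference;
    import org.apache.spark.sql.connector.read.Statistics;
    import org.apache.spark.sql.connector.read.colstats.ColumnStatistics;

    // Hypothetical holders for the stats reported to Spark: NDV comes from
    // the Puffin StatisticsFile, min/max/null counts from the proposed
    // table-level column statistics file
    class SketchColumnStatistics implements ColumnStatistics {
      private final OptionalLong ndv;
      private final Optional<Object> min;
      private final Optional<Object> max;
      private final OptionalLong nullCount;

      SketchColumnStatistics(
          OptionalLong ndv, Optional<Object> min, Optional<Object> max, OptionalLong nullCount) {
        this.ndv = ndv;
        this.min = min;
        this.max = max;
        this.nullCount = nullCount;
      }

      @Override
      public OptionalLong distinctCount() {
        return ndv;
      }

      @Override
      public Optional<Object> min() {
        return min;
      }

      @Override
      public Optional<Object> max() {
        return max;
      }

      @Override
      public OptionalLong nullCount() {
        return nullCount;
      }
    }

    class SketchStatistics implements Statistics {
      private final long sizeInBytes;
      private final long numRows;
      private final Map<NamedReference, ColumnStatistics> colStats;

      SketchStatistics(long sizeInBytes, long numRows,
          Map<NamedReference, ColumnStatistics> colStats) {
        this.sizeInBytes = sizeInBytes;
        this.numRows = numRows;
        this.colStats = colStats;
      }

      @Override
      public OptionalLong sizeInBytes() {
        return OptionalLong.of(sizeInBytes);
      }

      @Override
      public OptionalLong numRows() {
        return OptionalLong.of(numRows);
      }

      @Override
      public Map<NamedReference, ColumnStatistics> columnStats() {
        return colStats;
      }
    }

With something along these lines, estimateStatistics would return new SketchStatistics(sizeInBytes, numRows, colStats), with colStats keyed by Expressions.column(name) for each analyzed column.

Please share your thoughts on this.

Thanks,
Huaxin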