I would like to initiate a discussion on implementing a table-level
statistics file to store column statistics, specifically min, max, and null
counts. The original discussion can be found in this Slack thread:
https://apache-iceberg.slack.com/archives/C03LG1D563F/p1676395480005779.

In Spark 3.4, I introduced Column Statistics
<https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java#L33>
within the Statistics interface
<https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Statistics.java#L38>,
enabling Iceberg to implement and report these metrics to Spark. As a
result, Spark can utilize the column statistics for cost-based optimization
(CBO), including join reordering.
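
For reference, the interface is small; a minimal sketch of an
implementation (the class name and constructor here are illustrative, not
Iceberg's actual code) looks like this:

import java.util.Optional;
import java.util.OptionalLong;
import org.apache.spark.sql.connector.read.colstats.ColumnStatistics;

// Holds the per-column stats Spark's CBO consumes; every method on the
// interface has an empty default, so only the known stats are overridden.
class ColumnStatsHolder implements ColumnStatistics {
  private final OptionalLong distinctCount;
  private final Optional<Object> min;
  private final Optional<Object> max;
  private final OptionalLong nullCount;

  ColumnStatsHolder(OptionalLong distinctCount, Optional<Object> min,
      Optional<Object> max, OptionalLong nullCount) {
    this.distinctCount = distinctCount;
    this.min = min;
    this.max = max;
    this.nullCount = nullCount;
  }

  @Override
  public OptionalLong distinctCount() { return distinctCount; }

  @Override
  public Optional<Object> min() { return min; }

  @Override
  public Optional<Object> max() { return max; }

  @Override
  public OptionalLong nullCount() { return nullCount; }
}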

Here’s how the process operates:


   - Use the ANALYZE TABLE command to compute column statistics and store
   them appropriately. The SQL syntax is:

ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS col1, col2, …

This command triggers a ComputeTableStatsSparkAction, which computes a
data sketch for NDV (number of distinct values) and saves it in the
StatisticsFile. Here is the ComputeTableStatsAction PR
<https://github.com/apache/iceberg/pull/10288>.
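
For illustration, invoking the action directly might look like the
following (a sketch against the API in that PR, assuming a SparkSession
named spark and a loaded Table named table; method names may still change
before the PR merges):

import org.apache.iceberg.Table;
import org.apache.iceberg.actions.ComputeTableStats;
import org.apache.iceberg.spark.actions.SparkActions;

// Compute NDV sketches for the selected columns and register the
// resulting Puffin file as a StatisticsFile on the current snapshot.
ComputeTableStats.Result result =
    SparkActions.get(spark)
        .computeTableStats(table)
        .columns("col1", "col2")
        .execute();

System.out.println(result.statisticsFile());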

Additionally, the ANALYZE TABLE command needs to compute table-level min,
max, and null counts for columns such as col1 and col2 by aggregating the
per-file stats recorded in the manifest files, and then persist them.
Currently, we do not have a designated location to store these table-level
column statistics. The StatisticsFile
<https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/StatisticsFile.java>,
being tightly coupled with the Puffin format, does not seem to be a
suitable home for these column stats.
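
To make the aggregation concrete, here is a rough sketch (my own
illustration, not existing Iceberg code; field lookup and error handling
omitted) of rolling up a table-level null count for one column from the
per-file counts the manifests already track. Min/max would aggregate
lowerBounds()/upperBounds() the same way, decoded with
Conversions.fromByteBuffer:

import java.io.IOException;
import java.util.OptionalLong;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

// Roll up a table-level null count for one column from the per-file
// counts recorded in the manifests. Returns empty if any file is
// missing stats, since the aggregate would then be unknown.
static OptionalLong tableNullCount(Table table, String column) throws IOException {
  int fieldId = table.schema().findField(column).fieldId();
  long nullCount = 0L;
  try (CloseableIterable<FileScanTask> tasks =
      table.newScan().includeColumnStats().planFiles()) {
    for (FileScanTask task : tasks) {
      DataFile file = task.file();
      Long fileNulls = file.nullValueCounts().get(fieldId);
      if (fileNulls == null) {
        return OptionalLong.empty();
      }
      nullCount += fileNulls;
    }
  }
  return OptionalLong.of(nullCount);
}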


   - If Spark's CBO is on and SparkSQLProperties.REPORT_COLUMN_STATS
   <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java#L95>
   is enabled, the NDV will be retrieved from the StatisticsFile and the
   min/max/null counts from the proposed table-level column statistics
   file. These statistics will then be used to construct the Statistics
   object in estimateStatistics
   <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L186>,
   and Spark feeds the statistics returned from estimateStatistics
   <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala#L86>
   into its cost-based optimizer, as sketched below.
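
A rough sketch of how the Iceberg side could assemble that object (the
method name and wiring are illustrative, not the actual SparkScan code;
statsByColumn stands in for values read from the stats files):

import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;
import org.apache.spark.sql.connector.expressions.Expressions;
import org.apache.spark.sql.connector.expressions.NamedReference;
import org.apache.spark.sql.connector.read.Statistics;
import org.apache.spark.sql.connector.read.colstats.ColumnStatistics;

// Assemble the Statistics object that estimateStatistics() hands back to
// Spark; statsByColumn would be populated from the StatisticsFile (NDV)
// and the proposed table-level column statistics file (min/max/nulls).
static Statistics buildStatistics(long numRows, long sizeInBytes,
    Map<String, ColumnStatistics> statsByColumn) {
  Map<NamedReference, ColumnStatistics> colStats = new HashMap<>();
  statsByColumn.forEach((name, stats) ->
      colStats.put(Expressions.column(name), stats));
  return new Statistics() {
    @Override
    public OptionalLong sizeInBytes() { return OptionalLong.of(sizeInBytes); }

    @Override
    public OptionalLong numRows() { return OptionalLong.of(numRows); }

    @Override
    public Map<NamedReference, ColumnStatistics> columnStats() { return colStats; }
  };
}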

Please share your thoughts on this.

Thanks,

Huaxin
