Isn't this addressed by the partition statistics feature, or do you want to
have one row for the entire table?

On Fri, Aug 2, 2024, 10:47 AM huaxin gao <huaxin.ga...@gmail.com> wrote:

> I would like to initiate a discussion on implementing a table-level
> statistics file to store column statistics, specifically min, max, and null
> counts. The original discussion can be found in this Slack thread:
> https://apache-iceberg.slack.com/archives/C03LG1D563F/p1676395480005779.
>
> In Spark 3.4, I introduced Column Statistics
> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java#L33>
> within the Statistics interface
> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Statistics.java#L38>,
> enabling Iceberg to implement and report these metrics to Spark. As a
> result, Spark can utilize the column statistics for cost-based optimization
> (CBO), including join reordering.
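>
> For concreteness, here is a minimal Java sketch of what implementing that
> contract looks like (the class name is hypothetical and only the methods
> relevant to this proposal are shown):
>
>     import java.util.Optional;
>     import java.util.OptionalLong;
>     import org.apache.spark.sql.connector.read.colstats.ColumnStatistics;
>
>     // Hypothetical holder for one column's stats, reported to Spark's CBO.
>     class SketchColumnStats implements ColumnStatistics {
>       private final long ndv;
>       private final Object min;
>       private final Object max;
>       private final long nullCount;
>
>       SketchColumnStats(long ndv, Object min, Object max, long nullCount) {
>         this.ndv = ndv;
>         this.min = min;
>         this.max = max;
>         this.nullCount = nullCount;
>       }
>
>       @Override
>       public OptionalLong distinctCount() { return OptionalLong.of(ndv); }
>
>       @Override
>       public Optional<Object> min() { return Optional.ofNullable(min); }
>
>       @Override
>       public Optional<Object> max() { return Optional.ofNullable(max); }
>
>       @Override
>       public OptionalLong nullCount() { return OptionalLong.of(nullCount); }
>     }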
>
> Here’s how the process operates:
>
>
>    - Use the ANALYZE TABLE command to compute column statistics and store
>    them appropriately. The SQL syntax is:
>
> ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS col1, col2…
>
> This command triggers a ComputeTableStatsSparkAction to compute a data
> sketch for NDV (number of distinct values) and save it in the
> StatisticsFile. Here is the ComputeTableStatsAction PR:
> <https://github.com/apache/iceberg/pull/10288>
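>
> For illustration, invoking the action programmatically might look like the
> following sketch, assuming the API shape in that PR (names may still
> change; 'spark' and 'table' are supplied by the caller):
>
>     import org.apache.iceberg.Table;
>     import org.apache.iceberg.actions.ComputeTableStats;
>     import org.apache.iceberg.spark.actions.SparkActions;
>     import org.apache.spark.sql.SparkSession;
>
>     // Runs the ComputeTableStats action for selected columns; the resulting
>     // StatisticsFile (holding the NDV sketch) is committed to table metadata.
>     static void analyzeColumns(SparkSession spark, Table table) {
>       ComputeTableStats.Result result =
>           SparkActions.get(spark)
>               .computeTableStats(table)
>               .columns("col1", "col2")
>               .execute();
>       System.out.println(result.statisticsFile().path());
>     }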
>
> Additionally, the ANALYZE TABLE command needs to calculate table-level
> min, max, and null counts for columns such as col1 and col2 by
> aggregating the per-file metrics recorded in the manifest files, and
> these table-level statistics then need to be persisted. Currently, we do
> not have a designated location for them: the StatisticsFile
> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/StatisticsFile.java>,
> being tightly coupled with the Puffin file format, does not seem to be a
> suitable repository for these column stats.
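>
> To make the aggregation step concrete, here is a minimal sketch of how the
> table-level values could be derived from the existing per-file metrics
> (purely illustrative, not the proposed implementation: field id 1 stands in
> for col1, the column is assumed to be an int, and delete files are ignored):
>
>     import java.io.IOException;
>     import java.nio.ByteBuffer;
>     import java.util.Map;
>     import org.apache.iceberg.DataFile;
>     import org.apache.iceberg.FileScanTask;
>     import org.apache.iceberg.Table;
>     import org.apache.iceberg.io.CloseableIterable;
>     import org.apache.iceberg.types.Conversions;
>     import org.apache.iceberg.types.Types;
>
>     // Aggregates a table-level null count and min/max for one column by
>     // scanning the per-file metrics kept in the manifests.
>     static void aggregateColumnStats(Table table) throws IOException {
>       Types.IntegerType type = Types.IntegerType.get();
>       long nullCount = 0;
>       Integer min = null;
>       Integer max = null;
>
>       try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
>         for (FileScanTask task : tasks) {
>           DataFile file = task.file();
>           Map<Integer, Long> nulls = file.nullValueCounts();
>           if (nulls != null && nulls.containsKey(1)) {
>             nullCount += nulls.get(1);
>           }
>           Map<Integer, ByteBuffer> lowers = file.lowerBounds();
>           if (lowers != null && lowers.containsKey(1)) {
>             Integer lower = Conversions.fromByteBuffer(type, lowers.get(1));
>             min = (min == null || lower.compareTo(min) < 0) ? lower : min;
>           }
>           Map<Integer, ByteBuffer> uppers = file.upperBounds();
>           if (uppers != null && uppers.containsKey(1)) {
>             Integer upper = Conversions.fromByteBuffer(type, uppers.get(1));
>             max = (max == null || upper.compareTo(max) > 0) ? upper : max;
>           }
>         }
>       }
>       System.out.printf("nulls=%d min=%s max=%s%n", nullCount, min, max);
>     }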
>
>
>    - If Spark's CBO is on and SparkSQLProperties.REPORT_COLUMN_STATS
>    <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java#L95>
>    is enabled, the NDV will be retrieved from the StatisticsFile, and the
>    min/max/null counts will be retrieved from the proposed table-level
>    column statistics file. These statistics will be used to construct the
>    Statistics object in estimateStatistics
>    <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L186>.
>    Spark then uses the statistics returned from estimateStatistics
>    <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala#L86>
>    for cost-based optimization; a configuration sketch follows below.
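>
> For reference, a minimal sketch of the session configuration this read path
> depends on (the Iceberg property name is the value of
> SparkSQLProperties.REPORT_COLUMN_STATS; the CBO flags are standard Spark
> configs):
>
>     import org.apache.spark.sql.SparkSession;
>
>     // Column stats only reach Spark's optimizer when CBO is enabled and
>     // Iceberg is configured to report them.
>     SparkSession spark = SparkSession.builder()
>         .config("spark.sql.cbo.enabled", "true")
>         .config("spark.sql.cbo.joinReorder.enabled", "true")
>         .config("spark.sql.iceberg.report-column-stats", "true")
>         .getOrCreate();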
>
> Please share your thoughts on this.
>
> Thanks,
>
> Huaxin
>
