Hi,

First of all, thank you, Huaxin, for raising this topic. It's important not
only for Spark but also for Trino.

Min, max, and null counts can be derived from manifests.
I am not saying that a query engine should derive them from manifests at
query time, but it definitely can.
If we want to pull a table-level summary of min, max, and null counts into the
stats Puffin file (which probably makes sense), my concern would be what the
exact mechanics for that should be. NDVs need to be derived from the data
files, but maybe we can do something smarter for min, max, and null counts.
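
For illustration, here is a rough sketch of the manifest-derived aggregation
(the class name is mine and error handling is elided; it uses the standard
Iceberg scan API, only aggregates null counts, and ignores delete files, but
min/max could be folded in the same way by decoding the per-file bound
buffers):

  import java.util.HashMap;
  import java.util.Map;
  import org.apache.iceberg.DataFile;
  import org.apache.iceberg.FileScanTask;
  import org.apache.iceberg.Table;
  import org.apache.iceberg.io.CloseableIterable;

  public class ManifestStatsSketch {
    // Sum per-column null counts over the data files of the current snapshot.
    public static Map<Integer, Long> nullCountsByFieldId(Table table) throws Exception {
      Map<Integer, Long> nullCounts = new HashMap<>();
      try (CloseableIterable<FileScanTask> tasks =
               table.newScan().includeColumnStats().planFiles()) {
        for (FileScanTask task : tasks) {
          DataFile file = task.file();
          if (file.nullValueCounts() != null) {
            file.nullValueCounts().forEach((fieldId, count) ->
                nullCounts.merge(fieldId, count, Long::sum));
          }
          // file.lowerBounds() / file.upperBounds() hold serialized per-column
          // bounds; Conversions.fromByteBuffer(type, buffer) would decode them
          // for a running table-level min/max.
        }
      }
      return nullCounts;
    }
  }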

Best,
Piotr



On Fri, 2 Aug 2024 at 20:47, Samrose Ahmed <samroseah...@gmail.com> wrote:

> Isn't this addressed by the partition statistics feature, or do you want
> to have one row for the entire table?
>
> On Fri, Aug 2, 2024, 10:47 AM huaxin gao <huaxin.ga...@gmail.com> wrote:
>
>> I would like to initiate a discussion on implementing a table-level
>> statistics file to store column statistics, specifically min, max, and null
>> counts. The original discussion can be found in this Slack thread:
>> https://apache-iceberg.slack.com/archives/C03LG1D563F/p1676395480005779.
>>
>> In Spark 3.4, I introduced Column Statistics
>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java#L33>
>> within the Statistics interface
>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Statistics.java#L38>,
>> enabling Iceberg to implement and report these metrics to Spark. As a
>> result, Spark can utilize the column statistics for cost-based optimization
>> (CBO), including join reordering.
>>
>> Here’s how the process operates:
>>
>>
>>    - Use the ANALYZE TABLE command to compute column statistics and
>>    store them appropriately. The SQL syntax is:
>>
>> ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS col1, col2, …
>>
>> This command triggers a ComputeTableStatsSparkAction that computes a data
>> sketch for NDV (number of distinct values) and saves it in the
>> StatisticsFile. Here is the ComputeTableStatsAction PR:
>> <https://github.com/apache/iceberg/pull/10288>
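>>
>> For example, once that action is available, the invocation could look roughly
>> like this (a sketch rather than a final API; "db.tbl" and the column names are
>> placeholders):
>>
>>   import org.apache.iceberg.StatisticsFile;
>>   import org.apache.iceberg.Table;
>>   import org.apache.iceberg.actions.ComputeTableStats;
>>   import org.apache.iceberg.spark.Spark3Util;
>>   import org.apache.iceberg.spark.actions.SparkActions;
>>
>>   // 'spark' is an existing SparkSession
>>   Table table = Spark3Util.loadIcebergTable(spark, "db.tbl");
>>   ComputeTableStats.Result result =
>>       SparkActions.get(spark)
>>           .computeTableStats(table)
>>           .columns("col1", "col2")   // columns to build NDV sketches for
>>           .execute();
>>   // The action writes a Puffin file and registers it in table metadata.
>>   StatisticsFile statsFile = result.statisticsFile();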
>>
>> Additionally, the ANALYZE TABLE command needs to calculate table-level
>> min, max, and null counts for columns like col1 and col2 using the manifest
>> files. These table-level statistics are then saved. Currently, we do not
>> have a designated location to store these table-level column statistics.
>> The StatisticsFile
>> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/StatisticsFile.java>,
>> being tightly coupled with the Puffin file, does not seem to be a suitable
>> repository for these column stats.
>>
>>
>>    - If Spark's CBO is on and SparkSQLProperties.REPORT_COLUMN_STATS
>>    
>> <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java#L95>
>>    is enabled, NDV will be retrieved from the StatisticsFile, and the
>>    min/max/null counts will also be retrieved from a table-level column
>>    statistics file. These statistics will be utilized to construct the
>>    Statistics object in estimateStatistics
>>    
>> <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L186>.
>>    Spark then uses the statistics returned from estimateStatistics
>>    
>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala#L86>
>>    for CBO (see the sketch after this list).
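>>
>> As a sketch of the reporting side (the class name here is illustrative; the
>> interfaces are Spark's org.apache.spark.sql.connector.read.Statistics and
>> org.apache.spark.sql.connector.read.colstats.ColumnStatistics), the
>> per-column piece would look roughly like:
>>
>>   import java.util.Optional;
>>   import java.util.OptionalLong;
>>   import org.apache.spark.sql.connector.read.colstats.ColumnStatistics;
>>
>>   // Illustrative holder for the values read back from the StatisticsFile
>>   // (NDV) and a table-level column stats file (min/max/null count).
>>   class ReportedColumnStats implements ColumnStatistics {
>>     private final long ndv;
>>     private final long nulls;
>>     private final Object min;
>>     private final Object max;
>>
>>     ReportedColumnStats(long ndv, long nulls, Object min, Object max) {
>>       this.ndv = ndv;
>>       this.nulls = nulls;
>>       this.min = min;
>>       this.max = max;
>>     }
>>
>>     @Override public OptionalLong distinctCount() { return OptionalLong.of(ndv); }
>>     @Override public OptionalLong nullCount() { return OptionalLong.of(nulls); }
>>     @Override public Optional<Object> min() { return Optional.ofNullable(min); }
>>     @Override public Optional<Object> max() { return Optional.ofNullable(max); }
>>   }
>>
>> estimateStatistics would then return a Statistics whose columnStats() maps
>> each NamedReference to one of these, alongside numRows and sizeInBytes.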
>>
>> Please share your thoughts on this.
>>
>> Thanks,
>>
>> Huaxin
>>