Thanks for starting this thread Huaxin,

The existing statistics, kept on a per-data-file basis, are definitely too
granular for use in planning/analysis-time query optimizations.
This has worked so far because tables have been relatively small, but from
what I've seen in the Trino community it is starting to become a problem
for some users.

However, I'm not sure that rolling the stats all the way up to the table
level is what we want to add next.
Would it be possible to extend the partition stats to include some of the
data currently tracked per data file?
To me, it seems like doing some amount of derivation at query time is okay,
as long as the time it takes to do the derivation doesn't increase
significantly as the table gets larger.
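
To make that concrete, here is a rough sketch of what a partition stats
entry extended with per-column aggregates might carry. The field names are
hypothetical, not a spec proposal; the existing entries already track the
partition tuple and row/file counts, and the three maps are the addition:

    import java.nio.ByteBuffer;
    import java.util.Map;
    import org.apache.iceberg.StructLike;

    // Hypothetical: today's partition stats entries track per-partition
    // row/file counts; these maps (keyed by column field id) would add
    // the per-column aggregates needed for min/max/null estimates.
    class ExtendedPartitionStatsEntry {
      StructLike partition;                   // partition tuple, as today
      int specId;
      long dataRecordCount;
      long dataFileCount;
      Map<Integer, ByteBuffer> lowerBounds;   // per-partition min values
      Map<Integer, ByteBuffer> upperBounds;   // per-partition max values
      Map<Integer, Long> nullValueCounts;     // per-partition null counts
    }

A reader would then fold the matching partitions' entries together at
planning time, which keeps the work proportional to the partition count
rather than the data file count.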

Partition-level stats also have the advantage of providing more accurate
estimates for queries that filter on the partition columns.

Looking forward to talking more about the project,
Alex Jo



On Fri, Aug 2, 2024 at 11:24 PM huaxin gao <huaxin.ga...@gmail.com> wrote:

> Thanks, Samrose and Piotr, for the discussion! This issue is not addressed
> by the partition statistics feature. What we need are table-level stats.
>
> Given that our primary goal in collecting statistics is performance
> optimization, I don't think deriving them at query execution time is a good
> approach. That's why I propose saving these metrics in the table-level stats
> file. I am thinking of reusing the existing aggregate pushdown mechanism to
> compute min, max, and null counts. We currently have
> MinAggregate
> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/MinAggregate.java>,
> MaxAggregate
> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/MaxAggregate.java>,
> CountStar
> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/CountStar.java>,
> and CountNonNull
> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/CountNonNull.java>.
> Null counts can be derived by subtracting CountNonNull from CountStar.
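>
> To sketch the idea (illustrative only; the real computation would flow
> through the existing aggregate pushdown path), the expressions and the
> null-count derivation would look roughly like this:
>
>     import java.util.Arrays;
>     import java.util.List;
>     import org.apache.iceberg.expressions.Expression;
>     import org.apache.iceberg.expressions.Expressions;
>
>     // The aggregate expressions the pushdown path already understands.
>     List<Expression> aggs = Arrays.asList(
>         Expressions.min("col1"),      // MinAggregate
>         Expressions.max("col1"),      // MaxAggregate
>         Expressions.countStar(),      // CountStar    -> total row count
>         Expressions.count("col1"));   // CountNonNull -> non-null count
>
>     // Once evaluated, the null count falls out of the two counts:
>     //   nullCount(col1) = countStar - countNonNull(col1)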
>
> Thanks,
> Huaxin
>
> On Fri, Aug 2, 2024 at 1:45 PM Piotr Findeisen <piotr.findei...@gmail.com>
> wrote:
>
>> Hi,
>>
>> First of all, thank you Huaxin for raising this topic. It's important for
>> Spark, but also for Trino.
>>
>> Min, max, and null counts can be derived from manifests.
>> I am not saying that a query engine should derive them from manifests at
>> query time, but it definitely can.
>> If we want to pull a table-level summary of min, max, and null counts into
>> the stats Puffin file (which probably makes sense), my concern would be
>> what the exact mechanics for that should be. NDVs need to be derived from the
>> data files, but maybe we can do something smarter for min, max, and null
>> counts.
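>>
>> For min, max, and null counts specifically, a rough sketch of the
>> manifest-based derivation (real Iceberg APIs, but simplified; missing
>> per-column metrics and delete files would need care):
>>
>>     import java.io.IOException;
>>     import java.util.HashMap;
>>     import java.util.Map;
>>     import org.apache.iceberg.DataFile;
>>     import org.apache.iceberg.FileScanTask;
>>     import org.apache.iceberg.Table;
>>     import org.apache.iceberg.io.CloseableIterable;
>>
>>     // Fold per-file null counts from manifest metadata into a table-level
>>     // map keyed by column field id. Min/max would fold file.lowerBounds()
>>     // and file.upperBounds() the same way, decoding the ByteBuffers with
>>     // Conversions.fromByteBuffer.
>>     static Map<Integer, Long> tableNullCounts(Table table) throws IOException {
>>       Map<Integer, Long> nullCounts = new HashMap<>();
>>       try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
>>         for (FileScanTask task : tasks) {
>>           DataFile file = task.file();
>>           Map<Integer, Long> fileNulls = file.nullValueCounts();
>>           if (fileNulls != null) {  // metrics may be absent per file
>>             fileNulls.forEach((id, n) -> nullCounts.merge(id, n, Long::sum));
>>           }
>>         }
>>       }
>>       return nullCounts;
>>     }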
>>
>> Best,
>> Piotr
>>
>>
>>
>> On Fri, 2 Aug 2024 at 20:47, Samrose Ahmed <samroseah...@gmail.com>
>> wrote:
>>
>>> Isn't this addressed by the partition statistics feature, or do you want
>>> to have one row for the entire table?
>>>
>>> On Fri, Aug 2, 2024, 10:47 AM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>
>>>> I would like to initiate a discussion on implementing a table-level
>>>> statistics file to store column statistics, specifically min, max, and null
>>>> counts. The original discussion can be found in this Slack thread:
>>>> https://apache-iceberg.slack.com/archives/C03LG1D563F/p1676395480005779
>>>> .
>>>>
>>>> In Spark 3.4, I introduced Column Statistics
>>>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java#L33>
>>>> within the Statistics interface
>>>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Statistics.java#L38>,
>>>> enabling Iceberg to implement and report these metrics to Spark. As a
>>>> result, Spark can utilize the column statistics for cost-based optimization
>>>> (CBO), including join reordering.
>>>>
>>>> Here’s how the process operates:
>>>>
>>>>
>>>>    - Use the ANALYZE TABLE command to compute column statistics and
>>>>    store them appropriately. The SQL syntax is:
>>>>
>>>> ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS col1, col2, ...
>>>>
>>>> This command triggers a ComputeTableStatsSparkAction to compute a data
>>>> sketch for NDV (number of distinct values) and save it in the
>>>> StatisticsFile. Here is the ComputeTableStatsAction PR:
>>>> <https://github.com/apache/iceberg/pull/10288>
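>>>>
>>>> For reference, usage of that action looks roughly like the following
>>>> (a sketch based on the PR above; exact names may still shift):
>>>>
>>>>     import org.apache.iceberg.StatisticsFile;
>>>>     import org.apache.iceberg.actions.ComputeTableStats;
>>>>     import org.apache.iceberg.spark.actions.SparkActions;
>>>>
>>>>     // Compute an NDV sketch for the requested columns and commit the
>>>>     // result to the table as a StatisticsFile (a Puffin file).
>>>>     ComputeTableStats.Result result =
>>>>         SparkActions.get(spark)
>>>>             .computeTableStats(table)
>>>>             .columns("col1", "col2")
>>>>             .execute();
>>>>     StatisticsFile statsFile = result.statisticsFile();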
>>>>
>>>> Additionally, the ANALYZE TABLE command needs to calculate table-level
>>>> min, max, and null counts for columns like col1 and col2 using the manifest
>>>> files. These table-level statistics are then saved. Currently, we do not
>>>> have a designated location to store these table-level column statistics.
>>>> The StatisticsFile
>>>> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/StatisticsFile.java>,
>>>> being tightly coupled with the Puffin file, does not seem to be a suitable
>>>> repository for these column stats.
>>>>
>>>>
>>>>    - If Spark's CBO is on and SparkSQLProperties.REPORT_COLUMN_STATS
>>>>    
>>>> <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java#L95>
>>>>    is enabled, NDV will be retrieved from the StatisticsFile, and the
>>>>    min/max/null counts will also be retrieved from a table-level column
>>>>    statistics file. These statistics will be utilized to construct the
>>>>    Statistics object in estimateStatistics
>>>>    
>>>> <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L186>.
>>>>    Spark then uses the statistics returned from estimateStatistics
>>>>    
>>>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala#L86>
>>>>    for cost-based optimization, as sketched below.
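>>>>
>>>> A hypothetical sketch of the shape this takes on the Spark side (the
>>>> literal values stand in for numbers read from the stats files):
>>>>
>>>>     import java.util.Collections;
>>>>     import java.util.Map;
>>>>     import java.util.Optional;
>>>>     import java.util.OptionalLong;
>>>>     import org.apache.spark.sql.connector.expressions.Expressions;
>>>>     import org.apache.spark.sql.connector.expressions.NamedReference;
>>>>     import org.apache.spark.sql.connector.read.Statistics;
>>>>     import org.apache.spark.sql.connector.read.colstats.ColumnStatistics;
>>>>
>>>>     // What estimateStatistics would hand back to Spark once min, max,
>>>>     // and null counts are readable from a table-level stats file.
>>>>     ColumnStatistics col1Stats = new ColumnStatistics() {
>>>>       @Override public OptionalLong distinctCount() { return OptionalLong.of(42); }
>>>>       @Override public Optional<Object> min() { return Optional.of(0L); }
>>>>       @Override public Optional<Object> max() { return Optional.of(1000L); }
>>>>       @Override public OptionalLong nullCount() { return OptionalLong.of(7); }
>>>>     };
>>>>
>>>>     Statistics stats = new Statistics() {
>>>>       @Override public OptionalLong sizeInBytes() { return OptionalLong.of(1L << 30); }
>>>>       @Override public OptionalLong numRows() { return OptionalLong.of(1000000L); }
>>>>       @Override public Map<NamedReference, ColumnStatistics> columnStats() {
>>>>         return Collections.singletonMap(Expressions.column("col1"), col1Stats);
>>>>       }
>>>>     };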
>>>>
>>>> Please share your thoughts on this.
>>>>
>>>> Thanks,
>>>>
>>>> Huaxin
>>>>