Just to give my two cents: not all tables have a partition definition, and table-level stats would benefit those tables. In addition, NDV might not be easy to populate from partition-level statistics.
Thanks,
Gang

On Tue, Aug 6, 2024 at 9:48 PM Xianjin YE <xian...@apache.org> wrote:

> Thanks for raising the discussion, Huaxin.
>
> I also think partition-level statistics file(s) are more useful and have an
> advantage over table-level stats. For instance:
> 1. It would be straightforward to support incremental stats computation for
> large tables, by recalculating only new or updated partitions.
> 2. Table-level stats can be built from partition-level stats easily. It
> might take some additional time to merge them, though.
> 3. A SparkScan usually involves only a subset of partitions when a
> `partition filter` is applied, so it would be more accurate to infer the
> stats from the selected partitions only.
>
> On Aug 6, 2024, at 02:42, Alexander Jo <alex...@starburstdata.com> wrote:
>
> Thanks for starting this thread, Huaxin.
>
> The existing statistics, kept on a per-data-file basis, are definitely too
> granular for use in planning/analysis-time query optimizations.
> It's worked so far, as tables have been relatively small, but from what
> I've seen in the Trino community it is starting to be a problem for some.
>
> However, I'm not sure that rolling the stats all the way up to the table
> level is what we want to add next.
> Would extending the partition stats to include some of the data present
> for data files be possible?
> To me, it seems like doing some amount of derivation at query time is
> okay, as long as the time it takes to do the derivation doesn't increase
> significantly as the table gets larger.
>
> Partition-level stats also have the advantage of being able to provide
> more accurate estimates for queries with filters on the partition columns.
>
> Looking forward to talking more about the project,
> Alex Jo
>
>
> On Fri, Aug 2, 2024 at 11:24 PM huaxin gao <huaxin.ga...@gmail.com> wrote:
>
>> Thanks, Samrose and Piotr, for the discussion! This issue is not addressed
>> by the partition statistics feature. What we need are table-level stats.
>>
>> Given that our primary goal in collecting statistics is performance
>> optimization, I believe it's not a good approach to derive these statistics
>> at query execution time. That's why I propose saving these metrics in the
>> table-level stats file. I am thinking of reusing the existing aggregate
>> pushdown mechanism to compute min, max, and null counts. We currently have
>> MinAggregate
>> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/MinAggregate.java>,
>> MaxAggregate
>> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/MaxAggregate.java>,
>> CountStar
>> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/CountStar.java>,
>> and CountNotNull
>> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/CountNonNull.java>.
>> Null counts can be derived from CountStar and CountNotNull.
>>
>> Thanks,
>> Huaxin
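As an aside, both Huaxin's note above (null counts falling out of CountStar and CountNotNull) and Piotr's point in the message quoted below (that min, max, and null counts can be derived from manifests) can be made concrete with a small roll-up over the per-file metrics that manifests already track. The sketch below is illustrative only and not part of the proposal: it assumes a loaded org.apache.iceberg.Table named `table`, uses only the public Iceberg Java scan-planning API, and the class and method names are made up for the example.

// A rough sketch: rolling per-file null counts from the manifests up to
// table-level totals. No data files are read; everything comes from the
// metrics stored in manifest entries.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

public class TableLevelNullCounts {

  // Sums the null-value counts recorded in each data file's metrics.
  // Row-level deletes are ignored, so the result is an upper bound for
  // tables that carry delete files.
  public static Map<String, Long> nullCounts(Table table) throws IOException {
    Map<String, Long> totals = new HashMap<>();
    try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
      for (FileScanTask task : tasks) {
        DataFile file = task.file();
        Map<Integer, Long> perFile = file.nullValueCounts();
        if (perFile == null) {
          continue; // metrics may be missing for some files
        }
        for (Map.Entry<Integer, Long> entry : perFile.entrySet()) {
          String column = table.schema().findColumnName(entry.getKey());
          totals.merge(column, entry.getValue(), Long::sum);
        }
      }
    }
    // Min/max could be merged the same way from lowerBounds()/upperBounds(),
    // decoding the ByteBuffers with Conversions.fromByteBuffer(type, buffer).
    return totals;
  }
}

Deriving this at query time means a full planFiles() pass over the manifests, which is exactly the planning-time cost a persisted table-level stats file would avoid on large tables.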
>>
>> On Fri, Aug 2, 2024 at 1:45 PM Piotr Findeisen <piotr.findei...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> First of all, thank you Huaxin for raising this topic. It's important
>>> for Spark, but also for Trino.
>>>
>>> Min, max, and null counts can be derived from manifests.
>>> I am not saying that a query engine should derive them from manifests at
>>> query time, but it definitely can.
>>> If we want to pull a table-level summary of min, max, and null counts
>>> into a stats Puffin file (which probably makes sense), my concern would
>>> be what the exact mechanics for that should be. NDVs need to be derived
>>> from the data files, but maybe we can do something smarter for min, max,
>>> and null counts.
>>>
>>> Best,
>>> Piotr
>>>
>>>
>>> On Fri, 2 Aug 2024 at 20:47, Samrose Ahmed <samroseah...@gmail.com> wrote:
>>>
>>>> Isn't this addressed by the partition statistics feature, or do you
>>>> want to have one row for the entire table?
>>>>
>>>> On Fri, Aug 2, 2024, 10:47 AM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>>
>>>>> I would like to initiate a discussion on implementing a table-level
>>>>> statistics file to store column statistics, specifically min, max, and
>>>>> null counts. The original discussion can be found in this Slack thread:
>>>>> https://apache-iceberg.slack.com/archives/C03LG1D563F/p1676395480005779
>>>>>
>>>>> In Spark 3.4, I introduced ColumnStatistics
>>>>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java#L33>
>>>>> within the Statistics interface
>>>>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Statistics.java#L38>,
>>>>> enabling Iceberg to implement and report these metrics to Spark. As a
>>>>> result, Spark can utilize the column statistics for cost-based
>>>>> optimization (CBO), including join reordering.
>>>>>
>>>>> Here's how the process operates:
>>>>>
>>>>> - Use the ANALYZE TABLE command to compute column statistics and
>>>>> store them appropriately. The SQL syntax is:
>>>>>
>>>>>   ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS col1, col2…
>>>>>
>>>>> This command will trigger a ComputeTableStatsSparkAction to compute a
>>>>> data sketch for NDV (number of distinct values) and save it in the
>>>>> StatisticsFile. Here is the ComputeTableStatsAction PR:
>>>>> <https://github.com/apache/iceberg/pull/10288>
>>>>>
>>>>> Additionally, the ANALYZE TABLE command needs to calculate table-level
>>>>> min, max, and null counts for columns like col1 and col2 using the
>>>>> manifest files. These table-level statistics are then saved. Currently,
>>>>> we do not have a designated location to store these table-level column
>>>>> statistics. The StatisticsFile
>>>>> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/StatisticsFile.java>,
>>>>> being tightly coupled with the Puffin file, does not seem to be a
>>>>> suitable repository for these column stats.
>>>>>
>>>>> - If Spark's CBO is on and SparkSQLProperties.REPORT_COLUMN_STATS
>>>>> <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java#L95>
>>>>> is enabled, NDV will be retrieved from the StatisticsFile, and the
>>>>> min/max/null counts will also be retrieved from a table-level column
>>>>> statistics file. These statistics will be utilized to construct the
>>>>> Statistics object in estimateStatistics
>>>>> <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L186>.
>>>>> Spark then employs the statistics returned from estimateStatistics
>>>>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala#L86>
>>>>> for cost-based optimization (CBO).
>>>>>
>>>>> Please share your thoughts on this.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Huaxin
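On the first bullet in the proposal above: the linked ComputeTableStatsAction PR exposes the NDV computation as a Spark action. A minimal sketch of invoking it directly follows; it assumes the action API from that PR (exact method names may differ between releases), and `spark` and `table` are assumed to be an existing SparkSession and a loaded Iceberg table.

import org.apache.iceberg.Table;
import org.apache.iceberg.actions.ComputeTableStats;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class ComputeNdvStats {
  // Computes NDV sketches for the listed columns and writes them to a
  // Puffin-backed StatisticsFile registered in table metadata. The column
  // names mirror the col1/col2 placeholders in the ANALYZE TABLE example.
  public static void run(SparkSession spark, Table table) {
    ComputeTableStats.Result result =
        SparkActions.get(spark)
            .computeTableStats(table)
            .columns("col1", "col2")
            .execute();
    // result references the StatisticsFile that was written, if any.
  }
}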
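And on the second bullet (the read/report side): for Spark's CBO to use these numbers, estimateStatistics has to hand back objects implementing the Spark 3.4+ DataSource V2 interfaces linked above. The sketch below shows only the shape of that contract; it is not the actual SparkScan implementation, and the class names, constructors, and placeholder values are illustrative assumptions.

import java.util.Map;
import java.util.Optional;
import java.util.OptionalLong;
import org.apache.spark.sql.connector.expressions.Expressions;
import org.apache.spark.sql.connector.expressions.NamedReference;
import org.apache.spark.sql.connector.read.Statistics;
import org.apache.spark.sql.connector.read.colstats.ColumnStatistics;

// Per-column stats: NDV from the Puffin StatisticsFile, min/max/null count
// from the proposed table-level column stats file.
class ReportedColumnStats implements ColumnStatistics {
  private final long ndv;
  private final long nullCount;
  private final Object min;
  private final Object max;

  ReportedColumnStats(long ndv, long nullCount, Object min, Object max) {
    this.ndv = ndv;
    this.nullCount = nullCount;
    this.min = min;
    this.max = max;
  }

  @Override public OptionalLong distinctCount() { return OptionalLong.of(ndv); }
  @Override public OptionalLong nullCount() { return OptionalLong.of(nullCount); }
  @Override public Optional<Object> min() { return Optional.ofNullable(min); }
  @Override public Optional<Object> max() { return Optional.ofNullable(max); }
}

// Table-level stats returned from estimateStatistics when CBO and
// REPORT_COLUMN_STATS are enabled.
class ReportedStatistics implements Statistics {
  private final long numRows;
  private final long sizeInBytes;
  private final Map<NamedReference, ColumnStatistics> columnStats;

  ReportedStatistics(long numRows, long sizeInBytes,
                     Map<NamedReference, ColumnStatistics> columnStats) {
    this.numRows = numRows;
    this.sizeInBytes = sizeInBytes;
    this.columnStats = columnStats;
  }

  @Override public OptionalLong numRows() { return OptionalLong.of(numRows); }
  @Override public OptionalLong sizeInBytes() { return OptionalLong.of(sizeInBytes); }
  @Override public Map<NamedReference, ColumnStatistics> columnStats() { return columnStats; }
}

// Example wiring with placeholder numbers:
//   Map.of(Expressions.column("col1"), new ReportedColumnStats(1_000L, 5L, 0, 99))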