Thanks Alexander, Xianjin, Gang and Anton for your valuable insights!

Regarding deriving min/max values on the fly, I currently don't have a good
algorithm. I rely on iterating through FileScanTask objects in memory to
aggregate the results, which leads me to favor pre-calculating the min/max
values.
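
Concretely, the in-memory aggregation I have in mind looks roughly like the
sketch below (illustrative only, not a proposed implementation; it assumes a
loaded Table handle and a primitive column, skips files that are missing
metrics, and ignores that string/binary bounds may be truncated, so the
result is an estimate rather than an exact answer):

import java.nio.ByteBuffer;
import java.util.Comparator;
import java.util.Map;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.types.Comparators;
import org.apache.iceberg.types.Conversions;
import org.apache.iceberg.types.Type;
import org.apache.iceberg.types.Types;

public class OnTheFlyColumnBounds {
  // Folds per-file bounds from planned FileScanTasks into one
  // table-level min/max/null count for a single column.
  static void aggregate(Table table, String column) throws Exception {
    Types.NestedField field = table.schema().findField(column);
    int fieldId = field.fieldId();
    Type.PrimitiveType type = field.type().asPrimitiveType();
    Comparator<Object> cmp = Comparators.forType(type);

    Object min = null;
    Object max = null;
    long nullCount = 0L;

    try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
      for (FileScanTask task : tasks) {
        Map<Integer, ByteBuffer> lowers = task.file().lowerBounds();
        Map<Integer, ByteBuffer> uppers = task.file().upperBounds();
        Map<Integer, Long> nulls = task.file().nullValueCounts();

        Object lo = lowers == null || lowers.get(fieldId) == null
            ? null : Conversions.fromByteBuffer(type, lowers.get(fieldId));
        Object hi = uppers == null || uppers.get(fieldId) == null
            ? null : Conversions.fromByteBuffer(type, uppers.get(fieldId));

        if (lo != null && (min == null || cmp.compare(lo, min) < 0)) {
          min = lo;
        }
        if (hi != null && (max == null || cmp.compare(hi, max) > 0)) {
          max = hi;
        }
        if (nulls != null && nulls.get(fieldId) != null) {
          nullCount += nulls.get(fieldId);
        }
      }
    }
    System.out.printf("%s: min=%s max=%s nullCount=%d%n", column, min, max, nullCount);
  }
}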

I have a slight preference for table-level stats over partition-level
stats. Given that we already maintain table-level NDV, it seems consistent
to include corresponding table-level stats for min/max values. When
computing the table-level stats, any partition filters have already removed
the unwanted partitions during planning, so, in my view, partition-level
stats do not have an advantage in this context. For incremental updates,
even with table-level stats, I think we can still calculate stats for only
the new or updated partitions and then update the table-level stats.
However, I am open to other solutions.
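
For what it's worth, the merge itself is cheap if per-partition aggregates
are kept around (even just internally): recompute only the changed
partitions, then fold all partition aggregates back into the table-level
value, since min/max cannot simply be "subtracted" out. A hypothetical
sketch, with names that don't exist in Iceberg today:

// Hypothetical helper type for illustration only (not an Iceberg API).
record ColumnStats<T extends Comparable<T>>(T min, T max, long nullCount) {
  ColumnStats<T> merge(ColumnStats<T> other) {
    return new ColumnStats<>(
        min.compareTo(other.min()) <= 0 ? min : other.min(),
        max.compareTo(other.max()) >= 0 ? max : other.max(),
        nullCount + other.nullCount());
  }
}
// Table-level stats = fold of the per-partition values, e.g.
// partitionStats.values().stream().reduce(ColumnStats::merge)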

Thanks,
Huaxin


On Tue, Aug 6, 2024 at 10:40 AM Anton Okolnychyi <aokolnyc...@gmail.com>
wrote:

> I'd like to entertain the idea of deriving min/max values on the fly to
> understand our baseline. What will the algorithm for that look like? I
> assume the naive approach will be to keep min/max stats for selected
> columns while planning (as opposed to discarding them upon filtering) and
> then iterate through FileScanTask objects in memory to aggregate the
> results?
>
> - Anton
>
> On Tue, Aug 6, 2024 at 08:33 Gang Wu <ust...@gmail.com> wrote:
>
>> Just giving my two cents: not all tables have a partition definition, and
>> table-level stats would benefit those tables. In addition, NDV might not
>> be easily populated from partition-level statistics.
>>
>> Thanks,
>> Gang
>>
>> On Tue, Aug 6, 2024 at 9:48 PM Xianjin YE <xian...@apache.org> wrote:
>>
>>> Thanks for raising the discussion Huaxin.
>>>
>>> I also think partition-level statistics file(s) are more useful and have
>>> an advantage over table-level stats. For instance:
>>> 1. It would be straightforward to support incremental stats computation
>>> for large tables, by recalculating only the new or updated partitions.
>>> 2. Table-level stats could be built from partition-level stats easily.
>>> It might take some additional time to merge, though.
>>> 3. A SparkScan usually involves only part of the partitions when a
>>> `partition filter` is applied, so it would be more accurate to infer the
>>> stats from the selected partitions only.
>>>
>>> On Aug 6, 2024, at 02:42, Alexander Jo <alex...@starburstdata.com>
>>> wrote:
>>>
>>> Thanks for starting this thread Huaxin,
>>>
>>> The existing statistics, kept on a per-data-file basis, are definitely
>>> too granular for use in planning/analysis-time query optimizations.
>>> It has worked so far because tables have been relatively small, but from
>>> what I've seen in the Trino community it is starting to become a problem
>>> for some.
>>>
>>> However, I'm not sure that rolling the stats all the way up to the table
>>> level is what we want to add next.
>>> Would it be possible to extend the partition stats to include some of the
>>> data currently kept per data file?
>>> To me, it seems like doing some amount of derivation at query time is
>>> okay, as long as the time it takes to do the derivation doesn't increase
>>> significantly as the table gets larger.
>>>
>>> Partition-level stats also have the advantage of being able to provide
>>> more accurate estimates for queries with filters on the partition columns.
>>>
>>> Looking forward to talking more about the project,
>>> Alex Jo
>>>
>>>
>>>
>>> On Fri, Aug 2, 2024 at 11:24 PM huaxin gao <huaxin.ga...@gmail.com>
>>> wrote:
>>>
>>>> Thanks, Samrose and Piotr, for the discussion! This issue is not
>>>> addressed by the partition statistics feature. What we need are
>>>> table-level stats.
>>>>
>>>> Given that our primary goal in collecting statistics is performance
>>>> optimization, I believe deriving these statistics at query execution time
>>>> is not a good approach. That's why I propose saving these metrics in the
>>>> table-level stats file. I am thinking of reusing the existing aggregate
>>>> pushdown mechanism to compute min, max, and null counts. We currently have
>>>> MinAggregate
>>>> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/MinAggregate.java>,
>>>> MaxAggregate
>>>> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/MaxAggregate.java>,
>>>> CountStar
>>>> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/CountStar.java>,
>>>> and CountNotNull
>>>> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/CountNonNull.java>.
>>>> Null counts can be derived from CountStar and CountNotNull.
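>>>>
>>>> (For illustration: for a column c, nullCount(c) = CountStar() -
>>>> CountNotNull(c), so storing both aggregates, or just their difference,
>>>> is enough to populate the table-level null count.)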
>>>>
>>>> Thanks,
>>>> Huaxin
>>>>
>>>> On Fri, Aug 2, 2024 at 1:45 PM Piotr Findeisen <
>>>> piotr.findei...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> First of all, thank you Huaxin for raising this topic. It's important
>>>>> for Spark, but also for Trino.
>>>>>
>>>>> Min, max, and null counts can be derived from manifests.
>>>>> I am not saying that a query engine should derive them from manifests
>>>>> at query time, but it definitely can.
>>>>> If we want to pull a table-level summary of min, max, and null counts
>>>>> into the stats Puffin file (which probably makes sense), my concern would
>>>>> be what the exact mechanics for that should be. NDVs need to be derived
>>>>> from the data files, but maybe we can do something smarter for min, max,
>>>>> and null counts.
>>>>>
>>>>> Best,
>>>>> Piotr
>>>>>
>>>>>
>>>>>
>>>>> On Fri, 2 Aug 2024 at 20:47, Samrose Ahmed <samroseah...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Isn't this addressed by the partition statistics feature, or do you
>>>>>> want to have one row for the entire table?
>>>>>>
>>>>>> On Fri, Aug 2, 2024, 10:47 AM huaxin gao <huaxin.ga...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I would like to initiate a discussion on implementing a table-level
>>>>>>> statistics file to store column statistics, specifically min, max, and 
>>>>>>> null
>>>>>>> counts. The original discussion can be found in this Slack thread:
>>>>>>> https://apache-iceberg.slack.com/archives/C03LG1D563F/p1676395480005779
>>>>>>> .
>>>>>>>
>>>>>>> In Spark 3.4, I introduced Column Statistics
>>>>>>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java#L33>
>>>>>>> within the Statistics interface
>>>>>>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Statistics.java#L38>,
>>>>>>> enabling Iceberg to implement and report these metrics to Spark. As a
>>>>>>> result, Spark can utilize the column statistics for cost-based 
>>>>>>> optimization
>>>>>>> (CBO), including join reordering.
>>>>>>>
>>>>>>> Here’s how the process operates:
>>>>>>>
>>>>>>>
>>>>>>>    - Use the ANALYZE TABLE command to compute column statistics and
>>>>>>>    store them appropriately. The SQL syntax is:
>>>>>>>
>>>>>>>
>>>>>>> ANALYZE TABLE table_identifier COMPUTE STATISTICS FOR COLUMNS col1, col2…
>>>>>>>
>>>>>>> This command will trigger a ComputeTableStatsSparkAction to compute
>>>>>>> a data sketch for NDV (number of distinct values) and save it in the
>>>>>>> StatisticsFile. Here is the ComputeTableStatsAction PR:
>>>>>>> <https://github.com/apache/iceberg/pull/10288>
>>>>>>>
>>>>>>> Additionally, the ANALYZE TABLE command needs to calculate
>>>>>>> table-level min, max, and null counts for columns like col1 and col2 
>>>>>>> using
>>>>>>> the manifest files. These table-level statistics are then saved. 
>>>>>>> Currently,
>>>>>>> we do not have a designated location to store these table-level column
>>>>>>> statistics. The StatisticsFile
>>>>>>> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/StatisticsFile.java>,
>>>>>>> being tightly coupled with the Puffin file, does not seem to be a 
>>>>>>> suitable
>>>>>>> repository for these column stats.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>    - If Spark's CBO is on and SparkSQLProperties.REPORT_COLUMN_STATS
>>>>>>>    
>>>>>>> <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java#L95>
>>>>>>>    is enabled, NDV will be retrieved from the StatisticsFile, and the
>>>>>>>    min/max/null counts will also be retrieved from a table-level column
>>>>>>>    statistics file. These statistics will be utilized to construct the
>>>>>>>    Statistics object in estimateStatistics
>>>>>>>    
>>>>>>> <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L186>.
>>>>>>>    Spark then employs the statistics returned from
>>>>>>>    estimateStatistics
>>>>>>>    
>>>>>>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala#L86>
>>>>>>>    for cost-based optimization (CBO).
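>>>>>>>
>>>>>>> As an illustration only (not the actual Iceberg implementation, and the
>>>>>>> class name below is made up), the retrieved values could be surfaced
>>>>>>> through the DSv2 ColumnStatistics interface roughly like this:
>>>>>>>
>>>>>>> import java.util.Optional;
>>>>>>> import java.util.OptionalLong;
>>>>>>> import org.apache.spark.sql.connector.read.colstats.ColumnStatistics;
>>>>>>>
>>>>>>> // One column's table-level stats, in the shape Spark's CBO consumes.
>>>>>>> class SketchColumnStatistics implements ColumnStatistics {
>>>>>>>   private final long ndv;        // from the Puffin StatisticsFile
>>>>>>>   private final Object min;      // from the table-level stats file
>>>>>>>   private final Object max;
>>>>>>>   private final long nullCount;
>>>>>>>
>>>>>>>   SketchColumnStatistics(long ndv, Object min, Object max, long nullCount) {
>>>>>>>     this.ndv = ndv;
>>>>>>>     this.min = min;
>>>>>>>     this.max = max;
>>>>>>>     this.nullCount = nullCount;
>>>>>>>   }
>>>>>>>
>>>>>>>   @Override
>>>>>>>   public OptionalLong distinctCount() { return OptionalLong.of(ndv); }
>>>>>>>
>>>>>>>   @Override
>>>>>>>   public Optional<Object> min() { return Optional.ofNullable(min); }
>>>>>>>
>>>>>>>   @Override
>>>>>>>   public Optional<Object> max() { return Optional.ofNullable(max); }
>>>>>>>
>>>>>>>   @Override
>>>>>>>   public OptionalLong nullCount() { return OptionalLong.of(nullCount); }
>>>>>>> }
>>>>>>>
>>>>>>> estimateStatistics would then return a Statistics object whose
>>>>>>> columnStats() map carries one such entry per analyzed column.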
>>>>>>>
>>>>>>>
>>>>>>> Please share your thoughts on this.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Huaxin
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>
