Hi All,

Thank you for the interesting discussion so far, and the many viewpoints shared!
> Not all tables have a partition definition and table-level stats would benefit these tables

Agreed that tables don't always have partitions. Current partition stats are appropriate for partitioned tables only, mainly because they contain information that's available in table metadata anyway. If we evolve partition stats to also include column min/maxes, I would advocate for partition stats to be applicable to unpartitioned tables as well, perhaps just by omitting the `partition` field (field 1) in such a case.

The partition-level information has the important advantage that it allows a query engine to come up with the best estimate for the *current query*, taking the query's effective predicates into account and pruning the information accordingly. So sooner or later we will want to add min/max information to partition stats. (Is it WIP already?)

Since partition stats are Parquet, deriving table-level min/max information from partition-level min/max information is as simple as reading the Parquet file footer.

> Given that we already maintain table-level NDV, it seems consistent to include corresponding table-level stats for min/max values.

That's reasonable thinking, and consistency is important. There is, however, an important difference between NDV and min/max. The table-level NDV cannot be easily aggregated from partition-level NDVs, because that quickly turns out to be a CPU-intensive process. Keeping partition-level NDV information (a number, and a sketch) is a good idea, but it's suitable only for queries which read one to a few partitions. Queries reading a large number of partitions would still prefer to use table-level NDV information (if available).

Best
Piotr

On Wed, 7 Aug 2024 at 05:07, huaxin gao <huaxin.ga...@gmail.com> wrote:

> Thanks Alexander, Xianjin, Gang and Anton for your valuable insights!
>
> Regarding deriving min/max values on the fly, I currently don't have a good algorithm. I rely on iterating through FileScanTask objects in memory to aggregate results, which leads me to favor pre-calculating min/max values.
>
> I have a slight preference for table-level stats over partition-level stats. Given that we already maintain table-level NDV, it seems consistent to include corresponding table-level stats for min/max values. When computing the table-level stats, if partition filters are applied, we have already filtered out unwanted partitions, which, in my view, does not give partition-level stats an advantage in this context. For incremental updates, even with table-level stats, I think we can still have a way to calculate stats for only the new or updated partitions and then update the table-level stats. However, I am open to other solutions.
>
> Thanks,
> Huaxin
>
> On Tue, Aug 6, 2024 at 10:40 AM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:
>
>> I'd like to entertain the idea of deriving min/max values on the fly to understand our baseline. What will the algorithm for that look like? I assume the naive approach will be to keep min/max stats for selected columns while planning (as opposed to discarding them upon filtering) and then iterate through FileScanTask objects in memory to aggregate the results?
>>
>> - Anton
>>
>> On Tue, 6 Aug 2024 at 08:33, Gang Wu <ust...@gmail.com> wrote:
>>
>>> Just to give my two cents: not all tables have a partition definition, and table-level stats would benefit these tables. In addition, NDV might not be easily populated from partition-level statistics.
>>>
>>> Thanks,
>>> Gang
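A minimal sketch of the naive, on-the-fly aggregation Anton and Huaxin describe above: iterate the planned FileScanTask objects and fold in the per-file bounds that are already kept in the manifests. It assumes the standard Iceberg Java API; the wrapper class and method names are illustrative only, and real code would also have to handle non-primitive columns, truncated bounds, and files with missing metrics.

    import java.nio.ByteBuffer;
    import java.util.Map;
    import org.apache.iceberg.FileScanTask;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.io.CloseableIterable;
    import org.apache.iceberg.types.Conversions;
    import org.apache.iceberg.types.Types;

    // Illustrative only: derive a table-level min/max for one column by scanning
    // the planned file tasks and decoding the per-file bounds from the manifests.
    class NaiveMinMaxAggregator {

      static void printMinMax(Table table, String columnName) throws Exception {
        Types.NestedField field = table.schema().findField(columnName);
        int fieldId = field.fieldId();

        Comparable<Object> min = null;
        Comparable<Object> max = null;

        try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
          for (FileScanTask task : tasks) {
            Map<Integer, ByteBuffer> lowers = task.file().lowerBounds();
            Map<Integer, ByteBuffer> uppers = task.file().upperBounds();
            if (lowers == null || uppers == null
                || lowers.get(fieldId) == null || uppers.get(fieldId) == null) {
              // Metrics may be disabled for this column; a real implementation
              // would have to treat the overall result as unknown in that case.
              continue;
            }

            // Decode the serialized bounds (works for primitive column types).
            Comparable<Object> lower = Conversions.fromByteBuffer(field.type(), lowers.get(fieldId));
            Comparable<Object> upper = Conversions.fromByteBuffer(field.type(), uppers.get(fieldId));

            if (min == null || lower.compareTo(min) < 0) {
              min = lower;
            }
            if (max == null || upper.compareTo(max) > 0) {
              max = upper;
            }
          }
        }

        System.out.println(columnName + " min=" + min + " max=" + max);
      }
    }

The cost of this baseline grows with the number of live data files, which is exactly why the thread leans toward precomputing and storing the result.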
>>> On Tue, Aug 6, 2024 at 9:48 PM Xianjin YE <xian...@apache.org> wrote:
>>>
>>>> Thanks for raising the discussion, Huaxin.
>>>>
>>>> I also think partition-level statistics file(s) are more useful and have an advantage over table-level stats. For instance:
>>>> 1. It would be straightforward to support incremental stats computation for large tables, by recalculating new or updated partitions only.
>>>> 2. Table-level stats could be built from partition-level stats easily. It might take some additional time to merge, though.
>>>> 3. SparkScan usually involves only part of the partitions when a `partition filter` is applied, so it would be more accurate to infer the stats from the selected partitions only.
>>>>
>>>> On Aug 6, 2024, at 02:42, Alexander Jo <alex...@starburstdata.com> wrote:
>>>>
>>>> Thanks for starting this thread, Huaxin.
>>>>
>>>> The existing statistics, on a per-data-file basis, are definitely too granular for use in planning/analysis-time query optimizations. It has worked so far, as tables have been relatively small, but from what I've seen in the Trino community it is starting to be a problem for some.
>>>>
>>>> However, I'm not sure that rolling the stats all the way up to the table level is what we want to add next. Would extending the partition stats to include some of the data present for data files be possible? To me, it seems like doing some amount of derivation at query time is okay, as long as the time it takes to do the derivation doesn't increase significantly as the table gets larger.
>>>>
>>>> Partition-level stats also have the advantage of being able to provide more accurate estimates for queries with filters on the partition columns.
>>>>
>>>> Looking forward to talking more about the project,
>>>> Alex Jo
>>>>
>>>> On Fri, Aug 2, 2024 at 11:24 PM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>>
>>>>> Thanks, Samrose and Piotr, for the discussion! This issue is not addressed by the partition statistics feature. What we need are table-level stats.
>>>>>
>>>>> Given that our primary goal in collecting statistics is performance optimization, I believe it's not a good approach to derive these statistics at query execution time. That's why I propose saving these metrics in the table-level stats file. I am thinking of reusing the existing aggregate pushdown mechanism to compute min, max, and null counts. We currently have MinAggregate <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/MinAggregate.java>, MaxAggregate <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/MaxAggregate.java>, CountStar <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/CountStar.java>, and CountNonNull <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/CountNonNull.java>. Null counts can be derived from CountStar and CountNonNull.
>>>>>
>>>>> Thanks,
>>>>> Huaxin
>>>>>
>>>>> On Fri, Aug 2, 2024 at 1:45 PM Piotr Findeisen <piotr.findei...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> First of all, thank you Huaxin for raising this topic. It's important for Spark, but also for Trino.
>>>>>>
>>>>>> Min, max, and null counts can be derived from manifests. I am not saying that a query engine should derive them from manifests at query time, but it definitely can. If we want to pull a table-level summary of min, max, and null counts into the stats Puffin file (which probably makes sense), my concern would be what the exact mechanics for that should be. NDVs need to be derived from the data files, but maybe we can do something smarter for min, max, and null counts.
>>>>>>
>>>>>> Best,
>>>>>> Piotr
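A rough sketch of what "derived from manifests" could look like for null counts, assuming the Iceberg Java API; the class name is made up, row-level deletes are ignored, and the method simply gives up when any data file is missing the metric.

    import java.util.Map;
    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.ManifestFile;
    import org.apache.iceberg.ManifestFiles;
    import org.apache.iceberg.ManifestReader;
    import org.apache.iceberg.Table;

    // Illustrative only: sum per-file null value counts for one column from the
    // current snapshot's data manifests, without touching any data files.
    class ManifestNullCounts {

      // Returns the total null count, or -1 if it cannot be derived.
      static long nullCount(Table table, String columnName) throws Exception {
        int fieldId = table.schema().findField(columnName).fieldId();
        long nulls = 0;

        for (ManifestFile manifest : table.currentSnapshot().dataManifests(table.io())) {
          try (ManifestReader<DataFile> reader = ManifestFiles.read(manifest, table.io())) {
            for (DataFile file : reader) {
              Map<Integer, Long> counts = file.nullValueCounts();
              if (counts == null || counts.get(fieldId) == null) {
                return -1; // metrics are missing for this file
              }
              nulls += counts.get(fieldId);
            }
          }
        }
        return nulls;
      }
    }

Min and max can be folded the same way from the per-file bounds, which is what makes storing a precomputed table-level summary attractive: the work above is done once at ANALYZE time instead of on every query.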
>>>>>> On Fri, 2 Aug 2024 at 20:47, Samrose Ahmed <samroseah...@gmail.com> wrote:
>>>>>>
>>>>>>> Isn't this addressed by the partition statistics feature, or do you want to have one row for the entire table?
>>>>>>>
>>>>>>> On Fri, Aug 2, 2024, 10:47 AM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I would like to initiate a discussion on implementing a table-level statistics file to store column statistics, specifically min, max, and null counts. The original discussion can be found in this Slack thread: https://apache-iceberg.slack.com/archives/C03LG1D563F/p1676395480005779 .
>>>>>>>>
>>>>>>>> In Spark 3.4, I introduced ColumnStatistics <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java#L33> within the Statistics interface <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Statistics.java#L38>, enabling Iceberg to implement and report these metrics to Spark. As a result, Spark can utilize the column statistics for cost-based optimization (CBO), including join reordering.
>>>>>>>>
>>>>>>>> Here’s how the process operates:
>>>>>>>>
>>>>>>>> - Use the ANALYZE TABLE command to compute column statistics and store them appropriately. The SQL syntax is:
>>>>>>>>
>>>>>>>> ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS col1, col2…
>>>>>>>>
>>>>>>>> This command will trigger a ComputeTableStatsSparkAction to compute a data sketch for NDV (number of distinct values) and save it in the StatisticsFile. Here is the ComputeTableStatsAction PR <https://github.com/apache/iceberg/pull/10288>.
>>>>>>>>
>>>>>>>> Additionally, the ANALYZE TABLE command needs to calculate table-level min, max, and null counts for columns like col1 and col2 using the manifest files. These table-level statistics are then saved. Currently, we do not have a designated location to store these table-level column statistics. The StatisticsFile <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/StatisticsFile.java>, being tightly coupled with the Puffin file, does not seem to be a suitable repository for these column stats.
>>>>>>>>
>>>>>>>> - If Spark's CBO is on and SparkSQLProperties.REPORT_COLUMN_STATS <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java#L95> is enabled, NDV will be retrieved from the StatisticsFile, and the min/max/null counts will also be retrieved from a table-level column statistics file.
>>>>>>>> These statistics will be utilized to construct the Statistics object in estimateStatistics <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L186>. Spark then employs the statistics returned from estimateStatistics <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala#L86> for cost-based optimization (CBO).
>>>>>>>>
>>>>>>>> Please share your thoughts on this.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Huaxin
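To make the reporting side described above concrete: once the table-level values exist, they could be surfaced through the Spark 3.4 ColumnStatistics interface linked earlier. The class below is a rough sketch with made-up names, not the shape of the actual Iceberg implementation; avgLen, maxLen, and histogram keep their empty defaults.

    import java.util.Optional;
    import java.util.OptionalLong;
    import org.apache.spark.sql.connector.read.colstats.ColumnStatistics;

    // Illustrative only: a simple carrier for table-level column stats that a
    // connector could return from its DSv2 Statistics#columnStats() map.
    class TableLevelColumnStats implements ColumnStatistics {
      private final OptionalLong distinctCount; // e.g. from the Puffin NDV sketch
      private final Optional<Object> min;       // e.g. from the proposed table-level stats file
      private final Optional<Object> max;
      private final OptionalLong nullCount;

      TableLevelColumnStats(long ndv, Object min, Object max, long nullCount) {
        this.distinctCount = OptionalLong.of(ndv);
        this.min = Optional.ofNullable(min);
        this.max = Optional.ofNullable(max);
        this.nullCount = OptionalLong.of(nullCount);
      }

      @Override
      public OptionalLong distinctCount() {
        return distinctCount;
      }

      @Override
      public Optional<Object> min() {
        return min;
      }

      @Override
      public Optional<Object> max() {
        return max;
      }

      @Override
      public OptionalLong nullCount() {
        return nullCount;
      }
    }

With the CBO and REPORT_COLUMN_STATS enabled, objects like this would be what the per-column entries in the reported Statistics carry into Spark's join reordering and other cost-based decisions.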