First of all, thanks a lot Huaxin for starting an important proposal and thread!
A lot of important points have already been discussed. My thoughts were also tilting towards partition-level stats, which Piotr, Alex, Anton and a few others have mentioned as well. IMO, partition-level stats might be a good middle ground. Also, before that, does it make sense to try to derive the stats (in the current state) and see whether it adds a lot of time?

Regards,
Manish

On Wed, Aug 7, 2024 at 3:38 AM Piotr Findeisen <piotr.findei...@gmail.com> wrote:

> Hi All,
>
> Thank you for the interesting discussion so far, and the many viewpoints shared!
>
> > Not all tables have a partition definition and table-level stats would benefit these tables
>
> Agreed that tables do not always have partitions.
> Current partition stats are appropriate only for partitioned tables, mainly because they contain information that's available in table metadata anyway.
> If we evolve partition stats to also include column min/maxes, I would advocate for partition stats to be applicable to unpartitioned tables as well, perhaps just by omitting the `partition` field in such a case.
>
> The partition-level information has the important advantage that it allows a query engine to come up with the best estimate for the *current query*, taking the query's effective predicates into account and pruning the information. So sooner or later we will want to add min/max information to partition stats. (is it WIP already?)
> Since partition stats are Parquet, deriving table-level min/max information from partition-level min/max information is as simple as reading the Parquet file footer.
>
> > Given that we already maintain table-level NDV, it seems consistent to include corresponding table-level stats for min/max values.
>
> That's reasonable thinking, and consistency is important.
> There is, however, an important difference between NDV and min/max. The table-level NDV cannot be easily aggregated from partition-level NDV, because it quickly turns out to be a CPU-intensive process.
> Keeping partition-level NDV information (number, and a sketch) is a good idea, but it's suitable only for queries which read one to a few partitions. Queries reading a large number of partitions would still prefer to use table-level NDV information (if available).
>
> Best
> Piotr
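To make Piotr's last point concrete, here is a rough sketch, assuming parquet-mr is on the classpath and using a placeholder file path, of reading per-column min/max/null-count information straight out of a Parquet footer without scanning any data pages (merging values across row groups would still need type-aware comparison):

    // Illustrative only: prints the column-chunk statistics stored in a
    // Parquet footer; no data pages are read.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.column.statistics.Statistics;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
    import org.apache.parquet.hadoop.util.HadoopInputFile;

    public class FooterStatsSketch {
      public static void main(String[] args) throws IOException {
        // e.g. the path of a partition-stats Parquet file (placeholder)
        Path statsFile = new Path(args[0]);
        try (ParquetFileReader reader =
            ParquetFileReader.open(HadoopInputFile.fromPath(statsFile, new Configuration()))) {
          for (BlockMetaData block : reader.getFooter().getBlocks()) {
            for (ColumnChunkMetaData column : block.getColumns()) {
              Statistics<?> stats = column.getStatistics();
              if (stats != null && !stats.isEmpty()) {
                System.out.printf("%s: min=%s max=%s nulls=%d%n",
                    column.getPath().toDotString(),
                    stats.genericGetMin(),
                    stats.genericGetMax(),
                    stats.getNumNulls());
              }
            }
          }
        }
      }
    }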
> On Wed, 7 Aug 2024 at 05:07, huaxin gao <huaxin.ga...@gmail.com> wrote:
>
>> Thanks Alexander, Xianjin, Gang and Anton for your valuable insights!
>>
>> Regarding deriving min/max values on the fly, I currently don't have a good algorithm. I rely on iterating through FileScanTask objects in memory to aggregate results, which leads me to favor pre-calculating min/max values.
>>
>> I have a slight preference for table-level stats over partition-level stats. Given that we already maintain table-level NDV, it seems consistent to include corresponding table-level stats for min/max values. When computing the table-level stats, if partition filters are applied, we have already filtered out unwanted partitions, which, in my view, does not give partition-level stats an advantage in this context. For incremental updates, even with table-level stats, I think we can still have a way to calculate only the new or updated partitions and then update the table-level stats. However, I am open to other solutions.
>>
>> Thanks,
>> Huaxin
>>
>> On Tue, Aug 6, 2024 at 10:40 AM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:
>>
>>> I'd like to entertain the idea of deriving min/max values on the fly to understand our baseline. What will the algorithm for that look like? I assume the naive approach will be to keep min/max stats for selected columns while planning (as opposed to discarding them upon filtering) and then iterate through FileScanTask objects in memory to aggregate the results?
>>>
>>> - Anton
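As a rough sketch of the naive approach Anton outlines, assuming a scan planned with includeColumnStats() and a single primitive column (the column name, missing-stats handling, and bound-truncation caveats are simplified placeholders):

    // Naive planning-time aggregation of per-file column stats into
    // table-level min/max/null-count values. String/binary bounds may be
    // truncated, so the aggregated values are bounds, not exact extremes.
    import java.nio.ByteBuffer;
    import java.util.Comparator;
    import java.util.Map;
    import org.apache.iceberg.FileScanTask;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.io.CloseableIterable;
    import org.apache.iceberg.types.Comparators;
    import org.apache.iceberg.types.Conversions;
    import org.apache.iceberg.types.Types;

    public class NaiveTableStats {
      static void aggregate(Table table, String columnName) throws Exception {
        Types.NestedField field = table.schema().findField(columnName);
        int fieldId = field.fieldId();
        Comparator<Object> cmp = Comparators.forType(field.type().asPrimitiveType());

        Object min = null;
        Object max = null;
        long nullCount = 0L;

        try (CloseableIterable<FileScanTask> tasks =
            table.newScan().includeColumnStats().planFiles()) {
          for (FileScanTask task : tasks) {
            Map<Integer, ByteBuffer> lowers = task.file().lowerBounds();
            Map<Integer, ByteBuffer> uppers = task.file().upperBounds();
            Map<Integer, Long> nulls = task.file().nullValueCounts();

            if (lowers != null && lowers.containsKey(fieldId)) {
              Object lower = Conversions.fromByteBuffer(field.type(), lowers.get(fieldId));
              min = (min == null || cmp.compare(lower, min) < 0) ? lower : min;
            }
            if (uppers != null && uppers.containsKey(fieldId)) {
              Object upper = Conversions.fromByteBuffer(field.type(), uppers.get(fieldId));
              max = (max == null || cmp.compare(upper, max) > 0) ? upper : max;
            }
            if (nulls != null && nulls.containsKey(fieldId)) {
              nullCount += nulls.get(fieldId);
            }
          }
        }

        System.out.printf("%s: min=%s max=%s nulls=%d%n", columnName, min, max, nullCount);
      }
    }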
>>> On Tue, 6 Aug 2024 at 08:33, Gang Wu <ust...@gmail.com> wrote:
>>>
>>>> Just giving my two cents. Not all tables have a partition definition, and table-level stats would benefit these tables. In addition, NDV might not be easily populated from partition-level statistics.
>>>>
>>>> Thanks,
>>>> Gang
>>>>
>>>> On Tue, Aug 6, 2024 at 9:48 PM Xianjin YE <xian...@apache.org> wrote:
>>>>
>>>>> Thanks for raising the discussion Huaxin.
>>>>>
>>>>> I also think partition-level statistics file(s) are more useful and have an advantage over table-level stats. For instance:
>>>>> 1. It would be straightforward to support incremental stats computation for large tables, by recalculating new or updated partitions only.
>>>>> 2. Table-level stats could be built from partition-level stats easily. It might take some additional time to merge, though.
>>>>> 3. SparkScan usually involves only part of the partitions when a partition filter is applied, so it would be more accurate to infer the stats from the selected partitions only.
>>>>>
>>>>> On Aug 6, 2024, at 02:42, Alexander Jo <alex...@starburstdata.com> wrote:
>>>>>
>>>>> Thanks for starting this thread Huaxin,
>>>>>
>>>>> The existing statistics, on a per-data-file basis, are definitely too granular for use in planning/analysis-time query optimizations. It's worked so far, as tables have been relatively small, but from what I've seen in the Trino community it is starting to be a problem for some.
>>>>>
>>>>> However, I'm not sure that rolling the stats all the way up to the table level is what we want to add next. Would extending the partition stats to include some of the data present for data files be possible? To me, it seems like doing some amount of derivation at query time is okay, as long as the time it takes to do the derivation doesn't increase significantly as the table gets larger.
>>>>>
>>>>> Partition-level stats also have the advantage of being able to provide more accurate estimates for queries with filters on the partition columns.
>>>>>
>>>>> Looking forward to talking more about the project,
>>>>> Alex Jo
>>>>>
>>>>> On Fri, Aug 2, 2024 at 11:24 PM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>>>
>>>>>> Thanks, Samrose and Piotr, for the discussion! This issue is not addressed by the partition statistics feature. What we need are table-level stats.
>>>>>>
>>>>>> Given that our primary goal in collecting statistics is performance optimization, I believe it's not a good approach to derive these statistics at query execution time. That's why I propose saving these metrics in the table-level stats file. I am thinking of reusing the existing aggregate pushdown mechanism to compute min, max, and null counts. We currently have MinAggregate <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/MinAggregate.java>, MaxAggregate <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/MaxAggregate.java>, CountStar <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/CountStar.java>, and CountNonNull <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/CountNonNull.java>. Null counts can be derived from CountStar and CountNonNull.
>>>>>>
>>>>>> Thanks,
>>>>>> Huaxin
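To make the aggregate-pushdown idea concrete: with pushdown enabled, a query of the shape below can be answered from manifest metadata, and the null count falls out of COUNT(*) minus COUNT(col1). The table identifier is a placeholder, and the property name is my reading of SparkSQLProperties.AGGREGATE_PUSH_DOWN_ENABLED:

    // Sketch: min/max/null counts via Iceberg's aggregate pushdown.
    // MIN/MAX map to MinAggregate/MaxAggregate; COUNT(*) and COUNT(col1)
    // map to CountStar and CountNonNull.
    import org.apache.spark.sql.SparkSession;

    public class AggregatePushdownSketch {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("aggregate-pushdown-sketch")
            .getOrCreate();

        // Assumed property name; pushdown is expected to be on by default.
        spark.conf().set("spark.sql.iceberg.aggregate-push-down.enabled", "true");

        spark.sql(
            "SELECT MIN(col1), MAX(col1), COUNT(*) - COUNT(col1) AS null_count "
                + "FROM my_catalog.db.tbl")  // placeholder table identifier
            .show();
      }
    }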
>>>>>> On Fri, Aug 2, 2024 at 1:45 PM Piotr Findeisen <piotr.findei...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> First of all, thank you Huaxin for raising this topic. It's important for Spark, but also for Trino.
>>>>>>>
>>>>>>> Min, max, and null counts can be derived from manifests. I am not saying that a query engine should derive them from manifests at query time, but it definitely can. If we want to pull a table-level summary of min, max, and null counts into the stats Puffin file (which probably makes sense), my concern would be what the exact mechanics for that should be. NDVs need to be derived from the data files, but maybe we can do something smarter for min, max, and null counts.
>>>>>>>
>>>>>>> Best,
>>>>>>> Piotr
>>>>>>>
>>>>>>> On Fri, 2 Aug 2024 at 20:47, Samrose Ahmed <samroseah...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Isn't this addressed by the partition statistics feature, or do you want to have one row for the entire table?
>>>>>>>>
>>>>>>>> On Fri, Aug 2, 2024, 10:47 AM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I would like to initiate a discussion on implementing a table-level statistics file to store column statistics, specifically min, max, and null counts. The original discussion can be found in this Slack thread: https://apache-iceberg.slack.com/archives/C03LG1D563F/p1676395480005779 .
>>>>>>>>>
>>>>>>>>> In Spark 3.4, I introduced Column Statistics <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java#L33> within the Statistics interface <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Statistics.java#L38>, enabling Iceberg to implement and report these metrics to Spark. As a result, Spark can utilize the column statistics for cost-based optimization (CBO), including join reordering.
>>>>>>>>>
>>>>>>>>> Here's how the process operates:
>>>>>>>>>
>>>>>>>>> - Use the ANALYZE TABLE command to compute column statistics and store them appropriately. The SQL syntax is:
>>>>>>>>>
>>>>>>>>>   ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS col1, col2…
>>>>>>>>>
>>>>>>>>>   This command will trigger a ComputeTableStatsSparkAction to compute a data sketch for NDV (number of distinct values) and save it in the StatisticsFile. Here is the ComputeTableStatsAction PR <https://github.com/apache/iceberg/pull/10288>.
>>>>>>>>>
>>>>>>>>>   Additionally, the ANALYZE TABLE command needs to calculate table-level min, max, and null counts for columns like col1 and col2 using the manifest files. These table-level statistics are then saved. Currently, we do not have a designated location to store these table-level column statistics. The StatisticsFile <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/StatisticsFile.java>, being tightly coupled with the Puffin file, does not seem to be a suitable repository for these column stats.
>>>>>>>>>
>>>>>>>>> - If Spark's CBO is on and SparkSQLProperties.REPORT_COLUMN_STATS <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java#L95> is enabled, NDV will be retrieved from the StatisticsFile, and the min/max/null counts will also be retrieved from a table-level column statistics file. These statistics will be utilized to construct the Statistics object in estimateStatistics <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L186>. Spark then employs the statistics returned from estimateStatistics <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala#L86> for cost-based optimization (CBO).
>>>>>>>>>
>>>>>>>>> Please share your thoughts on this.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Huaxin
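As a sketch of how the NDV half of this flow can be exercised from code, assuming the ComputeTableStats action API from the referenced PR; the table identifier is a placeholder, and the property names in the configuration are assumptions to be checked against SparkSQLProperties:

    // Sketch of the NDV part of the flow described above. The table-level
    // min/max/null-count file is the part still under discussion and is not
    // shown here.
    import org.apache.iceberg.StatisticsFile;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.actions.ComputeTableStats;
    import org.apache.iceberg.spark.Spark3Util;
    import org.apache.iceberg.spark.actions.SparkActions;
    import org.apache.spark.sql.SparkSession;

    public class ComputeStatsSketch {
      public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .appName("compute-table-stats-sketch")
            .config("spark.sql.cbo.enabled", "true")                  // Spark CBO
            .config("spark.sql.iceberg.report-column-stats", "true")  // assumed REPORT_COLUMN_STATS value
            .getOrCreate();

        Table table = Spark3Util.loadIcebergTable(spark, "db.tbl");   // placeholder identifier

        // Computes NDV sketches for the listed columns and writes them to a
        // Puffin-backed StatisticsFile registered in table metadata.
        ComputeTableStats.Result result = SparkActions.get(spark)
            .computeTableStats(table)
            .columns("col1", "col2")
            .execute();

        StatisticsFile statsFile = result.statisticsFile();
        System.out.println("Stats written to: " + statsFile.path());
      }
    }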