I also like the middle ground of partition level stats, which also makes incremental refresh (at the partition level) easier. If the roll-up of partition level stats turns out to be slow, I don't mind adding table level stats aggregated from the partition level stats. Having partition level stats first is a good foundation.
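To make the roll-up idea concrete, a rough sketch for a single numeric column (`ColumnStats` is a made-up name here, not an existing Iceberg class, and NDV is deliberately left out because distinct counts don't add up across partitions):

    import java.util.List;

    // Hypothetical per-partition stats entry for one long-typed column; real code
    // would need one entry per column and a comparator per column type.
    class ColumnStats {
      final long min;
      final long max;
      final long nullCount;

      ColumnStats(long min, long max, long nullCount) {
        this.min = min;
        this.max = max;
        this.nullCount = nullCount;
      }

      // Folding two partitions' stats: min of mins, max of maxes, sum of null counts.
      ColumnStats merge(ColumnStats other) {
        return new ColumnStats(
            Math.min(min, other.min), Math.max(max, other.max), nullCount + other.nullCount);
      }

      // Table-level stats are just the fold over all partition-level entries, so an
      // incremental refresh only needs to recompute the entries for changed partitions
      // and then re-run this fold.
      static ColumnStats rollUp(List<ColumnStats> perPartition) {
        return perPartition.stream().reduce(ColumnStats::merge).orElseThrow();
      }
    }
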
For an unpartitioned table, it can probably be treated as a special single partition, as someone was suggesting?

On Wed, Aug 7, 2024 at 7:25 AM Manish Malhotra <manish.malhotra.w...@gmail.com> wrote:

> First of all, thanks a lot Huaxin for starting an important proposal and thread!
>
> A lot of important points have already been discussed.
>
> For me, my thoughts were also tilting towards partition level stats, which Piotr, Alex, Anton and a few others have mentioned as well.
>
> IMO, partition level stats might be a good middle ground.
>
> Plus, before that, does it make sense to try to derive the stats (in the current state) and see whether it adds a lot of time?
>
> Regards,
> Manish
>
> On Wed, Aug 7, 2024 at 3:38 AM Piotr Findeisen <piotr.findei...@gmail.com> wrote:
>
>> Hi All,
>>
>> Thank you for the interesting discussion so far, and the many viewpoints shared!
>>
>> > Not all tables have a partition definition and table-level stats would benefit these tables
>>
>> Agreed that tables don't always have partitions. Current partition stats are appropriate for partitioned tables only, mainly because they contain information that's available in table metadata anyway.
>>
>> If we evolve partition stats to also include column min/maxes, I would advocate for partition stats to be applicable to unpartitioned tables as well, perhaps just by omitting the `partition` field in that case.
>>
>> The partition-level information has the important advantage that it allows a query engine to come up with the best estimate for the *current query*, taking the query's effective predicates into account and pruning the information. So sooner or later we will want to add min/max information to partition stats. (Is it WIP already?) Since partition stats are Parquet, deriving table-level min/max information from partition-level min/max information is as simple as reading the Parquet file footer.
>>
>> > Given that we already maintain table-level NDV, it seems consistent to include corresponding table-level stats for min/max values.
>>
>> That's reasonable thinking, and consistency is important. There is, however, an important difference between NDV and min/max. The table-level NDV cannot be easily aggregated from partition-level NDV, because that quickly turns out to be a CPU-intensive process. Keeping partition-level NDV information (a number, and a sketch) is a good idea, but it's suitable only for queries which read one to a few partitions. Queries reading a large number of partitions would still prefer to use table-level NDV information (if available).
>>
>> Best
>> Piotr
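To Piotr's footer point, that part really is cheap. A rough sketch with parquet-mr, assuming a local partition stats file with statistics populated (the path is a placeholder, and mapping footer columns back to Iceberg field IDs is glossed over here):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.column.statistics.Statistics;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
    import org.apache.parquet.hadoop.util.HadoopInputFile;

    public class FooterStats {
      // Prints min/max/null count per column chunk (one entry per row group and column),
      // taken from the footer only; no row groups are scanned.
      public static void main(String[] args) throws IOException {
        Path path = new Path("partition-stats.parquet"); // placeholder path
        try (ParquetFileReader reader =
            ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
          for (BlockMetaData block : reader.getFooter().getBlocks()) {
            for (ColumnChunkMetaData column : block.getColumns()) {
              Statistics<?> stats = column.getStatistics();
              System.out.printf(
                  "%s: min=%s max=%s nulls=%d%n",
                  column.getPath().toDotString(),
                  stats.genericGetMin(),
                  stats.genericGetMax(),
                  stats.getNumNulls());
            }
          }
        }
      }
    }

Rolling those per-column values up across partitions is then the same kind of fold sketched earlier.
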
>>
>> On Wed, 7 Aug 2024 at 05:07, huaxin gao <huaxin.ga...@gmail.com> wrote:
>>
>>> Thanks Alexander, Xianjin, Gang and Anton for your valuable insights!
>>>
>>> Regarding deriving min/max values on the fly, I currently don't have a good algorithm. I rely on iterating through FileScanTask objects in memory to aggregate results, which leads me to favor pre-calculating min/max values.
>>>
>>> I have a slight preference for table-level stats over partition-level stats. Given that we already maintain table-level NDV, it seems consistent to include corresponding table-level stats for min/max values. When computing the table-level stats, if partition filters are applied, we have already filtered out unwanted partitions, which, in my view, does not give partition-level stats an advantage in this context.
>>>
>>> For incremental updates, even with table-level stats, I think we can still have a way to calculate stats for only the new or updated partitions and then update the table-level stats. However, I am open to other solutions.
>>>
>>> Thanks,
>>> Huaxin
>>>
>>> On Tue, Aug 6, 2024 at 10:40 AM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:
>>>
>>>> I'd like to entertain the idea of deriving min/max values on the fly to understand our baseline. What will the algorithm for that look like? I assume the naive approach will be to keep min/max stats for selected columns while planning (as opposed to discarding them upon filtering) and then iterate through FileScanTask objects in memory to aggregate the results?
>>>>
>>>> - Anton
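The naive approach Anton describes would be roughly the following (just a sketch: it reads each file's lower/upper bounds and null counts from the planned tasks and folds them in memory, ignoring deletes, truncated bounds, and files with missing metrics):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.Map;
    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.FileScanTask;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.io.CloseableIterable;
    import org.apache.iceberg.types.Conversions;
    import org.apache.iceberg.types.Types;

    public class NaiveMinMax {
      // Aggregates one column's min/max/null count by walking the planned scan tasks.
      static void aggregate(Table table, String columnName) throws IOException {
        Types.NestedField field = table.schema().findField(columnName);
        int fieldId = field.fieldId();

        Object min = null;
        Object max = null;
        long nulls = 0L;

        try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
          for (FileScanTask task : tasks) {
            DataFile file = task.file();
            Map<Integer, ByteBuffer> lowers = file.lowerBounds();
            Map<Integer, ByteBuffer> uppers = file.upperBounds();
            Map<Integer, Long> nullCounts = file.nullValueCounts();

            if (lowers != null && lowers.containsKey(fieldId)) {
              Object lower = Conversions.fromByteBuffer(field.type(), lowers.get(fieldId));
              min = (min == null || compare(lower, min) < 0) ? lower : min;
            }
            if (uppers != null && uppers.containsKey(fieldId)) {
              Object upper = Conversions.fromByteBuffer(field.type(), uppers.get(fieldId));
              max = (max == null || compare(upper, max) > 0) ? upper : max;
            }
            if (nullCounts != null && nullCounts.containsKey(fieldId)) {
              nulls += nullCounts.get(fieldId);
            }
          }
        }

        System.out.printf("%s: min=%s max=%s nullCount=%d%n", columnName, min, max, nulls);
      }

      @SuppressWarnings("unchecked")
      private static int compare(Object a, Object b) {
        return ((Comparable<Object>) a).compareTo(b);
      }
    }

The cost is one pass over all file metadata per query (or per stats refresh), which is why pre-computing and storing the result looks attractive for large tables.
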
>>>>
>>>> On Tue, 6 Aug 2024 at 08:33, Gang Wu <ust...@gmail.com> wrote:
>>>>
>>>>> Just giving my two cents. Not all tables have a partition definition, and table-level stats would benefit these tables. In addition, NDV might not be easily populated from partition-level statistics.
>>>>>
>>>>> Thanks,
>>>>> Gang
>>>>>
>>>>> On Tue, Aug 6, 2024 at 9:48 PM Xianjin YE <xian...@apache.org> wrote:
>>>>>
>>>>>> Thanks for raising the discussion, Huaxin.
>>>>>>
>>>>>> I also think partition-level statistics file(s) are more useful and have advantages over table-level stats. For instance:
>>>>>> 1. It would be straightforward to support incremental stats computation for large tables, by recalculating new or updated partitions only.
>>>>>> 2. Table-level stats could be built from partition-level stats easily. It might take some additional time to merge, though.
>>>>>> 3. A SparkScan usually involves only part of the partitions if a `partition filter` is involved; it would be more accurate to infer the stats from the selected partitions only.
>>>>>>
>>>>>> On Aug 6, 2024, at 02:42, Alexander Jo <alex...@starburstdata.com> wrote:
>>>>>>
>>>>>> Thanks for starting this thread Huaxin,
>>>>>>
>>>>>> The existing statistics, on a per-data-file basis, are definitely too granular for use in planning/analysis-time query optimizations. It's worked so far, as tables have been relatively small, but from what I've seen in the Trino community it is starting to be a problem for some.
>>>>>>
>>>>>> However, I'm not sure that rolling the stats all the way up to the table level is what we want to add next. Would extending the partition stats to include some of the data present for data files be possible? To me, it seems like doing some amount of derivation at query time is okay, as long as the time it takes to do the derivation doesn't increase significantly as the table gets larger.
>>>>>>
>>>>>> Partition-level stats also have the advantage of being able to provide more accurate estimates for queries with filters on the partition columns.
>>>>>>
>>>>>> Looking forward to talking more about the project,
>>>>>> Alex Jo
>>>>>>
>>>>>> On Fri, Aug 2, 2024 at 11:24 PM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks, Samrose and Piotr, for the discussion! This issue is not addressed by the partition statistics feature. What we need are table-level stats.
>>>>>>>
>>>>>>> Given that our primary goal in collecting statistics is performance optimization, I believe it's not a good approach to derive these statistics at query execution time. That's why I propose saving these metrics in the table-level stats file. I am thinking of reusing the existing aggregate pushdown mechanism to compute min, max and null counts. We currently have MinAggregate <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/MinAggregate.java>, MaxAggregate <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/MaxAggregate.java>, CountStar <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/CountStar.java>, and CountNotNull <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/CountNonNull.java>. Null counts can be derived from CountStar and CountNotNull.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Huaxin
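If the ANALYZE path reuses aggregate pushdown, the computation could be as simple as a plain aggregate query. A sketch assuming a table `db.events` and a column `col1` (placeholder names); whether the aggregates are actually answered from manifest metadata depends on the usual pushdown preconditions, which this glosses over:

    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class AnalyzeSketch {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("table-level-stats").getOrCreate();

        Row row = spark.sql(
                "SELECT min(col1) AS min_col1, max(col1) AS max_col1, "
                    + "count(*) AS row_count, count(col1) AS non_null_col1 "
                    + "FROM db.events")
            .collectAsList()
            .get(0);

        long rowCount = row.getLong(row.fieldIndex("row_count"));
        long nonNull = row.getLong(row.fieldIndex("non_null_col1"));
        long nullCount = rowCount - nonNull; // CountStar minus CountNotNull, as described above

        System.out.printf(
            "min=%s max=%s nulls=%d%n",
            row.get(row.fieldIndex("min_col1")),
            row.get(row.fieldIndex("max_col1")),
            nullCount);
      }
    }
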
>>>>>>>
>>>>>>> On Fri, Aug 2, 2024 at 1:45 PM Piotr Findeisen <piotr.findei...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> First of all, thank you Huaxin for raising this topic. It's important for Spark, but also for Trino.
>>>>>>>>
>>>>>>>> Min, max, and null counts can be derived from manifests. I am not saying that a query engine should derive them from manifests at query time, but it definitely can. If we want to pull a min, max, and null count table-level summary into a stats Puffin file (which probably makes sense), my concern would be what the exact mechanics for that should be. NDVs need to be derived from the data files, but maybe we can do something smarter for min, max, and null counts.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Piotr
>>>>>>>>
>>>>>>>> On Fri, 2 Aug 2024 at 20:47, Samrose Ahmed <samroseah...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Isn't this addressed by the partition statistics feature, or do you want to have one row for the entire table?
>>>>>>>>>
>>>>>>>>> On Fri, Aug 2, 2024, 10:47 AM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I would like to initiate a discussion on implementing a table-level statistics file to store column statistics, specifically min, max, and null counts. The original discussion can be found in this Slack thread: https://apache-iceberg.slack.com/archives/C03LG1D563F/p1676395480005779
>>>>>>>>>>
>>>>>>>>>> In Spark 3.4, I introduced Column Statistics <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java#L33> within the Statistics interface <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Statistics.java#L38>, enabling Iceberg to implement and report these metrics to Spark. As a result, Spark can utilize the column statistics for cost-based optimization (CBO), including join reordering.
>>>>>>>>>>
>>>>>>>>>> Here's how the process operates:
>>>>>>>>>>
>>>>>>>>>> - Use the ANALYZE TABLE command to compute column statistics and store them appropriately. The SQL syntax is:
>>>>>>>>>>
>>>>>>>>>>   ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS col1, col2…
>>>>>>>>>>
>>>>>>>>>>   This command will trigger a ComputeTableStatsSparkAction to compute a data sketch for NDV (number of distinct values) and save it in the StatisticsFile. Here is the ComputeTableStatsAction PR: <https://github.com/apache/iceberg/pull/10288>
>>>>>>>>>>
>>>>>>>>>>   Additionally, the ANALYZE TABLE command needs to calculate table-level min, max, and null counts for columns like col1 and col2 using the manifest files. These table-level statistics are then saved. Currently, we do not have a designated location to store these table-level column statistics. The StatisticsFile <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/StatisticsFile.java>, being tightly coupled with the Puffin file, does not seem to be a suitable repository for these column stats.
>>>>>>>>>>
>>>>>>>>>> - If Spark's CBO is on and SparkSQLProperties.REPORT_COLUMN_STATS <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java#L95> is enabled, the NDV will be retrieved from the StatisticsFile, and the min/max/null counts will also be retrieved from a table-level column statistics file. These statistics will be utilized to construct the Statistics object in estimateStatistics <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L186>. Spark then employs the statistics returned from estimateStatistics <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala#L86> for cost-based optimization (CBO).
>>>>>>>>>>
>>>>>>>>>> Please share your thoughts on this.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Huaxin
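For the reporting side, a hand-rolled illustration of the Spark DSv2 interfaces mentioned above (this is not Iceberg's actual SparkScan code; the numbers and the column map are placeholders for values that would come from the Puffin NDV blob and the proposed table-level stats file):

    import java.util.Map;
    import java.util.Optional;
    import java.util.OptionalLong;
    import org.apache.spark.sql.connector.expressions.Expressions;
    import org.apache.spark.sql.connector.expressions.NamedReference;
    import org.apache.spark.sql.connector.read.Statistics;
    import org.apache.spark.sql.connector.read.colstats.ColumnStatistics;

    // What estimateStatistics() could hand back to Spark's CBO.
    class SketchStatistics implements Statistics {
      private final Map<NamedReference, ColumnStatistics> colStats;

      SketchStatistics(Map<NamedReference, ColumnStatistics> colStats) {
        this.colStats = colStats;
      }

      @Override
      public OptionalLong sizeInBytes() {
        return OptionalLong.of(10L * 1024 * 1024); // placeholder estimate
      }

      @Override
      public OptionalLong numRows() {
        return OptionalLong.of(1_000_000L); // placeholder estimate
      }

      @Override
      public Map<NamedReference, ColumnStatistics> columnStats() {
        return colStats;
      }
    }

    // Per-column stats for a single column; methods not overridden keep their defaults.
    class SketchColumnStats implements ColumnStatistics {
      @Override
      public OptionalLong distinctCount() {
        return OptionalLong.of(42_000L); // e.g. from the Puffin NDV sketch
      }

      @Override
      public Optional<Object> min() {
        return Optional.of(0L); // e.g. table-level min for col1
      }

      @Override
      public Optional<Object> max() {
        return Optional.of(99_999L); // e.g. table-level max for col1
      }

      @Override
      public OptionalLong nullCount() {
        return OptionalLong.of(12L); // e.g. table-level null count for col1
      }
    }

Something like new SketchStatistics(Map.of(Expressions.column("col1"), new SketchColumnStats())) is then enough for Spark to feed min/max/null count/NDV into join reordering and other CBO decisions.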