Thanks Alexander, Xianjin, Gang and Anton for your valuable insights! Regarding deriving min/max values on the fly, I currently don't have a good algorithm. I rely on iterating through FileScanTask objects in memory to aggregate results, which leads me to favor pre-calculating min/max values.
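
For reference, here is a minimal sketch of that naive approach (the class and method names are made up for illustration; it assumes column stats are retained on the tasks via includeColumnStats, and because lower/upper bounds can be truncated, the result is a bound rather than an exact min):

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.Comparator;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.types.Comparators;
import org.apache.iceberg.types.Conversions;
import org.apache.iceberg.types.Types;

public class NaiveMinAggregator {
  // Derives a table-level min for one column by scanning every FileScanTask's
  // lower bounds; gives up (returns null) as soon as any file lacks the bound.
  static <T> T tableMin(Table table, String column) {
    Types.NestedField field = table.schema().findField(column);
    Comparator<T> cmp = Comparators.forType(field.type().asPrimitiveType());
    T min = null;
    try (CloseableIterable<FileScanTask> tasks =
        table.newScan().includeColumnStats().planFiles()) {
      for (FileScanTask task : tasks) {
        ByteBuffer lower = task.file().lowerBounds() == null
            ? null : task.file().lowerBounds().get(field.fieldId());
        if (lower == null) {
          return null; // cannot claim a table-level min without this file's bound
        }
        T value = Conversions.fromByteBuffer(field.type(), lower);
        min = (min == null || cmp.compare(value, min) < 0) ? value : min;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return min;
  }
}

Even with bounds kept around during planning, we would still need to iterate over the tasks and merge per query, which is what pushes me toward pre-calculating and persisting the values once.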
I have a slight preference for table-level stats over partition-level stats. Given that we already maintain table-level NDV, it seems consistent to include corresponding table-level stats for min/max values. When computing the table-level stats, if partition filters are applied, we have already filtered out unwanted partitions, which, in my view, does not give partition-level stats an advantage in this context. For incremental updates, even with table-level stats, I think we can still have a way to recompute stats for only the new or updated partitions and then update the table-level stats. However, I am open to other solutions.

Thanks,
Huaxin

On Tue, Aug 6, 2024 at 10:40 AM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:

> I'd like to entertain the idea of deriving min/max values on the fly to understand our baseline. What will the algorithm for that look like? I assume the naive approach will be to keep min/max stats for selected columns while planning (as opposed to discarding them upon filtering) and then iterate through FileScanTask objects in memory to aggregate the results?
>
> - Anton
>
> On Tue, Aug 6, 2024 at 08:33 Gang Wu <ust...@gmail.com> wrote:
>
>> Just giving my two cents. Not all tables have a partition definition, and table-level stats would benefit those tables. In addition, NDV might not be easily populated from partition-level statistics.
>>
>> Thanks,
>> Gang
>>
>> On Tue, Aug 6, 2024 at 9:48 PM Xianjin YE <xian...@apache.org> wrote:
>>
>>> Thanks for raising the discussion Huaxin.
>>>
>>> I also think partition-level statistics file(s) are more useful and have an advantage over table-level stats. For instance:
>>> 1. It would be straightforward to support incremental stats computation for large tables, by recalculating only new or updated partitions.
>>> 2. Table-level stats could be built from partition-level stats easily, although merging might take some additional time.
>>> 3. A SparkScan usually involves only part of the partitions when a `partition filter` is applied, so it would be more accurate to infer the stats from the selected partitions only.
>>>
>>> On Aug 6, 2024, at 02:42, Alexander Jo <alex...@starburstdata.com> wrote:
>>>
>>> Thanks for starting this thread Huaxin,
>>>
>>> The existing statistics, on a per-data-file basis, are definitely too granular for use in planning/analysis-time query optimizations. It's worked so far, as tables have been relatively small, but from what I've seen in the Trino community it is starting to be a problem for some.
>>>
>>> However, I'm not sure that rolling the stats all the way up to the table level is what we want to add next. Would extending the partition stats to include some of the data present for data files be possible? To me, it seems like doing some amount of derivation at query time is okay, as long as the time it takes to do the derivation doesn't increase significantly as the table gets larger.
>>>
>>> Partition-level stats also have the advantage of being able to provide more accurate estimates for queries with filters on the partition columns.
>>>
>>> Looking forward to talking more about the project,
>>> Alex Jo
>>>
>>> On Fri, Aug 2, 2024 at 11:24 PM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>
>>>> Thanks, Samrose and Piotr, for the discussion! This issue is not addressed by the partition statistics feature. What we need are table-level stats.
>>>>
>>>> Given that our primary goal in collecting statistics is performance optimization, I believe it's not a good approach to derive these statistics at query execution time. That's why I propose saving these metrics in the table-level stats file. I am thinking of reusing the existing aggregate pushdown mechanism to compute the min, max, and null counts. We currently have MinAggregate <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/MinAggregate.java>, MaxAggregate <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/MaxAggregate.java>, CountStar <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/CountStar.java>, and CountNotNull <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/CountNonNull.java>. Null counts can be derived from CountStar and CountNotNull.
>>>>
>>>> Thanks,
>>>> Huaxin
>>>>
>>>> On Fri, Aug 2, 2024 at 1:45 PM Piotr Findeisen <piotr.findei...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> First of all, thank you Huaxin for raising this topic. It's important for Spark, but also for Trino.
>>>>>
>>>>> Min, max, and null counts can be derived from manifests. I am not saying that a query engine should derive them from manifests at query time, but it definitely can. If we want to pull a table-level summary of min, max, and null counts into the stats Puffin file (which probably makes sense), my concern would be what the exact mechanics for that should be. NDVs need to be derived from the data files, but maybe we can do something smarter for min, max, and null counts.
>>>>>
>>>>> Best,
>>>>> Piotr
>>>>>
>>>>> On Fri, 2 Aug 2024 at 20:47, Samrose Ahmed <samroseah...@gmail.com> wrote:
>>>>>
>>>>>> Isn't this addressed by the partition statistics feature, or do you want to have one row for the entire table?
>>>>>>
>>>>>> On Fri, Aug 2, 2024, 10:47 AM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>>>>
>>>>>>> I would like to initiate a discussion on implementing a table-level statistics file to store column statistics, specifically min, max, and null counts. The original discussion can be found in this Slack thread: https://apache-iceberg.slack.com/archives/C03LG1D563F/p1676395480005779.
>>>>>>>
>>>>>>> In Spark 3.4, I introduced Column Statistics <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java#L33> within the Statistics interface <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Statistics.java#L38>, enabling Iceberg to implement and report these metrics to Spark. As a result, Spark can utilize the column statistics for cost-based optimization (CBO), including join reordering.
>>>>>>>
>>>>>>> Here’s how the process operates:
>>>>>>>
>>>>>>> - Use the ANALYZE TABLE command to compute column statistics and store them appropriately. The SQL syntax is:
>>>>>>>
>>>>>>> ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS col1, col2…
>>>>>>>
>>>>>>> This command will trigger a ComputeTableStatsSparkAction to compute a data sketch for NDV (number of distinct values) and save it in the StatisticsFile.
>>>>>>> Here is the ComputeTableStatsAction PR <https://github.com/apache/iceberg/pull/10288>
>>>>>>>
>>>>>>> Additionally, the ANALYZE TABLE command needs to calculate table-level min, max, and null counts for columns like col1 and col2 using the manifest files. These table-level statistics are then saved. Currently, we do not have a designated location to store these table-level column statistics. The StatisticsFile <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/StatisticsFile.java>, being tightly coupled with the Puffin file, does not seem to be a suitable repository for these column stats.
>>>>>>>
>>>>>>> - If Spark's CBO is on and SparkSQLProperties.REPORT_COLUMN_STATS <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java#L95> is enabled, NDV will be retrieved from the StatisticsFile, and the min/max/null counts will also be retrieved from a table-level column statistics file. These statistics will be utilized to construct the Statistics object in estimateStatistics <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L186>. Spark then employs the statistics returned from estimateStatistics <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala#L86> for cost-based optimization (CBO).
>>>>>>>
>>>>>>> Please share your thoughts on this.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Huaxin
>>>
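
To make the aggregate pushdown idea in the quoted proposal concrete, here is a hypothetical sketch (the class and method names are invented, not the ComputeTableStatsSparkAction API) of computing table-level min, max, and null counts through Spark, where the null count is derived as CountStar minus CountNotNull; with Iceberg's aggregate pushdown enabled, the MIN/MAX/COUNT aggregates can often be answered from manifest metadata rather than by scanning data files:

import java.util.List;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TableLevelColumnStatsSketch {
  // Computes min, max, and null count for one column of an Iceberg table.
  // The null count is COUNT(*) - COUNT(col), i.e. CountStar - CountNotNull.
  static Row computeColumnStats(SparkSession spark, String table, String column) {
    String sql = String.format(
        "SELECT MIN(%s) AS min_value, MAX(%s) AS max_value, "
            + "COUNT(*) - COUNT(%s) AS null_count FROM %s",
        column, column, column, table);
    List<Row> rows = spark.sql(sql).collectAsList();
    return rows.get(0); // one row per table: (min_value, max_value, null_count)
  }
}

Whatever the exact mechanics end up being, the resulting values would then be written to whatever table-level stats location we agree on above.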