Just to give my two cents: not all tables have a partition definition, and table-level stats would benefit those tables. In addition, NDV might not be easy to populate from partition-level statistics.
Thanks,
Gang

On Tue, Aug 6, 2024 at 9:48 PM Xianjin YE <xian...@apache.org> wrote:

> Thanks for raising the discussion, Huaxin.
>
> I also think partition-level statistics file(s) are more useful and have an
> advantage over table-level stats. For instance:
> 1. It would be straightforward to support incremental stats computation for
> large tables, by recalculating only new or updated partitions.
> 2. Table-level stats can be built from partition-level stats easily. It
> might take some additional time to merge them, though.
> 3. A SparkScan usually involves only a subset of partitions when a
> `partition filter` is applied, so it would be more accurate to infer the
> stats from the selected partitions only.
>
> On Aug 6, 2024, at 02:42, Alexander Jo <alex...@starburstdata.com> wrote:
>
> Thanks for starting this thread, Huaxin.
>
> The existing statistics, kept on a per-data-file basis, are definitely too
> granular for use in planning/analysis-time query optimizations.
> It's worked so far, as tables have been relatively small, but from what
> I've seen in the Trino community it is starting to be a problem for some.
>
> However, I'm not sure that rolling the stats all the way up to the table
> level is what we want to add next.
> Would extending the partition stats to include some of the data present
> for data files be possible?
> To me, it seems like doing some amount of derivation at query time is
> okay, as long as the time it takes to do the derivation doesn't increase
> significantly as the table gets larger.
>
> Partition-level stats also have the advantage of being able to provide
> more accurate estimates for queries with filters on the partition columns.
>
> Looking forward to talking more about the project,
> Alex Jo
>
>
> On Fri, Aug 2, 2024 at 11:24 PM huaxin gao <huaxin.ga...@gmail.com> wrote:
>
>> Thanks, Samrose and Piotr, for the discussion! This issue is not addressed
>> by the partition statistics feature. What we need are table-level stats.
>>
>> Given that our primary goal in collecting statistics is performance
>> optimization, I believe it's not a good approach to derive these statistics
>> at query execution time. That's why I propose saving these metrics in the
>> table-level stats file. I am thinking of reusing the existing aggregate
>> pushdown mechanism to compute min, max, and null counts. We currently have
>> MinAggregate
>> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/MinAggregate.java>,
>> MaxAggregate
>> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/MaxAggregate.java>,
>> CountStar
>> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/CountStar.java>,
>> and CountNotNull
>> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/CountNonNull.java>.
>> Null counts can be derived from CountStar and CountNotNull.
>>
>> Thanks,
>> Huaxin
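As an aside, both Huaxin's note above (null counts falling out of CountStar and CountNotNull) and Piotr's point in the message quoted below (that min, max, and null counts can be derived from manifests) can be made concrete with a small roll-up over the per-file metrics that manifests already track. The sketch below is illustrative only and not part of the proposal: it assumes a loaded org.apache.iceberg.Table named `table`, uses only the public Iceberg Java scan-planning API, and the class and method names are made up for the example.

// A rough sketch: rolling per-file null counts from the manifests up to
// table-level totals. No data files are read; everything comes from the
// metrics stored in manifest entries.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

public class TableLevelNullCounts {

  // Sums the null-value counts recorded in each data file's metrics.
  // Row-level deletes are ignored, so the result is an upper bound for
  // tables that carry delete files.
  public static Map<String, Long> nullCounts(Table table) throws IOException {
    Map<String, Long> totals = new HashMap<>();
    try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
      for (FileScanTask task : tasks) {
        DataFile file = task.file();
        Map<Integer, Long> perFile = file.nullValueCounts();
        if (perFile == null) {
          continue; // metrics may be missing for some files
        }
        for (Map.Entry<Integer, Long> entry : perFile.entrySet()) {
          String column = table.schema().findColumnName(entry.getKey());
          totals.merge(column, entry.getValue(), Long::sum);
        }
      }
    }
    // Min/max could be merged the same way from lowerBounds()/upperBounds(),
    // decoding the ByteBuffers with Conversions.fromByteBuffer(type, buffer).
    return totals;
  }
}

Deriving this at query time means a full planFiles() pass over the manifests, which is exactly the planning-time cost a persisted table-level stats file would avoid on large tables.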
>>
>> On Fri, Aug 2, 2024 at 1:45 PM Piotr Findeisen <piotr.findei...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> First of all, thank you Huaxin for raising this topic. It's important
>>> for Spark, but also for Trino.
>>>
>>> Min, max, and null counts can be derived from manifests.
>>> I am not saying that a query engine should derive them from manifests at
>>> query time, but it definitely can.
>>> If we want to pull a table-level summary of min, max, and null counts
>>> into a stats Puffin file (which probably makes sense), my concern would
>>> be what the exact mechanics for that should be. NDVs need to be derived
>>> from the data files, but maybe we can do something smarter for min, max,
>>> and null counts.
>>>
>>> Best,
>>> Piotr
>>>
>>>
>>> On Fri, 2 Aug 2024 at 20:47, Samrose Ahmed <samroseah...@gmail.com> wrote:
>>>
>>>> Isn't this addressed by the partition statistics feature, or do you
>>>> want to have one row for the entire table?
>>>>
>>>> On Fri, Aug 2, 2024, 10:47 AM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>>
>>>>> I would like to initiate a discussion on implementing a table-level
>>>>> statistics file to store column statistics, specifically min, max, and
>>>>> null counts. The original discussion can be found in this Slack thread:
>>>>> https://apache-iceberg.slack.com/archives/C03LG1D563F/p1676395480005779
>>>>>
>>>>> In Spark 3.4, I introduced ColumnStatistics
>>>>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java#L33>
>>>>> within the Statistics interface
>>>>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Statistics.java#L38>,
>>>>> enabling Iceberg to implement and report these metrics to Spark. As a
>>>>> result, Spark can utilize the column statistics for cost-based
>>>>> optimization (CBO), including join reordering.
>>>>>
>>>>> Here's how the process operates:
>>>>>
>>>>> - Use the ANALYZE TABLE command to compute column statistics and
>>>>> store them appropriately. The SQL syntax is:
>>>>>
>>>>>   ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS col1, col2…
>>>>>
>>>>> This command will trigger a ComputeTableStatsSparkAction to compute a
>>>>> data sketch for NDV (number of distinct values) and save it in the
>>>>> StatisticsFile. Here is the ComputeTableStatsAction PR:
>>>>> <https://github.com/apache/iceberg/pull/10288>
>>>>>
>>>>> Additionally, the ANALYZE TABLE command needs to calculate table-level
>>>>> min, max, and null counts for columns like col1 and col2 using the
>>>>> manifest files. These table-level statistics are then saved. Currently,
>>>>> we do not have a designated location to store these table-level column
>>>>> statistics. The StatisticsFile
>>>>> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/StatisticsFile.java>,
>>>>> being tightly coupled with the Puffin file, does not seem to be a
>>>>> suitable repository for these column stats.
>>>>>
>>>>> - If Spark's CBO is on and SparkSQLProperties.REPORT_COLUMN_STATS
>>>>> <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java#L95>
>>>>> is enabled, NDV will be retrieved from the StatisticsFile, and the
>>>>> min/max/null counts will also be retrieved from a table-level column
>>>>> statistics file. These statistics will be utilized to construct the
>>>>> Statistics object in estimateStatistics
>>>>> <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L186>.
>>>>> Spark then employs the statistics returned from estimateStatistics
>>>>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala#L86>
>>>>> for cost-based optimization (CBO).
>>>>>
>>>>> Please share your thoughts on this.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Huaxin
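On the first bullet in the proposal above: the linked ComputeTableStatsAction PR exposes the NDV computation as a Spark action. A minimal sketch of invoking it directly follows; it assumes the action API from that PR (exact method names may differ between releases), and `spark` and `table` are assumed to be an existing SparkSession and a loaded Iceberg table.

import org.apache.iceberg.Table;
import org.apache.iceberg.actions.ComputeTableStats;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class ComputeNdvStats {
  // Computes NDV sketches for the listed columns and writes them to a
  // Puffin-backed StatisticsFile registered in table metadata. The column
  // names mirror the col1/col2 placeholders in the ANALYZE TABLE example.
  public static void run(SparkSession spark, Table table) {
    ComputeTableStats.Result result =
        SparkActions.get(spark)
            .computeTableStats(table)
            .columns("col1", "col2")
            .execute();
    // result references the StatisticsFile that was written, if any.
  }
}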
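And on the second bullet (the read/report side): for Spark's CBO to use these numbers, estimateStatistics has to hand back objects implementing the Spark 3.4+ DataSource V2 interfaces linked above. The sketch below shows only the shape of that contract; it is not the actual SparkScan implementation, and the class names, constructors, and placeholder values are illustrative assumptions.

import java.util.Map;
import java.util.Optional;
import java.util.OptionalLong;
import org.apache.spark.sql.connector.expressions.Expressions;
import org.apache.spark.sql.connector.expressions.NamedReference;
import org.apache.spark.sql.connector.read.Statistics;
import org.apache.spark.sql.connector.read.colstats.ColumnStatistics;

// Per-column stats: NDV from the Puffin StatisticsFile, min/max/null count
// from the proposed table-level column stats file.
class ReportedColumnStats implements ColumnStatistics {
  private final long ndv;
  private final long nullCount;
  private final Object min;
  private final Object max;

  ReportedColumnStats(long ndv, long nullCount, Object min, Object max) {
    this.ndv = ndv;
    this.nullCount = nullCount;
    this.min = min;
    this.max = max;
  }

  @Override public OptionalLong distinctCount() { return OptionalLong.of(ndv); }
  @Override public OptionalLong nullCount() { return OptionalLong.of(nullCount); }
  @Override public Optional<Object> min() { return Optional.ofNullable(min); }
  @Override public Optional<Object> max() { return Optional.ofNullable(max); }
}

// Table-level stats returned from estimateStatistics when CBO and
// REPORT_COLUMN_STATS are enabled.
class ReportedStatistics implements Statistics {
  private final long numRows;
  private final long sizeInBytes;
  private final Map<NamedReference, ColumnStatistics> columnStats;

  ReportedStatistics(long numRows, long sizeInBytes,
                     Map<NamedReference, ColumnStatistics> columnStats) {
    this.numRows = numRows;
    this.sizeInBytes = sizeInBytes;
    this.columnStats = columnStats;
  }

  @Override public OptionalLong numRows() { return OptionalLong.of(numRows); }
  @Override public OptionalLong sizeInBytes() { return OptionalLong.of(sizeInBytes); }
  @Override public Map<NamedReference, ColumnStatistics> columnStats() { return columnStats; }
}

// Example wiring with placeholder numbers:
//   Map.of(Expressions.column("col1"), new ReportedColumnStats(1_000L, 5L, 0, 99))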