Hi, Jark Thanks for the feedback.
> 1) All the ability interfaces begin with "Supports" instead of "Support". +1 > The "connect" word should be "collect"? Yes, it's a typo. > CatalogStatistics Yes, we should use TableStats. I forgot that TableStats and ColumnStats have been ported to the API module. > What's the difference between them? table.optimizer.source.collect-statistics-enabled is used for all collectors, while source.statistics-type is file base connectors. It may take a long time to get the detailed statistics, but may be the file size (will be introduced later) is enough. > IMO, we should also support Hive source as well in this FLIP. +1 Best, Godfrey Jark Wu <imj...@gmail.com> 于2022年5月20日周五 12:04写道: > > Hi Godfrey, > > I just left some comments here: > > 1) SupportStatisticReport => SupportsStatisticReport > All the ability interfaces begin with "Supports" instead of "Support". > > 2) table.optimizer.source.connect-statistics-enabled > The "connect" word should be "collect"? > > 3) CatalogStatistics > I was a little confused when I first saw the name. I thought it reports > stats for a catalog... > Why not use "TableStats" which already wraps "ColumnStats" in it and is a > public API as well? > > 4) source.statistics-type > vs table.optimizer.source.collect-statistics-enabled > What's the difference between them? It seems that they are both used to > enable or disable reporting stats. > > 5) "Which connectors and formats will be supported by default?" > IMO, we should also support Hive source as well in this FLIP. > Hive source is more widely used than Filesystem connector. > > Best, > Jark > > > > > On Tue, 17 May 2022 at 10:52, Jingsong Li <jingsongl...@gmail.com> wrote: > > > Hi Godfrey, > > > > Thanks for your reply. > > > > Sounds good to me. > > > > > I think we should also introduce a config option > > > > We can add this option to the FLIP. I prefer a option for > > FileSystemConnector, maybe a enum. > > > > Best, > > Jingsong > > > > On Tue, May 17, 2022 at 10:31 AM godfrey he <godfre...@gmail.com> wrote: > > > > > Hi Jingsong, > > > > > > Thanks for the feedback. > > > > > > > > > >One concern I have is that we read the footer for each file, and this > > may > > > >be a bit costly in some cases. Is it possible for us to have some > > > > hierarchical way > > > yes, if there are thousands of orc/parquet files, it may take a long > > time. > > > So we can introduce a config option to let the user choose the > > > granularity of the statistics. > > > But the SIZE will not be introduced, because the planner does not use > > > the file size statistics now. > > > We can introduce once file size statistics is introduce in the future. > > > I think we should also introduce a config option to enable/disable > > > SupportStatisticReport, > > > because it's a heavy operation for some connectors in some cases. > > > > > > > is the filter pushdown already happening at > > > > this time? > > > That's a good point. Currently, the filter push down is after partition > > > pruning > > > to prevent the filter push down rule from consuming the partition > > > predicates. > > > The statistics will be set to unknown if filter is pushed down now. > > > To combine them all, we can create an optimization program after filter > > > push > > > down program to collect the statistics. This could avoid collecting > > > statistics multiple times. > > > > > > > > > Best, > > > Godfrey > > > > > > Jingsong Li <jingsongl...@gmail.com> 于2022年5月13日周五 22:44写道: > > > > > > > > Thank Godfrey for driving. > > > > > > > > Looks very good~ This will undoubtedly greatly enhance the various > > batch > > > > mode connectors. > > > > > > > > I left some comments: > > > > > > > > ## FileBasedStatisticsReportableDecodingFormat > > > > > > > > One concern I have is that we read the footer for each file, and this > > may > > > > be a bit costly in some cases. Is it possible for us to have some > > > > hierarchical way, e.g. > > > > - No statistics are collected for files by default. > > > > - SIZE: Generate statistics based on file Size, get the size of the > > file > > > > only with access to the master of the FileSystem. > > > > - DETAILED: Get the complete statistics by format, possibly by > > accessing > > > > the footer of the file. > > > > > > > > ## When use the statistics reported by connector > > > > > > > > > When partitions are pruned by PushPartitionIntoTableSourceScanRule, > > the > > > > statistics should also be updated. > > > > > > > > I understand that we definitely need to use reporter after the > > partition > > > > prune, but another question: is the filter pushdown already happening > > at > > > > this time? > > > > Can we make sure that in the following three cases, both the filter > > > > pushdown and the partition prune happen before the stats reporting. > > > > - only partition prune happens > > > > - only filter pushdown happens > > > > - both filter pushdown and partition prune happen > > > > > > > > Best, > > > > Jingsong > > > > > > > > On Fri, May 13, 2022 at 6:57 PM godfrey he <godfre...@gmail.com> > > wrote: > > > > > > > > > Hi all, > > > > > > > > > > I would like to open a discussion on FLIP-231: Introduce > > > > > SupportStatisticReport > > > > > to support reporting statistics from source connectors. > > > > > > > > > > Statistics are one of the most important inputs to the optimizer. > > > > > Accurate and complete statistics allows the optimizer to be more > > > powerful. > > > > > Currently, the statistics of Flink SQL come from Catalog only, > > > > > while many Connectors have the ability to provide statistics, e.g. > > > > > FileSystem. > > > > > In production, we find many tables in Catalog do not have any > > > statistics. > > > > > As a result, the optimizer can't generate better execution plans, > > > > > especially for Batch jobs. > > > > > > > > > > There are two approaches to enhance statistics for the planner, > > > > > one is to introduce the "ANALYZE TABLE" syntax which will write > > > > > the analyzed result to the catalog, another is to introduce a new > > > > > connector interface > > > > > which allows the connector itself to report statistics directly to > > the > > > > > planner. > > > > > The second one is a supplement to the catalog statistics. > > > > > > > > > > Here, we will discuss the second approach. Compared to the first one, > > > > > the second one is to get statistics in real time, no need to run an > > > > > analysis job for each table. This could help improve the user > > > > > experience. > > > > > (We will also introduce the "ANALYZE TABLE" syntax in other FLIP.) > > > > > > > > > > You can find more details in FLIP-231 document[1]. Looking forward to > > > > > your feedback. > > > > > > > > > > [1] > > > > > > > > > > https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=211883860&draftShareId=eda17eaa-43f9-4dc1-9a7d-3a9b5a4bae00& > > > > > [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-231 > > > > > > > > > > > > > > > Best, > > > > > Godfrey > > > > > > > > > >