Hi, Jing. Thanks for the suggestion, I have updated the doc and will continue to optimize the code in subsequent PR.
Best, Godfrey Jing Ge <j...@ververica.com> 于2022年6月1日周三 04:41写道: > > Hi Godfrey, > > Thanks for clarifying it. I personally prefer the new change you suggested. > > Would you please help to understand one more thing? The else if > (filterPushDownSpec != null) branch is the only branch that doesn't have to > check if the newTableStat has been calculated previously. The reason might > be, after the filter has been pushed down to the table source, > ((SupportStatisticReport) tableSource).reportStatistics() will return a new > TableStats, which turns out the newTableStat has to be re-computed for each > filter push down. In this case, it might be good to add it into the FLIP > description. Otherwise, We could also add optimization for this branch to > avoid re-computing the table statistics. > > NIT: There are many conditions in the nested if-else statements. In order > to improve the readability (and maintainability in the future), we could > consider moving up some common checks like collectStatEnabled or > tableSource instanceof SupportStatisticReport, e.g.: > > private LogicalTableScan collectStatistics(LogicalTableScan scan) { > ...... > FlinkStatistic newStatistic = FlinkStatistic.builder() > .statistic(table.getStatistic()) .tableStats(refreshTableStat(...)) > .build(); > return new LogicalTableScan( scan.getCluster(), scan.getTraitSet(), > scan.getHints(), table.copy(newStatistic)); > } > > private TableStats refreshTableStat(boolean collectStatEnabled, > TableSourceTable table , PartitionPushDownSpec partitionPushDownSpec, > FilterPushDownSpec filterPushDownSpec) { > > if (!collectStatEnabled) return null; > > if (!(table.tableSource() instanceof SupportStatisticReport)) return > null; > > SupportStatisticReport tableSource = > (SupportStatisticReport)table.tableSource(); > > if (filterPushDownSpec != null) { > // filter push down, no matter partition push down or not > return tableSource.reportStatistics(); > } else { > if (partitionPushDownSpec != null) { > // partition push down, while filter not > } else { > // no partition and filter push down > return table.getStatistic().getTableStats() == > TableStats.UNKNOWN ? tableSource.reportStatistics() : > table.getStatistic().getTableStats(); > } > } > } > > This just improved a little bit without introducing some kind of > Action/Performer interface with many subclasses and factory class to get > rid of some if-else statements, which could optionally be the next step > provement in the future. > > Best regards, > Jing > > On Tue, May 31, 2022 at 3:42 PM godfrey he <godfre...@gmail.com> wrote: > > > Hi Jark and Jing, > > > > +1 to use "report" instead of "collect". > > > > > // only filter push down (*the description means > > > partitionPushDownSpec == null but misses the case of > > > partitionPushDownSpec != null*) > > > > `if (partitionPushDownSpec != null && filterPushDownSpec == null)` > > this branch is only consider that the partition is partition is pushed > > down, > > but no filter is push down. The planner will collect the statistics > > from catalog first, > > and then try to collect the statistics from collectors if the catalog > > statistics is unknown. > > > > `else if (filterPushDownSpec != null)` this branch means whether > > the partitionPushDownSpec is null or not, the planner will collect > > statistics from > > collectors only, because the catalog do not support get statistics with > > filters. > > > > `else if (collectStatEnabled > > && (table.getStatistic().getTableStats() == > > TableStats.UNKNOWN) > > && tableSource instanceof SupportStatisticReport)` > > this branch means no partition and no filter are pushed down. > > > > or we can change the pseudocode to: > > if (filterPushDownSpec != null) { > > // filter push down, no mater partition push down or not > > } else { > > if (partitionPushDownSpec != null) { > > // partition push down, while filter not > > } else { > > // no partition and filter push down > > } > > } > > > > Best, > > Godfrey > > > > Jing Ge <j...@ververica.com> 于2022年5月29日周日 08:17写道: > > > > > > Hi Godfrey, > > > > > > Thanks for driving this FLIP. It looks really good! Looking forward to > > it! > > > > > > If I am not mistaken, partition pruning could also happen in the > > following > > > pseudocode condition block: > > > > > > else if (filterPushDownSpec != null) { > > > // only filter push down (*the description means > > > partitionPushDownSpec == null but misses the case of > > > partitionPushDownSpec != null*) > > > > > > // the catalog do not support get statistics with filters, > > > // so only call reportStatistics method if needed > > > if (collectStatEnabled && tableSource instanceof > > > SupportStatisticReport) { > > > newTableStat = ((SupportStatisticReport) > > > tableSource).reportStatistics(); > > > } > > > > > > > > > Best regards, > > > > > > Jing > > > > > > > > > On Sat, May 28, 2022 at 5:09 PM Jark Wu <imj...@gmail.com> wrote: > > > > > > > Hi Godfrey, > > > > > > > > It seems that the "SupportStatisticReport" interface name and > > > > "table.optimizer.source.connect-statistics-enabled" option name is not > > > > updated in the FLIP. > > > > > > > > Besides, in the terms of the option name, the meaning of > > > > "source.statistics-type" > > > > is not very straightforward and clean to me. Maybe > > > > "source.report-statistics" = "none/all/file-size" > > > > would be better. > > > > > > > > We can also change "table.optimizer.source.connect-statistics-enabled" > > to > > > > "table.optimizer.source.report-statistics-enabled" for alignment and > > > > it's clear that one for fine-grained and one for coarse-grained. > > > > > > > > > > > > Best, > > > > Jark > > > > > > > > On Fri, 27 May 2022 at 22:58, godfrey he <godfre...@gmail.com> wrote: > > > > > > > > > Hi, everyone. > > > > > > > > > > Thanks for all the inputs. > > > > > If there is no more feedback, I think we can start the vote next > > monday. > > > > > > > > > > Best, > > > > > Godfrey > > > > > > > > > > Martijn Visser <martijnvis...@apache.org> 于2022年5月25日周三 19:46写道: > > > > > > > > > > > > Hi Godfrey, > > > > > > > > > > > > Thanks for creating the FLIP. I have no comments. > > > > > > > > > > > > Best regards, > > > > > > > > > > > > Martijn > > > > > > > > > > > > > > > > > > On Tue, 17 May 2022 at 04:52, Jingsong Li <jingsongl...@gmail.com> > > > > > wrote: > > > > > > > > > > > > > Hi Godfrey, > > > > > > > > > > > > > > Thanks for your reply. > > > > > > > > > > > > > > Sounds good to me. > > > > > > > > > > > > > > > I think we should also introduce a config option > > > > > > > > > > > > > > We can add this option to the FLIP. I prefer a option for > > > > > > > FileSystemConnector, maybe a enum. > > > > > > > > > > > > > > Best, > > > > > > > Jingsong > > > > > > > > > > > > > > On Tue, May 17, 2022 at 10:31 AM godfrey he <godfre...@gmail.com > > > > > > > > wrote: > > > > > > > > > > > > > > > Hi Jingsong, > > > > > > > > > > > > > > > > Thanks for the feedback. > > > > > > > > > > > > > > > > > > > > > > > > >One concern I have is that we read the footer for each file, > > and > > > > > this > > > > > > > may > > > > > > > > >be a bit costly in some cases. Is it possible for us to have > > some > > > > > > > > > hierarchical way > > > > > > > > yes, if there are thousands of orc/parquet files, it may take a > > > > long > > > > > > > time. > > > > > > > > So we can introduce a config option to let the user choose the > > > > > > > > granularity of the statistics. > > > > > > > > But the SIZE will not be introduced, because the planner does > > not > > > > use > > > > > > > > the file size statistics now. > > > > > > > > We can introduce once file size statistics is introduce in the > > > > > future. > > > > > > > > I think we should also introduce a config option to > > enable/disable > > > > > > > > SupportStatisticReport, > > > > > > > > because it's a heavy operation for some connectors in some > > cases. > > > > > > > > > > > > > > > > > is the filter pushdown already happening at > > > > > > > > > this time? > > > > > > > > That's a good point. Currently, the filter push down is after > > > > > partition > > > > > > > > pruning > > > > > > > > to prevent the filter push down rule from consuming the > > partition > > > > > > > > predicates. > > > > > > > > The statistics will be set to unknown if filter is pushed down > > now. > > > > > > > > To combine them all, we can create an optimization program > > after > > > > > filter > > > > > > > > push > > > > > > > > down program to collect the statistics. This could avoid > > collecting > > > > > > > > statistics multiple times. > > > > > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > Godfrey > > > > > > > > > > > > > > > > Jingsong Li <jingsongl...@gmail.com> 于2022年5月13日周五 22:44写道: > > > > > > > > > > > > > > > > > > Thank Godfrey for driving. > > > > > > > > > > > > > > > > > > Looks very good~ This will undoubtedly greatly enhance the > > > > various > > > > > > > batch > > > > > > > > > mode connectors. > > > > > > > > > > > > > > > > > > I left some comments: > > > > > > > > > > > > > > > > > > ## FileBasedStatisticsReportableDecodingFormat > > > > > > > > > > > > > > > > > > One concern I have is that we read the footer for each file, > > and > > > > > this > > > > > > > may > > > > > > > > > be a bit costly in some cases. Is it possible for us to have > > some > > > > > > > > > hierarchical way, e.g. > > > > > > > > > - No statistics are collected for files by default. > > > > > > > > > - SIZE: Generate statistics based on file Size, get the size > > of > > > > the > > > > > > > file > > > > > > > > > only with access to the master of the FileSystem. > > > > > > > > > - DETAILED: Get the complete statistics by format, possibly > > by > > > > > > > accessing > > > > > > > > > the footer of the file. > > > > > > > > > > > > > > > > > > ## When use the statistics reported by connector > > > > > > > > > > > > > > > > > > > When partitions are pruned by > > > > > PushPartitionIntoTableSourceScanRule, > > > > > > > the > > > > > > > > > statistics should also be updated. > > > > > > > > > > > > > > > > > > I understand that we definitely need to use reporter after > > the > > > > > > > partition > > > > > > > > > prune, but another question: is the filter pushdown already > > > > > happening > > > > > > > at > > > > > > > > > this time? > > > > > > > > > Can we make sure that in the following three cases, both the > > > > filter > > > > > > > > > pushdown and the partition prune happen before the stats > > > > reporting. > > > > > > > > > - only partition prune happens > > > > > > > > > - only filter pushdown happens > > > > > > > > > - both filter pushdown and partition prune happen > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > Jingsong > > > > > > > > > > > > > > > > > > On Fri, May 13, 2022 at 6:57 PM godfrey he < > > godfre...@gmail.com> > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > Hi all, > > > > > > > > > > > > > > > > > > > > I would like to open a discussion on FLIP-231: Introduce > > > > > > > > > > SupportStatisticReport > > > > > > > > > > to support reporting statistics from source connectors. > > > > > > > > > > > > > > > > > > > > Statistics are one of the most important inputs to the > > > > optimizer. > > > > > > > > > > Accurate and complete statistics allows the optimizer to be > > > > more > > > > > > > > powerful. > > > > > > > > > > Currently, the statistics of Flink SQL come from Catalog > > only, > > > > > > > > > > while many Connectors have the ability to provide > > statistics, > > > > > e.g. > > > > > > > > > > FileSystem. > > > > > > > > > > In production, we find many tables in Catalog do not have > > any > > > > > > > > statistics. > > > > > > > > > > As a result, the optimizer can't generate better execution > > > > plans, > > > > > > > > > > especially for Batch jobs. > > > > > > > > > > > > > > > > > > > > There are two approaches to enhance statistics for the > > planner, > > > > > > > > > > one is to introduce the "ANALYZE TABLE" syntax which will > > write > > > > > > > > > > the analyzed result to the catalog, another is to > > introduce a > > > > new > > > > > > > > > > connector interface > > > > > > > > > > which allows the connector itself to report statistics > > directly > > > > > to > > > > > > > the > > > > > > > > > > planner. > > > > > > > > > > The second one is a supplement to the catalog statistics. > > > > > > > > > > > > > > > > > > > > Here, we will discuss the second approach. Compared to the > > > > first > > > > > one, > > > > > > > > > > the second one is to get statistics in real time, no need > > to > > > > run > > > > > an > > > > > > > > > > analysis job for each table. This could help improve the > > user > > > > > > > > > > experience. > > > > > > > > > > (We will also introduce the "ANALYZE TABLE" syntax in other > > > > > FLIP.) > > > > > > > > > > > > > > > > > > > > You can find more details in FLIP-231 document[1]. Looking > > > > > forward to > > > > > > > > > > your feedback. > > > > > > > > > > > > > > > > > > > > [1] > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=211883860&draftShareId=eda17eaa-43f9-4dc1-9a7d-3a9b5a4bae00& > > > > > > > > > > [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-231 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > Godfrey > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >