Hi, everyone.

Thanks for all the inputs.
Since there is no feedback any more, I will start the vote tomorrow.

Best,
Godfrey

godfrey he <godfre...@gmail.com> 于2022年6月1日周三 13:30写道:
>
> Hi, Jing.
>
> Thanks for the suggestion, I have updated the doc and
> will continue to optimize the code in subsequent PR.
>
> Best,
> Godfrey
>
> Jing Ge <j...@ververica.com> 于2022年6月1日周三 04:41写道:
> >
> > Hi Godfrey,
> >
> > Thanks for clarifying it. I personally prefer the new change you suggested.
> >
> > Would you please help to understand one more thing? The else if
> > (filterPushDownSpec != null) branch is the only branch that doesn't have to
> > check if the newTableStat has been calculated previously. The reason might
> > be, after the filter has been pushed down to the table source,
> > ((SupportStatisticReport) tableSource).reportStatistics() will return a new
> > TableStats, which turns out the newTableStat has to be re-computed for each
> > filter push down. In this case, it might be good to add it into the FLIP
> > description. Otherwise, We could also add optimization for this branch to
> > avoid re-computing the table statistics.
> >
> > NIT: There are many conditions in the nested if-else statements. In order
> > to improve the readability (and maintainability in the future), we could
> > consider moving up some common checks like collectStatEnabled or
> > tableSource instanceof SupportStatisticReport, e.g.:
> >
> >  private LogicalTableScan collectStatistics(LogicalTableScan scan) {
> >       ......
> >      FlinkStatistic newStatistic = FlinkStatistic.builder()
> > .statistic(table.getStatistic()) .tableStats(refreshTableStat(...))
> > .build();
> >      return new LogicalTableScan( scan.getCluster(), scan.getTraitSet(),
> > scan.getHints(), table.copy(newStatistic));
> > }
> >
> > private TableStats refreshTableStat(boolean collectStatEnabled,
> > TableSourceTable table , PartitionPushDownSpec partitionPushDownSpec,
> > FilterPushDownSpec filterPushDownSpec) {
> >
> >      if (!collectStatEnabled)   return null;
> >
> >      if (!(table.tableSource() instanceof SupportStatisticReport)) return
> > null;
> >
> >      SupportStatisticReport tableSource =
> > (SupportStatisticReport)table.tableSource();
> >
> >       if (filterPushDownSpec != null) {
> >              // filter push down, no matter  partition push down or not
> >              return  tableSource.reportStatistics();
> >      } else {
> >          if (partitionPushDownSpec != null) {
> >         // partition push down, while filter not
> >          } else {
> >          // no partition and filter push down
> >              return table.getStatistic().getTableStats() ==
> > TableStats.UNKNOWN ? tableSource.reportStatistics() :
> > table.getStatistic().getTableStats();
> >         }
> >      }
> > }
> >
> > This just improved a little bit without introducing some kind of
> > Action/Performer interface with many subclasses and factory class to get
> > rid of some if-else statements, which could optionally be the next step
> > provement in the future.
> >
> > Best regards,
> > Jing
> >
> > On Tue, May 31, 2022 at 3:42 PM godfrey he <godfre...@gmail.com> wrote:
> >
> > > Hi Jark and Jing,
> > >
> > > +1 to use "report" instead of "collect".
> > >
> > > >  // only filter push down (*the description means
> > > > partitionPushDownSpec == null but misses the case of
> > > > partitionPushDownSpec != null*)
> > >
> > > `if (partitionPushDownSpec != null && filterPushDownSpec == null)`
> > > this branch is only consider that the partition is partition is pushed
> > > down,
> > > but no filter is push down. The planner will collect the statistics
> > > from catalog first,
> > > and then try to collect the statistics from collectors if the catalog
> > > statistics is unknown.
> > >
> > > `else if (filterPushDownSpec != null)` this branch means  whether
> > > the partitionPushDownSpec is null or not, the planner will collect
> > > statistics from
> > > collectors only, because the catalog do not support get statistics with
> > > filters.
> > >
> > > `else if (collectStatEnabled
> > >                 && (table.getStatistic().getTableStats() ==
> > > TableStats.UNKNOWN)
> > >                 && tableSource instanceof SupportStatisticReport)`
> > > this branch means no partition and no filter are pushed down.
> > >
> > > or we can change the pseudocode to:
> > >  if (filterPushDownSpec != null) {
> > >    // filter push down, no mater  partition push down or not
> > > } else {
> > >     if (partitionPushDownSpec != null) {
> > >         // partition push down, while filter not
> > >      } else {
> > >          // no partition and filter push down
> > >      }
> > > }
> > >
> > > Best,
> > > Godfrey
> > >
> > > Jing Ge <j...@ververica.com> 于2022年5月29日周日 08:17写道:
> > > >
> > > > Hi Godfrey,
> > > >
> > > > Thanks for driving this FLIP.  It looks really good! Looking forward to
> > > it!
> > > >
> > > > If I am not mistaken, partition pruning could also happen in the
> > > following
> > > > pseudocode condition block:
> > > >
> > > > else if (filterPushDownSpec != null) {
> > > >             // only filter push down (*the description means
> > > > partitionPushDownSpec == null but misses the case of
> > > > partitionPushDownSpec != null*)
> > > >
> > > >             // the catalog do not support get statistics with filters,
> > > >             // so only call reportStatistics method if needed
> > > >             if (collectStatEnabled && tableSource instanceof
> > > > SupportStatisticReport) {
> > > >                 newTableStat = ((SupportStatisticReport)
> > > > tableSource).reportStatistics();
> > > >             }
> > > >
> > > >
> > > > Best regards,
> > > >
> > > > Jing
> > > >
> > > >
> > > > On Sat, May 28, 2022 at 5:09 PM Jark Wu <imj...@gmail.com> wrote:
> > > >
> > > > > Hi Godfrey,
> > > > >
> > > > > It seems that the "SupportStatisticReport" interface name and
> > > > >  "table.optimizer.source.connect-statistics-enabled" option name is 
> > > > > not
> > > > > updated in the FLIP.
> > > > >
> > > > > Besides, in the terms of the option name, the meaning of
> > > > > "source.statistics-type"
> > > > > is not very straightforward and clean to me. Maybe
> > > > > "source.report-statistics" = "none/all/file-size"
> > > > > would be better.
> > > > >
> > > > > We can also change "table.optimizer.source.connect-statistics-enabled"
> > > to
> > > > > "table.optimizer.source.report-statistics-enabled" for alignment and
> > > > >  it's clear that one for fine-grained and one for coarse-grained.
> > > > >
> > > > >
> > > > > Best,
> > > > > Jark
> > > > >
> > > > > On Fri, 27 May 2022 at 22:58, godfrey he <godfre...@gmail.com> wrote:
> > > > >
> > > > > > Hi, everyone.
> > > > > >
> > > > > > Thanks for all the inputs.
> > > > > > If there is no more feedback, I think we can start the vote next
> > > monday.
> > > > > >
> > > > > > Best,
> > > > > > Godfrey
> > > > > >
> > > > > > Martijn Visser <martijnvis...@apache.org> 于2022年5月25日周三 19:46写道:
> > > > > > >
> > > > > > > Hi Godfrey,
> > > > > > >
> > > > > > > Thanks for creating the FLIP. I have no comments.
> > > > > > >
> > > > > > > Best regards,
> > > > > > >
> > > > > > > Martijn
> > > > > > >
> > > > > > >
> > > > > > > On Tue, 17 May 2022 at 04:52, Jingsong Li <jingsongl...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Godfrey,
> > > > > > > >
> > > > > > > > Thanks for your reply.
> > > > > > > >
> > > > > > > > Sounds good to me.
> > > > > > > >
> > > > > > > > > I think we should also introduce a config option
> > > > > > > >
> > > > > > > > We can add this option to the FLIP. I prefer a option for
> > > > > > > > FileSystemConnector, maybe a enum.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Jingsong
> > > > > > > >
> > > > > > > > On Tue, May 17, 2022 at 10:31 AM godfrey he <godfre...@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Jingsong,
> > > > > > > > >
> > > > > > > > > Thanks for the feedback.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >One concern I have is that we read the footer for each file,
> > > and
> > > > > > this
> > > > > > > > may
> > > > > > > > > >be a bit costly in some cases. Is it possible for us to have
> > > some
> > > > > > > > > > hierarchical way
> > > > > > > > > yes, if there are thousands of orc/parquet files, it may take 
> > > > > > > > > a
> > > > > long
> > > > > > > > time.
> > > > > > > > > So we can introduce a config option to let the user choose the
> > > > > > > > > granularity of the statistics.
> > > > > > > > > But the SIZE will not be introduced, because the planner does
> > > not
> > > > > use
> > > > > > > > > the file size statistics now.
> > > > > > > > > We can introduce once file size statistics is introduce in the
> > > > > > future.
> > > > > > > > > I think we should also introduce a config option to
> > > enable/disable
> > > > > > > > > SupportStatisticReport,
> > > > > > > > > because it's a heavy operation for some connectors in some
> > > cases.
> > > > > > > > >
> > > > > > > > > > is the filter pushdown already happening at
> > > > > > > > > > this time?
> > > > > > > > > That's a good point. Currently, the filter push down is after
> > > > > > partition
> > > > > > > > > pruning
> > > > > > > > > to prevent the filter push down rule from consuming the
> > > partition
> > > > > > > > > predicates.
> > > > > > > > > The statistics will be set to unknown if filter is pushed down
> > > now.
> > > > > > > > > To combine them all, we can create an optimization program
> > > after
> > > > > > filter
> > > > > > > > > push
> > > > > > > > > down program to collect the statistics. This could avoid
> > > collecting
> > > > > > > > > statistics multiple times.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Godfrey
> > > > > > > > >
> > > > > > > > > Jingsong Li <jingsongl...@gmail.com> 于2022年5月13日周五 22:44写道:
> > > > > > > > > >
> > > > > > > > > > Thank Godfrey for driving.
> > > > > > > > > >
> > > > > > > > > > Looks very good~ This will undoubtedly greatly enhance the
> > > > > various
> > > > > > > > batch
> > > > > > > > > > mode connectors.
> > > > > > > > > >
> > > > > > > > > > I left some comments:
> > > > > > > > > >
> > > > > > > > > > ## FileBasedStatisticsReportableDecodingFormat
> > > > > > > > > >
> > > > > > > > > > One concern I have is that we read the footer for each file,
> > > and
> > > > > > this
> > > > > > > > may
> > > > > > > > > > be a bit costly in some cases. Is it possible for us to have
> > > some
> > > > > > > > > > hierarchical way, e.g.
> > > > > > > > > > - No statistics are collected for files by default.
> > > > > > > > > > - SIZE: Generate statistics based on file Size, get the size
> > > of
> > > > > the
> > > > > > > > file
> > > > > > > > > > only with access to the master of the FileSystem.
> > > > > > > > > > - DETAILED: Get the complete statistics by format, possibly
> > > by
> > > > > > > > accessing
> > > > > > > > > > the footer of the file.
> > > > > > > > > >
> > > > > > > > > > ## When use the statistics reported by connector
> > > > > > > > > >
> > > > > > > > > > > When partitions are pruned by
> > > > > > PushPartitionIntoTableSourceScanRule,
> > > > > > > > the
> > > > > > > > > > statistics should also be updated.
> > > > > > > > > >
> > > > > > > > > > I understand that we definitely need to use reporter after
> > > the
> > > > > > > > partition
> > > > > > > > > > prune, but another question: is the filter pushdown already
> > > > > > happening
> > > > > > > > at
> > > > > > > > > > this time?
> > > > > > > > > > Can we make sure that in the following three cases, both the
> > > > > filter
> > > > > > > > > > pushdown and the partition prune happen before the stats
> > > > > reporting.
> > > > > > > > > > - only partition prune happens
> > > > > > > > > > - only filter pushdown happens
> > > > > > > > > > - both filter pushdown and partition prune happen
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Jingsong
> > > > > > > > > >
> > > > > > > > > > On Fri, May 13, 2022 at 6:57 PM godfrey he <
> > > godfre...@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > I would like to open a discussion on FLIP-231:  Introduce
> > > > > > > > > > > SupportStatisticReport
> > > > > > > > > > > to support reporting statistics from source connectors.
> > > > > > > > > > >
> > > > > > > > > > > Statistics are one of the most important inputs to the
> > > > > optimizer.
> > > > > > > > > > > Accurate and complete statistics allows the optimizer to 
> > > > > > > > > > > be
> > > > > more
> > > > > > > > > powerful.
> > > > > > > > > > > Currently, the statistics of Flink SQL come from Catalog
> > > only,
> > > > > > > > > > > while many Connectors have the ability to provide
> > > statistics,
> > > > > > e.g.
> > > > > > > > > > > FileSystem.
> > > > > > > > > > > In production, we find many tables in Catalog do not have
> > > any
> > > > > > > > > statistics.
> > > > > > > > > > > As a result, the optimizer can't generate better execution
> > > > > plans,
> > > > > > > > > > > especially for Batch jobs.
> > > > > > > > > > >
> > > > > > > > > > > There are two approaches to enhance statistics for the
> > > planner,
> > > > > > > > > > > one is to introduce the "ANALYZE TABLE" syntax which will
> > > write
> > > > > > > > > > > the analyzed result to the catalog, another is to
> > > introduce a
> > > > > new
> > > > > > > > > > > connector interface
> > > > > > > > > > > which allows the connector itself to report statistics
> > > directly
> > > > > > to
> > > > > > > > the
> > > > > > > > > > > planner.
> > > > > > > > > > > The second one is a supplement to the catalog statistics.
> > > > > > > > > > >
> > > > > > > > > > > Here, we will discuss the second approach. Compared to the
> > > > > first
> > > > > > one,
> > > > > > > > > > > the second one is to get statistics in real time, no need
> > > to
> > > > > run
> > > > > > an
> > > > > > > > > > > analysis job for each table. This could help improve the
> > > user
> > > > > > > > > > > experience.
> > > > > > > > > > > (We will also introduce the "ANALYZE TABLE" syntax in 
> > > > > > > > > > > other
> > > > > > FLIP.)
> > > > > > > > > > >
> > > > > > > > > > > You can find more details in FLIP-231 document[1]. Looking
> > > > > > forward to
> > > > > > > > > > > your feedback.
> > > > > > > > > > >
> > > > > > > > > > > [1]
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=211883860&draftShareId=eda17eaa-43f9-4dc1-9a7d-3a9b5a4bae00&;
> > > > > > > > > > > [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-231
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Godfrey
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > >

Reply via email to