Re: [DISCUSS] FLIP-231: Introduce SupportStatisticReport to support reporting statistics from source connectors

godfrey he Mon, 23 May 2022 02:16:40 -0700

Hi, Jark

Thanks for the feedback.


> 1) All the ability interfaces begin with "Supports" instead of "Support".
+1

> The "connect" word should be "collect"?
Yes, it's a typo.

> CatalogStatistics
Yes, we should use TableStats.
I forgot that TableStats and ColumnStats have been ported to the API module.

> What's the difference between them?
table.optimizer.source.collect-statistics-enabled is used for all collectors,
while source.statistics-type is file base connectors.
It may take a long time to get the detailed statistics,
but may be the file size (will be introduced later) is enough.

> IMO, we should also support Hive source as well in this FLIP.
+1

Best,
Godfrey

Jark Wu <imj...@gmail.com> 于2022年5月20日周五 12:04写道：
>
> Hi Godfrey,
>
> I just left some comments here:
>
> 1) SupportStatisticReport => SupportsStatisticReport
> All the ability interfaces begin with "Supports" instead of "Support".
>
> 2) table.optimizer.source.connect-statistics-enabled
> The "connect" word should be "collect"?
>
> 3) CatalogStatistics
> I was a little confused when I first saw the name. I thought it reports
> stats for a catalog...
> Why not use "TableStats" which already wraps "ColumnStats" in it and is a
> public API as well?
>
> 4) source.statistics-type
> vs table.optimizer.source.collect-statistics-enabled
> What's the difference between them? It seems that they are both used to
> enable or disable reporting stats.
>
> 5) "Which connectors and formats will be supported by default?"
> IMO, we should also support Hive source as well in this FLIP.
> Hive source is more widely used than Filesystem connector.
>
> Best,
> Jark
>
>
>
>
> On Tue, 17 May 2022 at 10:52, Jingsong Li <jingsongl...@gmail.com> wrote:
>
> > Hi Godfrey,
> >
> > Thanks for your reply.
> >
> > Sounds good to me.
> >
> > > I think we should also introduce a config option
> >
> > We can add this option to the FLIP. I prefer a option for
> > FileSystemConnector, maybe a enum.
> >
> > Best,
> > Jingsong
> >
> > On Tue, May 17, 2022 at 10:31 AM godfrey he <godfre...@gmail.com> wrote:
> >
> > > Hi Jingsong,
> > >
> > > Thanks for the feedback.
> > >
> > >
> > > >One concern I have is that we read the footer for each file, and this
> > may
> > > >be a bit costly in some cases. Is it possible for us to have some
> > > > hierarchical way
> > > yes, if there are thousands of orc/parquet files, it may take a long
> > time.
> > > So we can introduce a config option to let the user choose the
> > > granularity of the statistics.
> > > But the SIZE will not be introduced, because the planner does not use
> > > the file size statistics now.
> > > We can introduce once file size statistics is introduce in the future.
> > > I think we should also introduce a config option to enable/disable
> > > SupportStatisticReport,
> > > because it's a heavy operation for some connectors in some cases.
> > >
> > > > is the filter pushdown already happening at
> > > > this time?
> > > That's a good point. Currently, the filter push down is after partition
> > > pruning
> > > to prevent the filter push down rule from consuming the partition
> > > predicates.
> > > The statistics will be set to unknown if filter is pushed down now.
> > > To combine them all, we can create an optimization program after filter
> > > push
> > > down program to collect the statistics. This could avoid collecting
> > > statistics multiple times.
> > >
> > >
> > > Best,
> > > Godfrey
> > >
> > > Jingsong Li <jingsongl...@gmail.com> 于2022年5月13日周五 22:44写道：
> > > >
> > > > Thank Godfrey for driving.
> > > >
> > > > Looks very good~ This will undoubtedly greatly enhance the various
> > batch
> > > > mode connectors.
> > > >
> > > > I left some comments:
> > > >
> > > > ## FileBasedStatisticsReportableDecodingFormat
> > > >
> > > > One concern I have is that we read the footer for each file, and this
> > may
> > > > be a bit costly in some cases. Is it possible for us to have some
> > > > hierarchical way, e.g.
> > > > - No statistics are collected for files by default.
> > > > - SIZE: Generate statistics based on file Size, get the size of the
> > file
> > > > only with access to the master of the FileSystem.
> > > > - DETAILED: Get the complete statistics by format, possibly by
> > accessing
> > > > the footer of the file.
> > > >
> > > > ## When use the statistics reported by connector
> > > >
> > > > > When partitions are pruned by PushPartitionIntoTableSourceScanRule,
> > the
> > > > statistics should also be updated.
> > > >
> > > > I understand that we definitely need to use reporter after the
> > partition
> > > > prune, but another question: is the filter pushdown already happening
> > at
> > > > this time?
> > > > Can we make sure that in the following three cases, both the filter
> > > > pushdown and the partition prune happen before the stats reporting.
> > > > - only partition prune happens
> > > > - only filter pushdown happens
> > > > - both filter pushdown and partition prune happen
> > > >
> > > > Best,
> > > > Jingsong
> > > >
> > > > On Fri, May 13, 2022 at 6:57 PM godfrey he <godfre...@gmail.com>
> > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I would like to open a discussion on FLIP-231:  Introduce
> > > > > SupportStatisticReport
> > > > > to support reporting statistics from source connectors.
> > > > >
> > > > > Statistics are one of the most important inputs to the optimizer.
> > > > > Accurate and complete statistics allows the optimizer to be more
> > > powerful.
> > > > > Currently, the statistics of Flink SQL come from Catalog only,
> > > > > while many Connectors have the ability to provide statistics, e.g.
> > > > > FileSystem.
> > > > > In production, we find many tables in Catalog do not have any
> > > statistics.
> > > > > As a result, the optimizer can't generate better execution plans,
> > > > > especially for Batch jobs.
> > > > >
> > > > > There are two approaches to enhance statistics for the planner,
> > > > > one is to introduce the "ANALYZE TABLE" syntax which will write
> > > > > the analyzed result to the catalog, another is to introduce a new
> > > > > connector interface
> > > > > which allows the connector itself to report statistics directly to
> > the
> > > > > planner.
> > > > > The second one is a supplement to the catalog statistics.
> > > > >
> > > > > Here, we will discuss the second approach. Compared to the first one,
> > > > > the second one is to get statistics in real time, no need to run an
> > > > > analysis job for each table. This could help improve the user
> > > > > experience.
> > > > > (We will also introduce the "ANALYZE TABLE" syntax in other FLIP.)
> > > > >
> > > > > You can find more details in FLIP-231 document[1]. Looking forward to
> > > > > your feedback.
> > > > >
> > > > > [1]
> > > > >
> > >
> > https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=211883860&draftShareId=eda17eaa-43f9-4dc1-9a7d-3a9b5a4bae00&;
> > > > > [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-231
> > > > >
> > > > >
> > > > > Best,
> > > > > Godfrey
> > > > >
> > >
> >

Re: [DISCUSS] FLIP-231: Introduce SupportStatisticReport to support reporting statistics from source connectors

Reply via email to