Hi all,

I would like to open a discussion on FLIP-231: Introduce SupportStatisticReport
to support reporting statistics from source connectors.

Statistics are one of the most important inputs to the optimizer.
Accurate and complete statistics allow the optimizer to choose better plans.
Currently, Flink SQL obtains statistics only from the Catalog,
while many connectors are able to provide statistics themselves,
e.g. FileSystem.
In production, we find that many tables in the Catalog do not have any
statistics. As a result, the optimizer can't generate better execution plans,
especially for batch jobs.

There are two approaches to enhancing statistics for the planner.
One is to introduce the "ANALYZE TABLE" syntax, which writes
the analyzed result to the catalog. The other is to introduce a new
connector interface that allows the connector itself to report
statistics directly to the planner.
The second approach is a supplement to the catalog statistics.

Here, we will discuss the second approach. Compared to the first one,
the second approach obtains statistics in real time, so there is no need
to run an analysis job for each table. This could help improve the user
experience.
(We will also introduce the "ANALYZE TABLE" syntax in a separate FLIP.)
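
To make the idea a bit more concrete, here is a minimal sketch in plain
Java of how a connector-side reporting interface could sit next to the
catalog statistics. Every name in it (StatisticReporter, TableStats,
FileSource, resolve) is hypothetical and chosen only for illustration;
it is not the API proposed in the FLIP. The fallback order (catalog
first, connector only when the catalog has nothing) is also just an
assumption, based on the connector statistics being a supplement.

```java
// Illustrative sketch only: all types here are hypothetical stand-ins,
// not Flink classes and not the interface proposed in FLIP-231.
public class StatisticReportSketch {

    /** Simplified table statistics; a negative row count means "unknown". */
    static final class TableStats {
        static final TableStats UNKNOWN = new TableStats(-1L);
        final long rowCount;

        TableStats(long rowCount) {
            this.rowCount = rowCount;
        }

        boolean isKnown() {
            return rowCount >= 0;
        }
    }

    /** Hypothetical ability interface a source connector could implement. */
    interface StatisticReporter {
        TableStats reportStatistics();
    }

    /** Toy file-based source that estimates the row count from file sizes. */
    static final class FileSource implements StatisticReporter {
        private final long totalBytes;
        private final long avgRowBytes;

        FileSource(long totalBytes, long avgRowBytes) {
            this.totalBytes = totalBytes;
            this.avgRowBytes = avgRowBytes;
        }

        @Override
        public TableStats reportStatistics() {
            if (totalBytes < 0 || avgRowBytes <= 0) {
                return TableStats.UNKNOWN;
            }
            return new TableStats(totalBytes / avgRowBytes);
        }
    }

    /**
     * Planner-side resolution (assumed order): use catalog statistics when
     * present, otherwise ask the source if it is able to report any.
     */
    static TableStats resolve(TableStats catalogStats, Object source) {
        if (catalogStats.isKnown()) {
            return catalogStats;
        }
        if (source instanceof StatisticReporter) {
            return ((StatisticReporter) source).reportStatistics();
        }
        return TableStats.UNKNOWN;
    }

    public static void main(String[] args) {
        // Catalog has no statistics, so the source's estimate is used.
        FileSource source = new FileSource(1000L, 100L);
        TableStats stats = resolve(TableStats.UNKNOWN, source);
        System.out.println("rowCount = " + stats.rowCount);
    }
}
```

With a catalog that has no statistics, the planner in this sketch ends
up with the estimate from the source, without a separate analysis job.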

You can find more details in the FLIP-231 document [1]. Looking forward to
your feedback.

[1] https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=211883860&draftShareId=eda17eaa-43f9-4dc1-9a7d-3a9b5a4bae00&;
[2] POC: https://github.com/godfreyhe/flink/tree/FLIP-231


Best,
Godfrey
