Hi Godfrey,
thank you for the explanation. A SELECT is definitely more generic and
will work for all connectors automatically. As such I think it's a good
baseline solution regardless.
We can also think about allowing connector-specific optimizations in the
future, but I do like your idea of letting the optimizer rules perform a
lot of the work here already by leveraging existing optimizations.
Similarly, things like non-null counts of non-nullable columns would (or
at least could) already be handled by the optimizer rules.
So as far as that point goes, +1 to the generic approach.
One more point, though: In general we should avoid supporting features
only in specific modes as it breaks the unification promise. Given that
ANALYZE is a manual and completely optional operation, I'm OK with doing
that here in principle. However, I wonder what will happen in the
streaming / unbounded case. Do you plan to throw an error? Or do we
report the command as successful without doing anything?
On 13.06.22 05:50, godfrey he wrote:
Hi Ingo,
Thanks for the inputs.
I think converting `ANALYZE TABLE` to a `SELECT` statement is the
more generic approach. Because query plan optimization is generic,
we can provide more optimization rules that optimize not only the `SELECT`
statement converted from `ANALYZE TABLE` but also `SELECT` statements written by users.
JDBC connector can get a row count estimate without performing a
To optimize such cases, we can implement a rule that pushes the aggregate
into the table source.
Currently, there is a similar capability: SupportsAggregatePushDown, which
for now only supports pushing
local aggregates into the source.
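
To make the conversion concrete, here is a minimal sketch of how an
`ANALYZE TABLE` request could be rewritten into a plain `SELECT` with
aggregates. The helper name `analyze_to_select`, the chosen statistics (row
count, null count, distinct count, min, max) and the column aliases are
assumptions for illustration only, not what the FLIP specifies.

```python
from typing import Sequence


def analyze_to_select(table: str, columns: Sequence[str]) -> str:
    """Build a SELECT that gathers basic statistics for `table`.

    Hypothetical helper: the statistics and alias names are illustrative.
    """
    # Table-level statistic: total row count.
    exprs = ["COUNT(1) AS row_count"]
    for col in columns:
        # COUNT(col) skips NULLs, so the difference is the null count.
        exprs.append(f"COUNT(1) - COUNT(`{col}`) AS `{col}_null_count`")
        exprs.append(f"COUNT(DISTINCT `{col}`) AS `{col}_ndv`")
        exprs.append(f"MIN(`{col}`) AS `{col}_min`")
        exprs.append(f"MAX(`{col}`) AS `{col}_max`")
    return "SELECT\n  " + ",\n  ".join(exprs) + f"\nFROM `{table}`"


sql = analyze_to_select("orders", ["amount"])
print(sql)
```

Because the result is an ordinary query, any optimizer rule (for example an
aggregate push-down into the source) would apply to it just as it would to a
user-written statement.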
On Fri, Jun 10, 2022 at 17:15, Ingo Bürk <airbla...@apache.org> wrote:
Hi Godfrey,
compared to the solution proposed in the FLIP (using a SELECT
statement), I wonder if you have considered adding APIs to catalogs /
connectors to perform this task as an alternative?
I could imagine that for many connectors, statistics could be
implemented in a less expensive way by leveraging the underlying system
(e.g. a JDBC connector can get a row count estimate without performing a
On 10.06.22 09:53, godfrey he wrote:
Hi all,
I would like to open a discussion on FLIP-240: Introduce "ANALYZE
TABLE" Syntax.
As FLIP-231 mentioned, statistics are one of the most important inputs
to the optimizer. Accurate and complete statistics allow the
optimizer to be more powerful. "ANALYZE TABLE" syntax is a common
and effective approach to gathering statistics, and it has already been
adopted by many compute engines and databases.
The main purpose of this discussion is to introduce "ANALYZE TABLE" syntax
for Flink SQL.
You can find more details in FLIP-240 document[1]. Looking forward to
your feedback.
[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=217386481
[2] POC: https://github.com/godfreyhe/flink/tree/FLIP-240