sounds good to me. Thanks!

Best regards,
Jing
On Fri, Jun 17, 2022 at 5:37 AM godfrey he <godfre...@gmail.com> wrote:
> Hi, Jing.
>
> Thanks for the feedback.
>
> >When will the converted SELECT statement of the ANALYZE TABLE be
> >submitted? right after the CREATE TABLE?
> The SELECT job will be submitted only when `ANALYZE TABLE` is executed;
> it has nothing to do with CREATE TABLE, because `ANALYZE TABLE`
> is triggered manually as needed.
>
> >Will it be submitted periodically to keep the statistical data
> >up-to-date, since the data might be mutable?
> `ANALYZE TABLE` is triggered manually as needed.
> I will update the doc.
>
> >It might not be strong enough to avoid human error
> >I would suggest using FOR ALL PARTITIONS explicitly
> >just like FOR ALL COLUMNS.
> Agree, specifying `PARTITION` explicitly is friendlier
> and safer. I prefer to use `PARTITION(ds, hr)` without
> specific partition values; Hive has similar syntax.
> WDYT?
>
> Best,
> Godfrey
>
> On Thu, Jun 16, 2022 at 03:53, Jing Ge <j...@ververica.com> wrote:
> >
> > Hi Godfrey,
> >
> > Thanks for driving this! There are some areas where I couldn't find
> > enough information in the FLIP; just wondering if I could get more
> > explanation from you w.r.t. the following questions:
> >
> > 1. When will the converted SELECT statement of the ANALYZE TABLE be
> > submitted? Right after the CREATE TABLE?
> >
> > 2. Will it be submitted periodically to keep the statistical data
> > up-to-date, since the data might be mutable?
> >
> > 3. "If no partition is specified, the statistics will be gathered for
> > all partitions" - I think this is fine for multi-level partitions,
> > e.g. PARTITION (ds='2022-06-01') means two partitions:
> > PARTITION (ds='2022-06-01', hr=1) and PARTITION (ds='2022-06-01', hr=2),
> > because it will save a lot of code and therefore help developers work
> > more efficiently. If we use this rule for top-level partitions, it
> > might not be strong enough to avoid human error, e.g. a developer
> > might trigger a huge selection on a table with many partitions if
> > they forget to write the partition in the ANALYZE TABLE script. In
> > this case, I would suggest using FOR ALL PARTITIONS explicitly,
> > just like FOR ALL COLUMNS.
> >
> > Best regards,
> > Jing
> >
> > On Wed, Jun 15, 2022 at 10:16 AM godfrey he <godfre...@gmail.com> wrote:
> > >
> > > Hi Jark,
> > >
> > > Thanks for the inputs.
> > >
> > > >Do we need to provide DESC EXTENDED <table_name> statement like
> > > >Spark[1] to show statistics for table/partition/columns?
> > > We already support the `DESC EXTENDED` syntax, but currently only
> > > the table schema is displayed; I think we just need a JIRA to
> > > support it.
> > >
> > > >is it possible to ignore execution mode and force using batch mode
> > > >for the statement?
> > > As I replied above, the semantics of `ANALYZE TABLE` do not
> > > distinguish between batch and streaming.
> > > It works for both, but the result for unbounded sources is
> > > meaningless. Currently, I throw an exception for streaming mode,
> > > and we can support streaming mode with bounded sources in the future.
> > >
> > > Best,
> > > Godfrey
> > >
> > > On Tue, Jun 14, 2022 at 17:56, Jark Wu <imj...@gmail.com> wrote:
> > > >
> > > > Hi Godfrey, thanks for starting this discussion, this is a great
> > > > feature for batch users.
> > > >
> > > > The FLIP looks good to me in general.
> > > >
> > > > I only have 2 comments:
> > > >
> > > > 1) How do users know whether the given table or partition contains
> > > > the required statistics?
> > > > Do we need to provide a DESC EXTENDED <table_name> statement like
> > > > Spark[1] to show statistics for table/partition/columns?
> > > >
> > > > 2) If ANALYZE TABLE can only run in batch mode, is it possible to
> > > > ignore the execution mode and force batch mode for the statement?
> > > > From my perspective, ANALYZE TABLE is an auxiliary statement
> > > > similar to SHOW TABLES but heavier, which doesn't care about the
> > > > environment execution mode.
> > > >
> > > > Best,
> > > > Jark
> > > >
> > > > [1]:
> > > > https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-aux-analyze-table.html
> > > >
> > > > On Tue, 14 Jun 2022 at 13:52, Jing Ge <j...@ververica.com> wrote:
> > > > >
> > > > > Hi 华宗,
> > > > >
> > > > > In order to unsubscribe, please send an email to
> > > > > dev-unsubscr...@flink.apache.org
> > > > >
> > > > > Thanks
> > > > >
> > > > > Best regards,
> > > > > Jing
> > > > >
> > > > > On Tue, Jun 14, 2022 at 2:05 AM 华宗 <zhanghuaz...@126.com> wrote:
> > > > > >
> > > > > > Unsubscribe
> > > > > >
> > > > > > At 2022-06-13 22:44:24, "cao zou" <zoucao...@gmail.com> wrote:
> > > > > > >Hi godfrey, thanks for your detailed explanation.
> > > > > > >After your explanation and a glance over FLIP-231, I think
> > > > > > >it is really needed, +1 for this and looking forward to it.
> > > > > > >
> > > > > > >best
> > > > > > >zoucao
> > > > > > >
> > > > > > >On Mon, Jun 13, 2022 at 14:43, godfrey he <godfre...@gmail.com> wrote:
> > > > > > >
> > > > > > >> Hi Ingo,
> > > > > > >>
> > > > > > >> The semantics do not distinguish between batch and streaming.
> > > > > > >> It works for both, but the result for unbounded sources is
> > > > > > >> meaningless. Currently, I throw an exception for streaming
> > > > > > >> mode, and we can support streaming mode with bounded
> > > > > > >> sources in the future.
> > > > > > >>
> > > > > > >> Best,
> > > > > > >> Godfrey
> > > > > > >>
> > > > > > >> On Mon, Jun 13, 2022 at 14:17, Ingo Bürk <airbla...@apache.org> wrote:
> > > > > > >> >
> > > > > > >> > Hi Godfrey,
> > > > > > >> >
> > > > > > >> > thank you for the explanation. A SELECT is definitely
> > > > > > >> > more generic and will work for all connectors
> > > > > > >> > automatically. As such I think it's a good baseline
> > > > > > >> > solution regardless.
> > > > > > >> >
> > > > > > >> > We can also think about allowing connector-specific
> > > > > > >> > optimizations in the future, but I do like your idea of
> > > > > > >> > letting the optimizer rules perform a lot of the work
> > > > > > >> > here already by leveraging existing optimizations.
> > > > > > >> > Similarly, things like non-null counts of non-nullable
> > > > > > >> > columns would (or at least could) be handled by the
> > > > > > >> > optimizer rules already.
> > > > > > >> >
> > > > > > >> > So as far as that point goes, +1 to the generic approach.
> > > > > > >> >
> > > > > > >> > One more point, though: in general we should avoid
> > > > > > >> > supporting features only in specific modes, as it breaks
> > > > > > >> > the unification promise. Given that ANALYZE is a manual
> > > > > > >> > and completely optional operation, I'm OK with doing
> > > > > > >> > that here in principle. However, I wonder what will
> > > > > > >> > happen in the streaming / unbounded case. Do you plan to
> > > > > > >> > throw an error? Or do we complete the command as
> > > > > > >> > successful but without doing anything?
> > > > > > >> >
> > > > > > >> > Best
> > > > > > >> > Ingo
> > > > > > >> >
> > > > > > >> > On 13.06.22 05:50, godfrey he wrote:
> > > > > > >> > > Hi Ingo,
> > > > > > >> > >
> > > > > > >> > > Thanks for the inputs.
> > > > > > >> > >
> > > > > > >> > > I think converting `ANALYZE TABLE` to a `SELECT`
> > > > > > >> > > statement is the more generic approach. Because query
> > > > > > >> > > plan optimization is more generic, we can provide more
> > > > > > >> > > optimization rules to optimize not only the `SELECT`
> > > > > > >> > > statement converted from `ANALYZE TABLE` but also
> > > > > > >> > > `SELECT` statements written by users.
> > > > > > >> > >
> > > > > > >> > >> JDBC connector can get a row count estimate without
> > > > > > >> > >> performing a SELECT COUNT(1)
> > > > > > >> > > To optimize such cases, we can implement a rule to
> > > > > > >> > > push the aggregate into the table source. Currently,
> > > > > > >> > > there is a similar ability: SupportsAggregatePushDown,
> > > > > > >> > > which supports only pushing local aggregates into the
> > > > > > >> > > source for now.
> > > > > > >> > >
> > > > > > >> > > Best,
> > > > > > >> > > Godfrey
> > > > > > >> > >
> > > > > > >> > > On Fri, Jun 10, 2022 at 17:15, Ingo Bürk <airbla...@apache.org> wrote:
> > > > > > >> > >>
> > > > > > >> > >> Hi Godfrey,
> > > > > > >> > >>
> > > > > > >> > >> compared to the solution proposed in the FLIP (using
> > > > > > >> > >> a SELECT statement), I wonder if you have considered
> > > > > > >> > >> adding APIs to catalogs / connectors to perform this
> > > > > > >> > >> task as an alternative?
> > > > > > >> > >> I could imagine that for many connectors, statistics
> > > > > > >> > >> could be implemented in a less expensive way by
> > > > > > >> > >> leveraging the underlying system (e.g. a JDBC
> > > > > > >> > >> connector can get a row count estimate without
> > > > > > >> > >> performing a SELECT COUNT(1)).
> > > > > > >> > >>
> > > > > > >> > >> Best
> > > > > > >> > >> Ingo
> > > > > > >> > >>
> > > > > > >> > >> On 10.06.22 09:53, godfrey he wrote:
> > > > > > >> > >>> Hi all,
> > > > > > >> > >>>
> > > > > > >> > >>> I would like to open a discussion on FLIP-240:
> > > > > > >> > >>> Introduce "ANALYZE TABLE" Syntax.
> > > > > > >> > >>>
> > > > > > >> > >>> As FLIP-231 mentioned, statistics are one of the
> > > > > > >> > >>> most important inputs to the optimizer. Accurate and
> > > > > > >> > >>> complete statistics allow the optimizer to be more
> > > > > > >> > >>> powerful. "ANALYZE TABLE" syntax is a very common
> > > > > > >> > >>> but effective approach to gather statistics, and has
> > > > > > >> > >>> already been introduced by many compute engines and
> > > > > > >> > >>> databases.
> > > > > > >> > >>>
> > > > > > >> > >>> The main purpose of this discussion is to introduce
> > > > > > >> > >>> "ANALYZE TABLE" syntax for Flink SQL.
> > > > > > >> > >>>
> > > > > > >> > >>> You can find more details in the FLIP-240
> > > > > > >> > >>> document[1]. Looking forward to your feedback.
> > > > > > >> > >>>
> > > > > > >> > >>> [1]
> > > > > > >> > >>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=217386481
> > > > > > >> > >>> [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-240
> > > > > > >> > >>>
> > > > > > >> > >>> Best,
> > > > > > >> > >>> Godfrey