Hi @yuxia, Thank you for reviewing the FLIP and raising questions. 1: Is the PlanAnalyzerFactory also expected to be implemented by users just > like DynamicTableSourceFactory or other factories? If so, I notice that in > the code of PlanAnalyzerManager#registerAnalyzers, the code is as follows: > FactoryUtil.discoverFactory(classLoader, PlanAnalyzerFactory.class, > StreamPlanAnalyzerFactory.STREAM_IDENTIFIER)); IIUC, it'll always find the > factory with the name StreamPlanAnalyzerFactory.STREAM_IDENTIFIER; Is it a > typo or by design ?
This is a really good open question. For the short answer, yes, it is by design. I'll explain the consideration in more detail. The standard procedure to create a custom table source/sink is to implement the factory and the source/sink class. There is a strong 1v1 relationship between the factory and the source/sink. SQL DynamicTableSourceFactory Source create table … with (‘connector’ = ‘foo’) #factoryIdentifer.equals(“foo”) FooTableSource *Apart from that, the custom function module is another kind of implementation. The factory creates a collection of functions. This is a 1vN relationship between the factory and the functions.* SQL ModuleFactory Function load module ‘bar’ #factoryIdentifier.equals(“bar”) A collection of functions Back to the plan analyzers, if we choose the first style, we also need to expose a new SQL syntax to users, like "CREATE ANALYZER foo WITH ..." to specify the factory identifier. But I think it is too heavy because an analyzer is an auxiliary tool to help users write better queries, and thus it should be exposed at the API level other than the user syntax level. As a result, I propose to follow the second style. Then we don't need to introduce new syntax to create analyzers. Let StreamPlanAnalyzerFactory be the default factory to create analyzers under the streaming mode, and the custom analyzers will register themselves in StreamPlanAnalyzerFactory. @Override public List<PlanAnalyzer> createAnalyzers() { return Arrays.asList( FooAnalyzer.INSTANCE, BarAnalyzer.INSTANCE, ...); } 2: Is there any special reason make PlanAdvice be a final class? Would it > be better to make it an interface and we provide a default implementation? > My concern is some users may want have their own implementation for > PlanAdvice. But it may be overthinking. If you think it won't bring any > problem, I'm also fine with that. The reason why making PlanAdvice final is that I think users would prefer to implement the custom PlanAnalyzer than PlanAdvice. PlanAdvice is a POJO class to represent the analyzed result provided by PlanAnalyzer. 3: Is there a way only show advice? For me, it seems the advice will be > more useful and the nodes may contains to many details. The result contains two parts: the optimized physical plan itself + the analysis of the plan. For PlanAdvice with the scope as GLOBAL, it is possible to do so. While for a LOCAL scope, the advice content is specific to certain nodes (E.g., some certain rel nodes are sensitive to state TTL configuration). In this situation, the plan cannot be omitted. On the other hand, the plan is necessary from the visualization perspective. During the PoC phase, I made some attempts to adapt the Flink Visualizer to illustrate the analyzed plan, and it looks like the following pic. I think this is intuitive to help users understand their queries and what they can do according to the advice. 4: I'm curious about what't the global advice will look like. Is it > possible to provide an example? Here is an example to illustrate the non-deterministic update issue. create temporary table cdc_with_meta ( a int, b bigint, c string, d boolean, metadata_1 int metadata, metadata_2 string metadata, metadata_3 bigint metadata, primary key (a) not enforced ) with ( 'connector' = 'values', 'changelog-mode' = 'I,UA,UB,D', 'readable-metadata' = 'metadata_1:INT, metadata_2:STRING, metadata_3:BIGINT' ); create temporary table sink_without_pk ( a int, b bigint, c string ) with ( 'connector' = 'values', 'sink-insert-only' = 'false' ); insert into sink_without_pk select a, metadata_3, c from cdc_with_meta; And with compilation as SCHEMA, the result is as below. { "nodes" : [ { "id" : 1, "type" : "StreamPhysicalTableSourceScan", "digest" : "TableSourceScan(table=[[default_catalog, default_database, cdc_with_meta, project=[a, c], metadata=[metadata_3]]], fields=[a, c, metadata_3], upsertKeys=[[a]])", "changelog_mode" : "I,UB,UA,D" }, { "id" : 2, "type" : "StreamPhysicalCalc", "digest" : "Calc(select=[a, metadata_3, c], upsertKeys=[[a]])", "changelog_mode" : "I,UB,UA,D", "predecessors" : [ { "id" : 1, "distribution" : "ANY", "changelog_mode" : "I,UB,UA,D" } ] }, { "id" : 3, "type" : "StreamPhysicalSink", "digest" : "Sink(table=[default_catalog.default_database.sink_without_pk], fields=[a, metadata_3, c])", "changelog_mode" : "NONE", "predecessors" : [ { "id" : 2, "distribution" : "ANY", "changelog_mode" : "I,UB,UA,D" } ] } ], "advice" : [ { "kind" : "WARNING", "scope" : "GLOBAL", "content" : "The metadata column(s): 'metadata_3' in cdc source may cause wrong result or error on downstream operators, please consider removing these columns or use a non-cdc source that only has insert messages.\nsource node:\nTableSourceScan(table=[[default_catalog, default_database, cdc_with_meta, project=[a, c], metadata=[metadata_3]]], fields=[a, c, metadata_3], changelogMode=[I,UB,UA,D], upsertKeys=[[a]])\n" } ] } Best regards, Jane Chan On Tue, Dec 27, 2022 at 8:06 PM yuxia <luoyu...@alumni.sjtu.edu.cn> wrote: > Thanks for driving this FLIP. It should be a good improvement to users. > But I have few questions: > 1: Is the PlanAnalyzerFactory also expected to be implemented by users > just like DynamicTableSourceFactory or other factories? If so, I notice > that in the code of PlanAnalyzerManager#registerAnalyzers, the code is as > follows: > FactoryUtil.discoverFactory(classLoader, PlanAnalyzerFactory.class, > StreamPlanAnalyzerFactory.STREAM_IDENTIFIER)); > > IIUC, it'll always find the factory with the name > StreamPlanAnalyzerFactory.STREAM_IDENTIFIER; Is it a typo or by design ? > > 2: Is there any special reason make PlanAdvice be a final class? Would it > be better to make it an interface and we provide a default implementation? > My concern is some users may want have their own implementation for > PlanAdvice. But it may be overthinking. If you think it won't bring any > problem, I'm also fine with that. > > 3: Is there a way only show advice? For me, it seems the advice will be > more useful and the nodes may contains to many details. > > 4: I'm curious about what't the global advice will look like. Is it > possible to provide an example? > > > Best regards, > Yuxia > > ----- 原始邮件 ----- > 发件人: "Jane Chan" <qingyue....@gmail.com> > 收件人: "dev" <dev@flink.apache.org> > 发送时间: 星期一, 2022年 12 月 26日 下午 9:39:18 > 主题: [DISCUSS] FLIP-280: Introduce a new explain mode to provide SQL advice > > Hi, devs, > > I would like to start a discussion on FLIP-280: Introduce a new explain > mode to provide SQL advice[1]. > > Currently, Flink SQL EXPLAIN statement provides three details to display > the plan. However, there is a considerable gap between the current > explained result and the actual, applicable actions for users to improve > their queries. > > To provide more understandable, actionable advice closer to the user's > perspective, we propose a new explain mode that analyzes the physical plan > and attaches available tuning advice and data correctness warnings. > > EXPLAIN ANALYZED_PHYSICAL_PLAN <query> > > I've included more details in [1], and I look forward to your feedback. > > [1] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-280%3A+Support+EXPLAIN+SQL+statements+with+advice > [2] POC: https://github.com/LadyForest/flink/tree/FLIP-280 > > Best regards, > Jane Chan >