Hi, Thank you again for the discussion on this FLIP. If there are no further comments, I plan to start a voting thread tomorrow.
Best, Jiabao On 2023/12/20 14:09:49 Jiabao Sun wrote: > Hi, > > Thank you to everyone for the discussion on this FLIP, > especially Becket for providing guidance that made it more reasonable. > > The FLIP document[1] has been updated with the recent discussed content. > Please take a look to double-check it when you have time. > > If we can reach a consensus on this, I will open the voting thread in recent > days. > > Best, > Jiabao > > [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=276105768 > > > > 2023年12月20日 11:38,Jiabao Sun <ji...@xtransfer.cn.INVALID> 写道: > > > > Thanks Becket, > > > > The behavior description has been added to the Public Interfaces section. > > > > Best, > > Jiabao > > > > > >> 2023年12月20日 08:17,Becket Qin <be...@gmail.com> 写道: > >> > >> Hi Jiabao, > >> > >> Thanks for updating the FLIP. > >> Can you add the behavior of the policies that are only applicable to some > >> but not all of the databases? This is a part of the intended behavior of > >> the proposed configuration. So, we should include that in the FLIP. > >> Otherwise, the FLIP looks good to me. > >> > >> Cheers, > >> > >> Jiangjie (Becket) Qin > >> > >> On Tue, Dec 19, 2023 at 11:00 PM Jiabao Sun <ji...@xtransfer.cn.invalid> > >> wrote: > >> > >>> Hi Becket, > >>> > >>> I share the same view as you regarding the prefix for this configuration > >>> option. > >>> > >>> For the JDBC connector, I prefer setting 'filter.handling.policy' = 'FOO' > >>> and throwing an exception when the database do not support that specific > >>> policy. > >>> > >>> Not using a prefix can reduce the learning curve for users and avoid > >>> introducing a new set of configuration options for every supported JDBC > >>> database. > >>> I think the policies we provide can be compatible with most databases that > >>> follow the JDBC protocol. > >>> However, there may be cases where certain databases cannot support some > >>> policies. > >>> Nevertheless, we can ensure fast failure and allow users to choose a > >>> suitable policy in such situations. > >>> > >>> I have removed the contents about the configuration prefix. > >>> Please help review it again. > >>> > >>> Thanks, > >>> Jiabao > >>> > >>> > >>>> 2023年12月19日 19:46,Becket Qin <be...@gmail.com> 写道: > >>>> > >>>> Hi Jiabao, > >>>> > >>>> Thanks for updating the FLIP. > >>>> > >>>> One more question regarding the JDBC connector, since it is a connector > >>>> shared by multiple databases, what if there is a filter handling policy > >>>> that is only applicable to one of the databases, but not the others? In > >>>> that case, how would the users specify that policy? > >>>> Unlike the example of orc format with 2nd+ level config, JDBC connector > >>>> only looks at the URL to decide which driver to use. > >>>> > >>>> For example, MySql supports policy FOO while other databases do not. If > >>>> users want to use FOO for MySql, what should they do? Will they set > >>>> '*mysql.filter.hanlding.policy' > >>>> = 'FOO', *which will only be picked up when the MySql driver is used? > >>>> Or they should just set* 'filter.handling.policy' = 'FOO', *and throw > >>>> exceptions when other JDBC drivers are used? Personally, I prefer the > >>>> latter. If we pick that, do we still need to mention the following? > >>>> > >>>> *The prefix is needed when the option is for a 2nd+ level. * > >>>>> 'connector' = 'filesystem', > >>>>> 'format' = 'orc', > >>>>> 'orc.filter.handling.policy' = 'NUBERIC_ONY' > >>>>> > >>>>> *In this case, the values of this configuration may be different > >>> depending > >>>>> on the format option. For example, orc format may have INDEXED_ONLY > >>> while > >>>>> parquet format may have something else. * > >>>>> > >>>> > >>>> I found this is somewhat misleading, because the example here is not a > >>> part > >>>> of the proposal of this FLIP. It is just an example explaining when a > >>>> prefix is needed, which seems orthogonal to the proposal in this FLIP. > >>>> > >>>> Thanks, > >>>> > >>>> Jiangjie (Becket) Qin > >>>> > >>>> > >>>> On Tue, Dec 19, 2023 at 10:09 AM Jiabao Sun <jiabao....@xtransfer.cn > >>> .invalid> > >>>> wrote: > >>>> > >>>>> Thanks Becket for the suggestions, > >>>>> > >>>>> Updated. > >>>>> Please help review it again when you have time. > >>>>> > >>>>> Best, > >>>>> Jiabao > >>>>> > >>>>> > >>>>>> 2023年12月19日 09:06,Becket Qin <be...@gmail.com> 写道: > >>>>>> > >>>>>> Hi JIabao, > >>>>>> > >>>>>> Thanks for updating the FLIP. It looks better. Some suggestions / > >>>>> questions: > >>>>>> > >>>>>> 1. In the motivation section: > >>>>>> > >>>>>>> *Currently, Flink Table/SQL does not expose fine-grained control for > >>>>> users > >>>>>>> to control filter pushdown. **However, filter pushdown has some side > >>>>>>> effects, such as additional computational pressure on external > >>>>>>> systems. Moreover, Improper queries can lead to issues such as full > >>>>> table > >>>>>>> scans, which in turn can impact the stability of external systems.* > >>>>>> > >>>>>> This statement sounds like the side effects are there for all the > >>>>> systems, > >>>>>> which is inaccurate. Maybe we can say: > >>>>>> *Currently, Flink Table/SQL does not expose fine-grained control for > >>>>> users > >>>>>> to control filter pushdown. **However, filter pushdown may have side > >>>>>> effects in some cases, **such as additional computational pressure on > >>>>>> external systems. The JDBC source is a typical example of that. If a > >>>>> filter > >>>>>> is pushed down to the database, an expensive full table scan may happen > >>>>> if > >>>>>> the filter involves unindexed columns.* > >>>>>> > >>>>>> 2. Regarding the prefix, usually a prefix is not required for the top > >>>>> level > >>>>>> connector options. This is because the *connector* option is already > >>>>> there. > >>>>>> So > >>>>>> 'connector' = 'jdbc', > >>>>>> 'filter.handling.policy' = 'ALWAYS' > >>>>>> is sufficient. > >>>>>> > >>>>>> The prefix is needed when the option is for a 2nd+ level. For example, > >>>>>> 'connector' = 'jdbc', > >>>>>> 'format' = 'orc', > >>>>>> 'orc.some.option' = 'some_value' > >>>>>> In this case, the prefix of "orc" is needed to make it clear this > >>> option > >>>>> is > >>>>>> for the format. > >>>>>> > >>>>>> I am guessing that the reason that currently the connector prefix is > >>>>> there > >>>>>> is because the values of this configuration may be different depending > >>> on > >>>>>> the connectors. For example, jdbc may have INDEXED_ONLY while MongoDB > >>> may > >>>>>> have something else. Personally speaking, I am fine if we do not have a > >>>>>> prefix in this case because users have already specified the connector > >>>>> type > >>>>>> and it is intuitive enough that the option value is for that connector, > >>>>> not > >>>>>> others. > >>>>>> > >>>>>> 3. can we clarify on the following statement: > >>>>>> > >>>>>>> *Introduce the native configuration [prefix].filter.handling.policy in > >>>>> the > >>>>>>> connector.* > >>>>>> > >>>>>> What do you mean by "native configuration"? From what I understand, the > >>>>>> FLIP does the following: > >>>>>> - introduces a new configuration to the JDBC and MongoDB connector. > >>>>>> - Suggests a convention option name if other connectors are going to > >>> add > >>>>> an > >>>>>> option for the same purpose. > >>>>>> > >>>>>> Thanks, > >>>>>> > >>>>>> Jiangjie (Becket) Qin > >>>>>> > >>>>>> > >>>>>> > >>>>>> On Mon, Dec 18, 2023 at 5:45 PM Jiabao Sun <jiabao....@xtransfer.cn > >>>>> .invalid> > >>>>>> wrote: > >>>>>> > >>>>>>> Hi Becket, > >>>>>>> > >>>>>>> The FLIP document[1] has been updated. > >>>>>>> Could you help take a look again? > >>>>>>> > >>>>>>> Thanks, > >>>>>>> Jiabao > >>>>>>> > >>>>>>> [1] > >>>>>>> > >>>>> > >>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=276105768 > >>>>>>> > >>>>>>> > >>>>>>>> 2023年12月18日 16:53,Becket Qin <be...@gmail.com> 写道: > >>>>>>>> > >>>>>>>> Yes, that sounds reasonable to me. We can start with ALWAYS and > >>> NEVER, > >>>>>>> and > >>>>>>>> add more policies as needed. > >>>>>>>> > >>>>>>>> Thanks, > >>>>>>>> > >>>>>>>> Jiangjie (Becket) Qin > >>>>>>>> > >>>>>>>> On Mon, Dec 18, 2023 at 4:48 PM Jiabao Sun <jiabao....@xtransfer.cn > >>>>>>> .invalid> > >>>>>>>> wrote: > >>>>>>>> > >>>>>>>>> Thanks Bucket, > >>>>>>>>> > >>>>>>>>> The jdbc.filter.handling.policy is good to me as it provides > >>>>> sufficient > >>>>>>>>> extensibility for future filter pushdown optimizations. > >>>>>>>>> However, currently, we don't have an implementation for the AUTO > >>> mode, > >>>>>>> and > >>>>>>>>> it seems that the AUTO mode can easily be confused with the ALWAYS > >>>>> mode > >>>>>>>>> because users don't have the opportunity to MANUALLY decide which > >>>>>>> filters > >>>>>>>>> to push down. > >>>>>>>>> > >>>>>>>>> I suggest that we only introduce the ALWAYS and NEVER modes for now, > >>>>> and > >>>>>>>>> we can consider introducing more flexible policies in the future, > >>>>>>>>> such as INDEX_ONLY, NUMBERIC_ONLY and so on. > >>>>>>>>> > >>>>>>>>> WDYT? > >>>>>>>>> > >>>>>>>>> Best, > >>>>>>>>> Jiabao > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>>> 2023年12月18日 16:27,Becket Qin <be...@gmail.com> 写道: > >>>>>>>>>> > >>>>>>>>>> Hi Jiabao, > >>>>>>>>>> > >>>>>>>>>> Please see the reply inline. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>> The MySQL connector is currently in the flink-connector-jdbc > >>>>>>> repository > >>>>>>>>>>> and is not a standalone connector. > >>>>>>>>>>> Is it too unique to use "mysql" as the configuration option > >>> prefix? > >>>>>>>>>> > >>>>>>>>>> If the intended behavior makes sense to all the supported JDBC > >>>>> drivers, > >>>>>>>>> we > >>>>>>>>>> can make this a JDBC connector configuration. > >>>>>>>>>> > >>>>>>>>>> Also, I would like to ask about the difference in behavior between > >>>>> AUTO > >>>>>>>>> and > >>>>>>>>>>> ALWAYS. > >>>>>>>>>>> It seems that we cannot guarantee the pushing down of all filters > >>> to > >>>>>>> the > >>>>>>>>>>> external system under the ALWAYS > >>>>>>>>>>> mode because not all filters in Flink SQL are supported by the > >>>>>>> external > >>>>>>>>>>> system. > >>>>>>>>>>> Should we throw an error when encountering a filter that cannot be > >>>>>>>>> pushed > >>>>>>>>>>> down in the ALWAYS mode? > >>>>>>>>>> > >>>>>>>>>> The idea of AUTO is to do efficiency-aware pushdowns. The source > >>> will > >>>>>>>>> query > >>>>>>>>>> the external system (MySQL, Oracle, SQL Server, etc) first to > >>>>> retrieve > >>>>>>>>> the > >>>>>>>>>> information of the table. With that information, the source will > >>>>> decide > >>>>>>>>>> whether to further push a filter to the external system based on > >>> the > >>>>>>>>>> efficiency. E.g. only push the indexed fields. In contrast, ALWAYS > >>>>> will > >>>>>>>>>> just always push the supported filters to the external system, > >>>>>>> regardless > >>>>>>>>>> of the efficiency. In case there are filters that are not > >>> supported, > >>>>>>>>>> according to the current contract of SupportsFilterPushdown, these > >>>>>>>>>> unsupported filters should just be returned by the > >>>>>>>>>> *SupportsFilterPushdown.applyFilters()* method as remaining > >>> filters. > >>>>>>>>>> Therefore, there is no need to throw exceptions here. This is > >>> likely > >>>>>>> the > >>>>>>>>>> desired behavior for most users, IMO. If there are cases that users > >>>>>>>>> really > >>>>>>>>>> want to get alerted when a filter cannot be pushed to the external > >>>>>>>>> system, > >>>>>>>>>> we can add another value like "ENFORCED_ALWAYS", which behaves like > >>>>>>>>> ALWAYS, > >>>>>>>>>> but throws exceptions when a filter cannot be applied to the > >>> external > >>>>>>>>>> system. But personally I don't see much value in doing this. > >>>>>>>>>> > >>>>>>>>>> Thanks, > >>>>>>>>>> > >>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On Mon, Dec 18, 2023 at 3:54 PM Jiabao Sun < > >>> jiabao....@xtransfer.cn > >>>>>>>>> .invalid> > >>>>>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>>> Hi Becket, > >>>>>>>>>>> > >>>>>>>>>>> The MySQL connector is currently in the flink-connector-jdbc > >>>>>>> repository > >>>>>>>>>>> and is not a standalone connector. > >>>>>>>>>>> Is it too unique to use "mysql" as the configuration option > >>> prefix? > >>>>>>>>>>> > >>>>>>>>>>> Also, I would like to ask about the difference in behavior between > >>>>>>> AUTO > >>>>>>>>>>> and ALWAYS. > >>>>>>>>>>> It seems that we cannot guarantee the pushing down of all filters > >>> to > >>>>>>> the > >>>>>>>>>>> external system under the ALWAYS > >>>>>>>>>>> mode because not all filters in Flink SQL are supported by the > >>>>>>> external > >>>>>>>>>>> system. > >>>>>>>>>>> Should we throw an error when encountering a filter that cannot be > >>>>>>>>> pushed > >>>>>>>>>>> down in the ALWAYS mode? > >>>>>>>>>>> > >>>>>>>>>>> Thanks, > >>>>>>>>>>> Jiabao > >>>>>>>>>>> > >>>>>>>>>>>> 2023年12月18日 15:34,Becket Qin <be...@gmail.com> 写道: > >>>>>>>>>>>> > >>>>>>>>>>>> Hi JIabao, > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks for updating the FLIP. Maybe I did not explain it clearly > >>>>>>>>> enough. > >>>>>>>>>>> My > >>>>>>>>>>>> point is that given there are various good flavors of behaviors > >>>>>>>>> handling > >>>>>>>>>>>> filters pushed down, we should not have a common config of > >>>>>>>>>>>> "ignore.filter.pushdown", because the behavior is not *common*. > >>>>>>>>>>>> > >>>>>>>>>>>> It looks like the original motivation of this FLIP is just for > >>>>> MySql. > >>>>>>>>>>> Let's > >>>>>>>>>>>> focus on what is the best solution for MySql connector here > >>> first. > >>>>>>>>> After > >>>>>>>>>>>> that, if people think the best behavior for MySql happens to be a > >>>>>>>>> common > >>>>>>>>>>>> one, we can then discuss whether that is worth being added to the > >>>>>>> base > >>>>>>>>>>>> implementation of source. For MySQL, if we are going to > >>> introduce a > >>>>>>>>>>> config > >>>>>>>>>>>> to MySql, why not have something like > >>>>> "mysql.filter.handling.policy" > >>>>>>>>> with > >>>>>>>>>>>> value of AUTO / NEVER / ALWAYS? Isn't that better than > >>>>>>>>>>>> "ignore.filter.pushdown"? > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks, > >>>>>>>>>>>> > >>>>>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> On Sun, Dec 17, 2023 at 11:30 PM Jiabao Sun < > >>>>> jiabao....@xtransfer.cn > >>>>>>>>>>> .invalid> > >>>>>>>>>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> Hi Becket, > >>>>>>>>>>>>> > >>>>>>>>>>>>> The FLIP document has been updated as well. > >>>>>>>>>>>>> Please take a look when you have time. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>> Jiabao > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>>> 2023年12月17日 22:54,Jiabao Sun <ji...@xtransfer.cn.INVALID> > >>>>> 写道: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks Becket, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I apologize for not being able to continue with this proposal > >>> due > >>>>>>> to > >>>>>>>>>>>>> being too busy during this period. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> The viewpoints you shared about the design of Flink Source make > >>>>>>> sense > >>>>>>>>>>> to > >>>>>>>>>>>>> me > >>>>>>>>>>>>>> The native configuration ‘ignore.filter.pushdown’ is good to > >>> me. > >>>>>>>>>>>>>> Having a unified name or naming style can indeed prevent > >>>>> confusion > >>>>>>>>> for > >>>>>>>>>>>>> users regarding > >>>>>>>>>>>>>> the inconsistent naming of this configuration across different > >>>>>>>>>>>>> connectors. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Currently, there are not many external connectors that support > >>>>>>> filter > >>>>>>>>>>>>> pushdown. > >>>>>>>>>>>>>> I propose that we first introduce it in flink-connector-jdbc > >>> and > >>>>>>>>>>>>> flink-connector-mongodb. > >>>>>>>>>>>>>> Do you think this is feasible? > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Best, > >>>>>>> [message truncated...]