RE: Re: [DISCUSS] FLIP-377: Support configuration to disable filter push down for Table/SQL Sources

Jiabao Sun Tue, 02 Jan 2024 22:44:05 -0800

Hi,

Thank you again for the discussion on this FLIP.
If there are no further comments, I plan to start a voting thread tomorrow.


Best,
Jiabao

On 2023/12/20 14:09:49 Jiabao Sun wrote:
> Hi,
> 
> Thank you to everyone for the discussion on this FLIP, 
> especially Becket for providing guidance that made it more reasonable. 
> 
> The FLIP document[1] has been updated with the recent discussed content. 
> Please take a look to double-check it when you have time.
> 
> If we can reach a consensus on this, I will open the voting thread in recent 
> days.
> 
> Best,
> Jiabao
> 
> [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=276105768
> 
> 
> > 2023年12月20日 11:38，Jiabao Sun <ji...@xtransfer.cn.INVALID> 写道：
> > 
> > Thanks Becket,
> > 
> > The behavior description has been added to the Public Interfaces section.
> > 
> > Best,
> > Jiabao
> > 
> > 
> >> 2023年12月20日 08:17，Becket Qin <be...@gmail.com> 写道：
> >> 
> >> Hi Jiabao,
> >> 
> >> Thanks for updating the FLIP.
> >> Can you add the behavior of the policies that are only applicable to some
> >> but not all of the databases? This is a part of the intended behavior of
> >> the proposed configuration. So, we should include that in the FLIP.
> >> Otherwise, the FLIP looks good to me.
> >> 
> >> Cheers,
> >> 
> >> Jiangjie (Becket) Qin
> >> 
> >> On Tue, Dec 19, 2023 at 11:00 PM Jiabao Sun <ji...@xtransfer.cn.invalid>
> >> wrote:
> >> 
> >>> Hi Becket,
> >>> 
> >>> I share the same view as you regarding the prefix for this configuration
> >>> option.
> >>> 
> >>> For the JDBC connector, I prefer setting 'filter.handling.policy' = 'FOO'
> >>> and throwing an exception when the database do not support that specific
> >>> policy.
> >>> 
> >>> Not using a prefix can reduce the learning curve for users and avoid
> >>> introducing a new set of configuration options for every supported JDBC
> >>> database.
> >>> I think the policies we provide can be compatible with most databases that
> >>> follow the JDBC protocol.
> >>> However, there may be cases where certain databases cannot support some
> >>> policies.
> >>> Nevertheless, we can ensure fast failure and allow users to choose a
> >>> suitable policy in such situations.
> >>> 
> >>> I have removed the contents about the configuration prefix.
> >>> Please help review it again.
> >>> 
> >>> Thanks,
> >>> Jiabao
> >>> 
> >>> 
> >>>> 2023年12月19日 19:46，Becket Qin <be...@gmail.com> 写道：
> >>>> 
> >>>> Hi Jiabao,
> >>>> 
> >>>> Thanks for updating the FLIP.
> >>>> 
> >>>> One more question regarding the JDBC connector, since it is a connector
> >>>> shared by multiple databases, what if there is a filter handling policy
> >>>> that is only applicable to one of the databases, but not the others? In
> >>>> that case, how would the users specify that policy?
> >>>> Unlike the example of orc format with 2nd+ level config, JDBC connector
> >>>> only looks at the URL to decide which driver to use.
> >>>> 
> >>>> For example, MySql supports policy FOO while other databases do not. If
> >>>> users want to use FOO for MySql, what should they do? Will they set
> >>>> '*mysql.filter.hanlding.policy'
> >>>> = 'FOO', *which will only be picked up when the MySql driver is used?
> >>>> Or they should just set* 'filter.handling.policy' = 'FOO', *and throw
> >>>> exceptions when other JDBC drivers are used? Personally, I prefer the
> >>>> latter. If we pick that, do we still need to mention the following?
> >>>> 
> >>>> *The prefix is needed when the option is for a 2nd+ level. *
> >>>>> 'connector' = 'filesystem',
> >>>>> 'format' = 'orc',
> >>>>> 'orc.filter.handling.policy' = 'NUBERIC_ONY'
> >>>>> 
> >>>>> *In this case, the values of this configuration may be different
> >>> depending
> >>>>> on the format option. For example, orc format may have INDEXED_ONLY
> >>> while
> >>>>> parquet format may have something else. *
> >>>>> 
> >>>> 
> >>>> I found this is somewhat misleading, because the example here is not a
> >>> part
> >>>> of the proposal of this FLIP. It is just an example explaining when a
> >>>> prefix is needed, which seems orthogonal to the proposal in this FLIP.
> >>>> 
> >>>> Thanks,
> >>>> 
> >>>> Jiangjie (Becket) Qin
> >>>> 
> >>>> 
> >>>> On Tue, Dec 19, 2023 at 10:09 AM Jiabao Sun <jiabao....@xtransfer.cn
> >>> .invalid>
> >>>> wrote:
> >>>> 
> >>>>> Thanks Becket for the suggestions,
> >>>>> 
> >>>>> Updated.
> >>>>> Please help review it again when you have time.
> >>>>> 
> >>>>> Best,
> >>>>> Jiabao
> >>>>> 
> >>>>> 
> >>>>>> 2023年12月19日 09:06，Becket Qin <be...@gmail.com> 写道：
> >>>>>> 
> >>>>>> Hi JIabao,
> >>>>>> 
> >>>>>> Thanks for updating the FLIP. It looks better. Some suggestions /
> >>>>> questions:
> >>>>>> 
> >>>>>> 1. In the motivation section:
> >>>>>> 
> >>>>>>> *Currently, Flink Table/SQL does not expose fine-grained control for
> >>>>> users
> >>>>>>> to control filter pushdown. **However, filter pushdown has some side
> >>>>>>> effects, such as additional computational pressure on external
> >>>>>>> systems. Moreover, Improper queries can lead to issues such as full
> >>>>> table
> >>>>>>> scans, which in turn can impact the stability of external systems.*
> >>>>>> 
> >>>>>> This statement sounds like the side effects are there for all the
> >>>>> systems,
> >>>>>> which is inaccurate. Maybe we can say:
> >>>>>> *Currently, Flink Table/SQL does not expose fine-grained control for
> >>>>> users
> >>>>>> to control filter pushdown. **However, filter pushdown may have side
> >>>>>> effects in some cases, **such as additional computational pressure on
> >>>>>> external systems. The JDBC source is a typical example of that. If a
> >>>>> filter
> >>>>>> is pushed down to the database, an expensive full table scan may happen
> >>>>> if
> >>>>>> the filter involves unindexed columns.*
> >>>>>> 
> >>>>>> 2. Regarding the prefix, usually a prefix is not required for the top
> >>>>> level
> >>>>>> connector options. This is because the *connector* option is already
> >>>>> there.
> >>>>>> So
> >>>>>> 'connector' = 'jdbc',
> >>>>>> 'filter.handling.policy' = 'ALWAYS'
> >>>>>> is sufficient.
> >>>>>> 
> >>>>>> The prefix is needed when the option is for a 2nd+ level. For example,
> >>>>>> 'connector' = 'jdbc',
> >>>>>> 'format' = 'orc',
> >>>>>> 'orc.some.option' = 'some_value'
> >>>>>> In this case, the prefix of "orc" is needed to make it clear this
> >>> option
> >>>>> is
> >>>>>> for the format.
> >>>>>> 
> >>>>>> I am guessing that the reason that currently the connector prefix is
> >>>>> there
> >>>>>> is because the values of this configuration may be different depending
> >>> on
> >>>>>> the connectors. For example, jdbc may have INDEXED_ONLY while MongoDB
> >>> may
> >>>>>> have something else. Personally speaking, I am fine if we do not have a
> >>>>>> prefix in this case because users have already specified the connector
> >>>>> type
> >>>>>> and it is intuitive enough that the option value is for that connector,
> >>>>> not
> >>>>>> others.
> >>>>>> 
> >>>>>> 3. can we clarify on the following statement:
> >>>>>> 
> >>>>>>> *Introduce the native configuration [prefix].filter.handling.policy in
> >>>>> the
> >>>>>>> connector.*
> >>>>>> 
> >>>>>> What do you mean by "native configuration"? From what I understand, the
> >>>>>> FLIP does the following:
> >>>>>> - introduces a new configuration to the JDBC and MongoDB connector.
> >>>>>> - Suggests a convention option name if other connectors are going to
> >>> add
> >>>>> an
> >>>>>> option for the same purpose.
> >>>>>> 
> >>>>>> Thanks,
> >>>>>> 
> >>>>>> Jiangjie (Becket) Qin
> >>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>> On Mon, Dec 18, 2023 at 5:45 PM Jiabao Sun <jiabao....@xtransfer.cn
> >>>>> .invalid>
> >>>>>> wrote:
> >>>>>> 
> >>>>>>> Hi Becket,
> >>>>>>> 
> >>>>>>> The FLIP document[1] has been updated.
> >>>>>>> Could you help take a look again?
> >>>>>>> 
> >>>>>>> Thanks,
> >>>>>>> Jiabao
> >>>>>>> 
> >>>>>>> [1]
> >>>>>>> 
> >>>>> 
> >>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=276105768
> >>>>>>> 
> >>>>>>> 
> >>>>>>>> 2023年12月18日 16:53，Becket Qin <be...@gmail.com> 写道：
> >>>>>>>> 
> >>>>>>>> Yes, that sounds reasonable to me. We can start with ALWAYS and
> >>> NEVER,
> >>>>>>> and
> >>>>>>>> add more policies as needed.
> >>>>>>>> 
> >>>>>>>> Thanks,
> >>>>>>>> 
> >>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>> 
> >>>>>>>> On Mon, Dec 18, 2023 at 4:48 PM Jiabao Sun <jiabao....@xtransfer.cn
> >>>>>>> .invalid>
> >>>>>>>> wrote:
> >>>>>>>> 
> >>>>>>>>> Thanks Bucket,
> >>>>>>>>> 
> >>>>>>>>> The jdbc.filter.handling.policy is good to me as it provides
> >>>>> sufficient
> >>>>>>>>> extensibility for future filter pushdown optimizations.
> >>>>>>>>> However, currently, we don't have an implementation for the AUTO
> >>> mode,
> >>>>>>> and
> >>>>>>>>> it seems that the AUTO mode can easily be confused with the ALWAYS
> >>>>> mode
> >>>>>>>>> because users don't have the opportunity to MANUALLY decide which
> >>>>>>> filters
> >>>>>>>>> to push down.
> >>>>>>>>> 
> >>>>>>>>> I suggest that we only introduce the ALWAYS and NEVER modes for now,
> >>>>> and
> >>>>>>>>> we can consider introducing more flexible policies in the future,
> >>>>>>>>> such as INDEX_ONLY, NUMBERIC_ONLY and so on.
> >>>>>>>>> 
> >>>>>>>>> WDYT?
> >>>>>>>>> 
> >>>>>>>>> Best,
> >>>>>>>>> Jiabao
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>>> 2023年12月18日 16:27，Becket Qin <be...@gmail.com> 写道：
> >>>>>>>>>> 
> >>>>>>>>>> Hi Jiabao,
> >>>>>>>>>> 
> >>>>>>>>>> Please see the reply inline.
> >>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>>> The MySQL connector is currently in the flink-connector-jdbc
> >>>>>>> repository
> >>>>>>>>>>> and is not a standalone connector.
> >>>>>>>>>>> Is it too unique to use "mysql" as the configuration option
> >>> prefix?
> >>>>>>>>>> 
> >>>>>>>>>> If the intended behavior makes sense to all the supported JDBC
> >>>>> drivers,
> >>>>>>>>> we
> >>>>>>>>>> can make this a JDBC connector configuration.
> >>>>>>>>>> 
> >>>>>>>>>> Also, I would like to ask about the difference in behavior between
> >>>>> AUTO
> >>>>>>>>> and
> >>>>>>>>>>> ALWAYS.
> >>>>>>>>>>> It seems that we cannot guarantee the pushing down of all filters
> >>> to
> >>>>>>> the
> >>>>>>>>>>> external system under the ALWAYS
> >>>>>>>>>>> mode because not all filters in Flink SQL are supported by the
> >>>>>>> external
> >>>>>>>>>>> system.
> >>>>>>>>>>> Should we throw an error when encountering a filter that cannot be
> >>>>>>>>> pushed
> >>>>>>>>>>> down in the ALWAYS mode?
> >>>>>>>>>> 
> >>>>>>>>>> The idea of AUTO is to do efficiency-aware pushdowns. The source
> >>> will
> >>>>>>>>> query
> >>>>>>>>>> the external system (MySQL, Oracle, SQL Server, etc) first to
> >>>>> retrieve
> >>>>>>>>> the
> >>>>>>>>>> information of the table. With that information, the source will
> >>>>> decide
> >>>>>>>>>> whether to further push a filter to the external system based on
> >>> the
> >>>>>>>>>> efficiency. E.g. only push the indexed fields. In contrast, ALWAYS
> >>>>> will
> >>>>>>>>>> just always push the supported filters to the external system,
> >>>>>>> regardless
> >>>>>>>>>> of the efficiency. In case there are filters that are not
> >>> supported,
> >>>>>>>>>> according to the current contract of SupportsFilterPushdown, these
> >>>>>>>>>> unsupported filters should just be returned by the
> >>>>>>>>>> *SupportsFilterPushdown.applyFilters()* method as remaining
> >>> filters.
> >>>>>>>>>> Therefore, there is no need to throw exceptions here. This is
> >>> likely
> >>>>>>> the
> >>>>>>>>>> desired behavior for most users, IMO. If there are cases that users
> >>>>>>>>> really
> >>>>>>>>>> want to get alerted when a filter cannot be pushed to the external
> >>>>>>>>> system,
> >>>>>>>>>> we can add another value like "ENFORCED_ALWAYS", which behaves like
> >>>>>>>>> ALWAYS,
> >>>>>>>>>> but throws exceptions when a filter cannot be applied to the
> >>> external
> >>>>>>>>>> system. But personally I don't see much value in doing this.
> >>>>>>>>>> 
> >>>>>>>>>> Thanks,
> >>>>>>>>>> 
> >>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>> On Mon, Dec 18, 2023 at 3:54 PM Jiabao Sun <
> >>> jiabao....@xtransfer.cn
> >>>>>>>>> .invalid>
> >>>>>>>>>> wrote:
> >>>>>>>>>> 
> >>>>>>>>>>> Hi Becket,
> >>>>>>>>>>> 
> >>>>>>>>>>> The MySQL connector is currently in the flink-connector-jdbc
> >>>>>>> repository
> >>>>>>>>>>> and is not a standalone connector.
> >>>>>>>>>>> Is it too unique to use "mysql" as the configuration option
> >>> prefix?
> >>>>>>>>>>> 
> >>>>>>>>>>> Also, I would like to ask about the difference in behavior between
> >>>>>>> AUTO
> >>>>>>>>>>> and ALWAYS.
> >>>>>>>>>>> It seems that we cannot guarantee the pushing down of all filters
> >>> to
> >>>>>>> the
> >>>>>>>>>>> external system under the ALWAYS
> >>>>>>>>>>> mode because not all filters in Flink SQL are supported by the
> >>>>>>> external
> >>>>>>>>>>> system.
> >>>>>>>>>>> Should we throw an error when encountering a filter that cannot be
> >>>>>>>>> pushed
> >>>>>>>>>>> down in the ALWAYS mode?
> >>>>>>>>>>> 
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Jiabao
> >>>>>>>>>>> 
> >>>>>>>>>>>> 2023年12月18日 15:34，Becket Qin <be...@gmail.com> 写道：
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Hi JIabao,
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Thanks for updating the FLIP. Maybe I did not explain it clearly
> >>>>>>>>> enough.
> >>>>>>>>>>> My
> >>>>>>>>>>>> point is that given there are various good flavors of behaviors
> >>>>>>>>> handling
> >>>>>>>>>>>> filters pushed down, we should not have a common config of
> >>>>>>>>>>>> "ignore.filter.pushdown", because the behavior is not *common*.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> It looks like the original motivation of this FLIP is just for
> >>>>> MySql.
> >>>>>>>>>>> Let's
> >>>>>>>>>>>> focus on what is the best solution for MySql connector here
> >>> first.
> >>>>>>>>> After
> >>>>>>>>>>>> that, if people think the best behavior for MySql happens to be a
> >>>>>>>>> common
> >>>>>>>>>>>> one, we can then discuss whether that is worth being added to the
> >>>>>>> base
> >>>>>>>>>>>> implementation of source. For MySQL, if we are going to
> >>> introduce a
> >>>>>>>>>>> config
> >>>>>>>>>>>> to MySql, why not have something like
> >>>>> "mysql.filter.handling.policy"
> >>>>>>>>> with
> >>>>>>>>>>>> value of AUTO / NEVER / ALWAYS? Isn't that better than
> >>>>>>>>>>>> "ignore.filter.pushdown"?
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Jiangjie (Becket) Qin
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 
> >>>>>>>>>>>> On Sun, Dec 17, 2023 at 11:30 PM Jiabao Sun <
> >>>>> jiabao....@xtransfer.cn
> >>>>>>>>>>> .invalid>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>> 
> >>>>>>>>>>>>> Hi Becket,
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> The FLIP document has been updated as well.
> >>>>>>>>>>>>> Please take a look when you have time.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>> Jiabao
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>>> 2023年12月17日 22:54，Jiabao Sun <ji...@xtransfer.cn.INVALID>
> >>>>> 写道：
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Thanks Becket,
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> I apologize for not being able to continue with this proposal
> >>> due
> >>>>>>> to
> >>>>>>>>>>>>> being too busy during this period.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> The viewpoints you shared about the design of Flink Source make
> >>>>>>> sense
> >>>>>>>>>>> to
> >>>>>>>>>>>>> me
> >>>>>>>>>>>>>> The native configuration ‘ignore.filter.pushdown’ is good to
> >>> me.
> >>>>>>>>>>>>>> Having a unified name or naming style can indeed prevent
> >>>>> confusion
> >>>>>>>>> for
> >>>>>>>>>>>>> users regarding
> >>>>>>>>>>>>>> the inconsistent naming of this configuration across different
> >>>>>>>>>>>>> connectors.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Currently, there are not many external connectors that support
> >>>>>>> filter
> >>>>>>>>>>>>> pushdown.
> >>>>>>>>>>>>>> I propose that we first introduce it in flink-connector-jdbc
> >>> and
> >>>>>>>>>>>>> flink-connector-mongodb.
> >>>>>>>>>>>>>> Do you think this is feasible?
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Best,
> >>>>>>>
[message truncated...]

RE: Re: [DISCUSS] FLIP-377: Support configuration to disable filter push down for Table/SQL Sources

Reply via email to