Re: [DISCUSS] FLIP-328: Allow source operators to determine isProcessingBacklog based on watermark lag

2023-09-09 Thread Jark Wu
Hi Xuannan,

I leave my comments inline.

> In the case where a user wants to
> use a CDC source and also determine backlog status based on watermark
> lag, we still need to define the rule when that occurs

This rule should be defined by the source itself (who knows backlog best),
not by the framework. In the case of CDC source, it reports isBacklog=true
during snapshot stage, and report isBacklog=false during changelog stage if
watermark-lag is within the threshold.

I think it's not intuitive to combine it with the logical OR operation.
Even for the
combination logic of backlog status from different channels, FLIP-309 said
it is
"up to the operator to determine its output records' isBacklog value" and
proposed
3 different strategies. Therefore, I think backlog status from a single
source should
be up to the source.

IMO, a better API design is not how to resolve conflicts but not
introducing conflicts.
Let the source determine backlog status removes the conflicts and I don't
see big
disadvantages.

> It should not confuse the user that
> DataStream#assignTimestampsAndWatermarks doesn't work with
> backlog.watermark-lag-threshold, as it is not a source.

Hmm, so this configuration may confuse Flink SQL users, because all
watermarks
are defined on the source DDL, but it may use a separate operator to emit
watermarks
if the source doesn't support emitting watermarks.


> I think the description in the FLIP actually means the other way
> around, where the job can never switch back to batch mode once it has
> switched into streaming mode. This is to align with the current state
> of FLIP-327[1], where only switching from batch to stream mode is
> supported.

This sounds like a limitation of FLIP-327 (that execution mode depends on
backlog status).
But the backlog status shouldn't have this limitation, because it is not
only used for execution
switching.

Best,
Jark



On Fri, 8 Sept 2023 at 19:09, Xuannan Su  wrote:

> Hi Jark and Leonard,
>
> Thanks for the comments. Please see my reply below.
>
> @Jark
>
> > I think a better API doesn't compete with itself. Therefore, I'm in
> favor of
> > supporting the watermark lag threshold for each source without
> introducing
> > any framework API and configuration.
>
> I don't think supporting the watermark lag threshold for each source
> can avoid the competition problem. In the case where a user wants to
> use a CDC source and also determine backlog status based on watermark
> lag, we still need to define the rule when that occurs. With that
> said, I think it is more intuitive to combine it with the logical OR
> operation, as the strategies (FLIP-309, FLIP-328) only determine when
> the source's backlog status should be True. What do you think?
>
> > Besides, this can address another concern that the watermark may be
> > generated by DataStream#assignTimestampsAnd
> > Watermarks which doesn't
> > work with the backlog.watermark-lag-threshold job config
>
> The description of the configuration explicitly states that "a source
> would report isProcessingBacklog=true if its watermark lag exceeds the
> configured value". It should not confuse the user that
> DataStream#assignTimestampsAndWatermarks doesn't work with
> backlog.watermark-lag-threshold, as it is not a source.
>
> > Does that mean the job can never back to streaming mode once switches
> into
> > backlog mode? It sounds like not a complete FLIP to me. Is it possible to
> > support switching back in this FLIP?
>
> I think the description in the FLIP actually means the other way
> around, where the job can never switch back to batch mode once it has
> switched into streaming mode. This is to align with the current state
> of FLIP-327[1], where only switching from batch to stream mode is
> supported.
>
> @Leonard
>
> > > The FLIP describe that: And it should report isProcessingBacklog=false
> at the beginning of the snapshot stage.
> > This should be “changelog stage”
>
> I think the description is in FLIP-309. Thanks for pointing out. I
> updated the description.
>
> > I'm not sure if it's enough to support this feature only in FLIP-27
> Source. Although we are pushing the sourceFunction API to be removed, these
> APIs will be survive one or two versions in flink repo before they are
> actually removed.
>
> I agree that it is good to support the SourceFunction API. However,
> given that the SourceFunction API is marked as deprecated, I think I
> will prioritize supporting the FLIP-27 Source. We can support the
> SourceFunction API after the
> FLIP-27 source. What do you think?
>
> Best regards,
> Xuannan
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-327%3A+Support+switching+from+batch+to+stream+mode+to+improve+throughput+when+processing+backlog+data
>
>
>
>
> On Fri, Sep 8, 2023 at 1:02 AM Leonard Xu  wrote:
> >
> > Thanks Xuannan for driving this FLIP !
> >
> > The proposal generally looks good to me, but I still left some comments:
> >
> > > One more question about the FLIP is that the 

Re: [DISCUSS] [FLINK-32873] Add a config to allow disabling Query hints

2023-09-09 Thread Jing Ge
Hi,

Thanks for bringing this to our attention. At the first glance, it looks
reasonable to offer a new configuration to enable/disable SQL hints
globally. However, IMHO, it is not the right timing to do it now, because
we should not only think as platform providers but also as end users(the
number of end users are much bigger than platform providers):

1. Users don't need it because users have the choice to use hints or not,
just like Jark pointed out. With this configuration, there will be a fight
between platform providers and users which will cause more confusions and
conflicts. And users will probably win, IMHO, because they are the end
customers that use Flink to create business values.
2. SQL hints could be considered as an additional feature for users to
control, to optimize the execution plan without touching the internal
logic, i.e. features for advanced use cases and i.e. don't use it if you
don't understand it.
3. Before the system is smart enough to take over(where we are now,
fortunately and unfortunately :-))), there should be a way for users to do
such tuning, even if it is a temporary phase from a
long-term's perspective, i.e. just because it is a temporary solution, does
not mean it is not necessary for now.
4. What if users write wrong hints? Well, the code review process is
recommended. Someone who truly understands hints should double check it
before hints are merged to the master or submitted to the production env.
Just like a common software development process.

Just my two cents.

Best regards,
Jing

On Thu, Sep 7, 2023 at 10:02 PM Bonnie Arogyam Varghese
 wrote:

> Hi Liu,
>  The default will be set to enabled which is the current behavior. The
> option will allow users/platform providers to disable it if they want to.
>
> On Wed, Sep 6, 2023 at 6:39 PM liu ron  wrote:
>
> > Hi, Boonie
> >
> > I'm with Jark on why disable hint is needed if it won't affect security.
> If
> > users don't need to use hint, then they won't care about it and I don't
> > think it's going to be a nuisance. On top of that, Lookup Join Hint is
> very
> > useful for streaming jobs, and disabling the hint would result in users
> not
> > being able to use it.
> >
> > Best,
> > Ron
> >
> > Bonnie Arogyam Varghese  于2023年9月6日周三
> > 23:52写道:
> >
> > > Hi Liu Ron,
> > >  To answer your question,
> > >Security might not be the main reason for disabling this option but
> > > other arguments brought forward by Timo. Let me know if you have any
> > > further questions or concerns.
> > >
> > > On Tue, Sep 5, 2023 at 9:35 PM Bonnie Arogyam Varghese <
> > > bvargh...@confluent.io> wrote:
> > >
> > > > It looks like it will be nice to have a config to disable hints. Any
> > > other
> > > > thoughts/concerns before we can close this discussion?
> > > >
> > > > On Fri, Aug 18, 2023 at 7:43 AM Timo Walther 
> > wrote:
> > > >
> > > >>  > lots of the streaming SQL syntax are extensions of SQL standard
> > > >>
> > > >> That is true. But hints are kind of a special case because they are
> > not
> > > >> even "part of Flink SQL" that's why they are written in a comment
> > > syntax.
> > > >>
> > > >> Anyway, I feel hints could be sometimes confusing for users because
> > most
> > > >> of them have no effect for streaming and long-term we could also set
> > > >> some hints via the CompiledPlan. And if you have multiple teams,
> > > >> non-skilled users should not play around with hints and leave the
> > > >> decision to the system that might become smarter over time.
> > > >>
> > > >> Regards,
> > > >> Timo
> > > >>
> > > >>
> > > >> On 17.08.23 18:47, liu ron wrote:
> > > >> > Hi, Bonnie
> > > >> >
> > > >> >> Options hints could be a security concern since users can
> override
> > > >> > settings.
> > > >> >
> > > >> > I think this still doesn't answer my question
> > > >> >
> > > >> > Best,
> > > >> > Ron
> > > >> >
> > > >> > Jark Wu  于2023年8月17日周四 19:51写道:
> > > >> >
> > > >> >> Sorry, I still don't understand why we need to disable the query
> > > hint.
> > > >> >> It doesn't have the security problems as options hint. Bonnie
> said
> > it
> > > >> >> could affect performance, but that depends on users using it
> > > >> explicitly.
> > > >> >> If there is any performance problem, users can remove the hint.
> > > >> >>
> > > >> >> If we want to disable query hint just because it's an extension
> to
> > > SQL
> > > >> >> standard.
> > > >> >> I'm afraid we have to introduce a bunch of configuration, because
> > > lots
> > > >> of
> > > >> >> the streaming SQL syntax are extensions of SQL standard.
> > > >> >>
> > > >> >> Best,
> > > >> >> Jark
> > > >> >>
> > > >> >> On Thu, 17 Aug 2023 at 15:43, Timo Walther 
> > > wrote:
> > > >> >>
> > > >> >>> +1 for this proposal.
> > > >> >>>
> > > >> >>> Not every data team would like to enable hints. Also because
> they
> > > are
> > > >> an
> > > >> >>> extension to the SQL standard. It might also be the case that
> > custom
> > > >> >>> rules would be overwritten otherwise. Se

Re: [DISCUSS] [FLINK-32873] Add a config to allow disabling Query hints

2023-09-09 Thread Jark Wu
I agree with Jing,

My biggest concern is this makes the boundary of adding an option very
unclear.
It's not a strong reason to add a config just because of it doesn't affect
existing
users. Does this mean that in the future we might add an option to disable
each feature?

Flink already has a very long list of configurations [1][2] and this is
very scary
and not easy to use. We should try to remove the unnecessary configuration
from
the list in Flink 2.0. However, from my perspective, adding this option
makes us far
away from this direction.

Best,
Jark

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/config/
[2]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/

On Sat, 9 Sept 2023 at 17:33, Jing Ge  wrote:

> Hi,
>
> Thanks for bringing this to our attention. At the first glance, it looks
> reasonable to offer a new configuration to enable/disable SQL hints
> globally. However, IMHO, it is not the right timing to do it now, because
> we should not only think as platform providers but also as end users(the
> number of end users are much bigger than platform providers):
>
> 1. Users don't need it because users have the choice to use hints or not,
> just like Jark pointed out. With this configuration, there will be a fight
> between platform providers and users which will cause more confusions and
> conflicts. And users will probably win, IMHO, because they are the end
> customers that use Flink to create business values.
> 2. SQL hints could be considered as an additional feature for users to
> control, to optimize the execution plan without touching the internal
> logic, i.e. features for advanced use cases and i.e. don't use it if you
> don't understand it.
> 3. Before the system is smart enough to take over(where we are now,
> fortunately and unfortunately :-))), there should be a way for users to do
> such tuning, even if it is a temporary phase from a
> long-term's perspective, i.e. just because it is a temporary solution, does
> not mean it is not necessary for now.
> 4. What if users write wrong hints? Well, the code review process is
> recommended. Someone who truly understands hints should double check it
> before hints are merged to the master or submitted to the production env.
> Just like a common software development process.
>
> Just my two cents.
>
> Best regards,
> Jing
>
> On Thu, Sep 7, 2023 at 10:02 PM Bonnie Arogyam Varghese
>  wrote:
>
> > Hi Liu,
> >  The default will be set to enabled which is the current behavior. The
> > option will allow users/platform providers to disable it if they want to.
> >
> > On Wed, Sep 6, 2023 at 6:39 PM liu ron  wrote:
> >
> > > Hi, Boonie
> > >
> > > I'm with Jark on why disable hint is needed if it won't affect
> security.
> > If
> > > users don't need to use hint, then they won't care about it and I don't
> > > think it's going to be a nuisance. On top of that, Lookup Join Hint is
> > very
> > > useful for streaming jobs, and disabling the hint would result in users
> > not
> > > being able to use it.
> > >
> > > Best,
> > > Ron
> > >
> > > Bonnie Arogyam Varghese  于2023年9月6日周三
> > > 23:52写道:
> > >
> > > > Hi Liu Ron,
> > > >  To answer your question,
> > > >Security might not be the main reason for disabling this option
> but
> > > > other arguments brought forward by Timo. Let me know if you have any
> > > > further questions or concerns.
> > > >
> > > > On Tue, Sep 5, 2023 at 9:35 PM Bonnie Arogyam Varghese <
> > > > bvargh...@confluent.io> wrote:
> > > >
> > > > > It looks like it will be nice to have a config to disable hints.
> Any
> > > > other
> > > > > thoughts/concerns before we can close this discussion?
> > > > >
> > > > > On Fri, Aug 18, 2023 at 7:43 AM Timo Walther 
> > > wrote:
> > > > >
> > > > >>  > lots of the streaming SQL syntax are extensions of SQL standard
> > > > >>
> > > > >> That is true. But hints are kind of a special case because they
> are
> > > not
> > > > >> even "part of Flink SQL" that's why they are written in a comment
> > > > syntax.
> > > > >>
> > > > >> Anyway, I feel hints could be sometimes confusing for users
> because
> > > most
> > > > >> of them have no effect for streaming and long-term we could also
> set
> > > > >> some hints via the CompiledPlan. And if you have multiple teams,
> > > > >> non-skilled users should not play around with hints and leave the
> > > > >> decision to the system that might become smarter over time.
> > > > >>
> > > > >> Regards,
> > > > >> Timo
> > > > >>
> > > > >>
> > > > >> On 17.08.23 18:47, liu ron wrote:
> > > > >> > Hi, Bonnie
> > > > >> >
> > > > >> >> Options hints could be a security concern since users can
> > override
> > > > >> > settings.
> > > > >> >
> > > > >> > I think this still doesn't answer my question
> > > > >> >
> > > > >> > Best,
> > > > >> > Ron
> > > > >> >
> > > > >> > Jark Wu  于2023年8月17日周四 19:51写道:
> > > > >> >
> > > > >> >> Sorry, I still don't understand why we need to disable the
> query
> >