Re: [ANNOUNCE] Performance Daily Monitoring Moved from Ververica to Apache Flink Slack Channel

2023-06-26 Thread 柳尘
Thanks, Yanfei, for driving this and making the performance monitoring
publicly available.

I'd like to join as a maintainer.
Looking forward to the workflow.

Best
xingyuan




yanfei lei wrote on Wed, Oct 26, 2022 at 11:33:

> Hi everyone,
>
> As discussed earlier, we plan to create a benchmark channel in Apache Flink
> slack[1], but the plan was shelved for a while[2]. So I went on with this
> work, and created the #flink-dev-benchmarks channel for performance
> regression notifications.
>
> We have a regression report script[3] that runs daily, and a notification
> would be sent to the slack channel when the last few benchmark results are
> significantly worse than the baseline.
> Note that regressions are detected by a simple script, which may produce
> false positives and false negatives. Also, all benchmarks are executed on a
> single physical machine[4] provided by Ververica (Alibaba)[5], so hardware
> issues can affect the results, as in "[FLINK-18614] Performance regression
> 2020.07.13"[6].
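For illustration only, a minimal sketch of the kind of baseline check such a
script performs (threshold and names are assumptions; the real logic lives in
regression_report.py[3]):

    import java.util.List;

    class RegressionCheckSketch {
        /** Flags a regression when the mean of the last few results drops well below baseline. */
        static boolean isRegression(List<Double> lastResults, double baselineScore) {
            double mean = lastResults.stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(baselineScore);
            // Assumed threshold: alert when throughput falls more than 10% below baseline.
            return mean < 0.9 * baselineScore;
        }
    }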
>
> After the migration, we need a procedure to jointly watch over the
> performance of the Flink codebase. For example, if a regression occurs, the
> cause needs to be investigated and the problem resolved. In the past, this
> procedure was maintained internally within Ververica, but we think making it
> public would benefit all. I volunteer to serve as one of the initial
> maintainers, and would be glad if more contributors join me. I'd also
> prepare some guidelines to help others get familiar with the workflow. I
> will start a new thread to discuss the workflow soon.
>
>
> [1] https://www.mail-archive.com/dev@flink.apache.org/msg58666.html
> [2] https://issues.apache.org/jira/browse/FLINK-28468
> [3]
> https://github.com/apache/flink-benchmarks/blob/master/regression_report.py
> [4] http://codespeed.dak8s.net:8080
> [5] https://lists.apache.org/thread/jzljp4233799vwwqnr0vc9wgqs0xj1ro
>
> [6] https://issues.apache.org/jira/browse/FLINK-18614
>


[jira] [Created] (FLINK-32435) Merge testing implementations of LeaderContender

2023-06-26 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-32435:
-

 Summary: Merge testing implementations of LeaderContender
 Key: FLINK-32435
 URL: https://issues.apache.org/jira/browse/FLINK-32435
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination, Tests
Reporter: Matthias Pohl


We have several testing implementations of the {{LeaderContender}} interface. 
We could merge all of them into a single {{TestingLeaderContender}} 
implementation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32436) Remove obsolete LeaderContender.getDescription method

2023-06-26 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-32436:
-

 Summary: Remove obsolete LeaderContender.getDescription method
 Key: FLINK-32436
 URL: https://issues.apache.org/jira/browse/FLINK-32436
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Matthias Pohl


{{LeaderContender.getDescription}} was barely used (only for the log output in 
the ZK driver implementation). With the {{contenderID}} becoming a more 
fundamental property of the {{DefaultLeaderElectionService}}, we can get rid of 
the {{getDescription}} method.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] FLIP-309: Enable operators to trigger checkpoints dynamically

2023-06-26 Thread Leonard Xu
Thanks Dong for driving this FLIP forward!

Introducing the `backlog status` concept for Flink jobs makes sense to me for
the following reasons:

From the concept/API design perspective, it's more general and natural than
the above proposals, as it can be used in HybridSource for bounded records, in
CDC sources for the history snapshot, and in general sources like KafkaSource
for historical messages.

From the use-case/requirements perspective, I've seen many users manually set
a larger checkpoint interval during backfilling and then a shorter one for
real-time processing in their production environments, as a Flink application
optimization. With this FLIP, the framework can apply this optimization
without requiring the user to change the checkpoint interval and restart the
job multiple times.
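For illustration, a hedged sketch of the job-level configuration this would
enable (the second option key is taken from the FLIP discussion and assumed
here, not an established Flink option):

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    class BacklogIntervalSketch {
        static StreamExecutionEnvironment create() {
            Configuration conf = new Configuration();
            // Short interval for the real-time phase.
            conf.setString("execution.checkpointing.interval", "30s");
            // Assumed option: longer interval used while sources report backlog.
            conf.setString("execution.checkpointing.interval-during-backlog", "10min");
            return StreamExecutionEnvironment.getExecutionEnvironment(conf);
        }
    }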

Beyond supporting a larger checkpoint interval for jobs under backlog status
in the current FLIP, we could explore supporting larger parallelism/memory/CPU
for jobs under backlog status in the future.

In short, the updated FLIP looks good to me.


Best,
Leonard


> On Jun 22, 2023, at 12:07 PM, Dong Lin  wrote:
> 
> Hi Piotr,
> 
> Thanks again for proposing the isProcessingBacklog concept.
> 
> After discussing with Becket Qin and thinking about this more, I agree it
> is a better idea to add a top-level concept to all source operators to
> address the target use-case.
> 
> The main reason that changed my mind is that isProcessingBacklog can be
> described as an inherent/natural attribute of every source instance, and its
> semantics does not need to depend on any specific checkpointing policy.
> Also, we can hardcode the isProcessingBacklog behavior for the sources we
> have considered so far (e.g. HybridSource and MySQL CDC source) without
> asking users to explicitly configure the per-source behavior, which indeed
> provides better user experience.
> 
> I have updated the FLIP based on the latest suggestions. The latest FLIP no
> longer introduces per-source config that can be used by end-users. While I
> agree with you that CheckpointTrigger can be a useful feature to address
> additional use-cases, I am not sure it is necessary for the use-case
> targeted by FLIP-309. Maybe we can introduce CheckpointTrigger separately
> in another FLIP?
> 
> Can you help take another look at the updated FLIP?
> 
> Best,
> Dong
> 
> 
> 
> On Fri, Jun 16, 2023 at 11:59 PM Piotr Nowojski 
> wrote:
> 
>> Hi Dong,
>> 
>>> Suppose there are 1000 subtask and each subtask has 1% chance of being
>>> "backpressured" at a given time (due to random traffic spikes). Then at
>> any
>>> given time, the chance of the job
>>> being considered not-backpressured = (1-0.01)^1000. Since we evaluate the
>>> backpressure metric once a second, the estimated time for the job
>>> to be considered not-backpressured is roughly 1 / ((1-0.01)^1000) = 23163
>>> sec = 6.4 hours.
>>> 
>>> This means that the job will effectively always use the longer
>>> checkpointing interval. It looks like a real concern, right?
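For reference, a quick check of the arithmetic quoted above (plain Java, no
Flink dependencies):

    public class BackpressureOdds {
        public static void main(String[] args) {
            double pAllClear = Math.pow(1 - 0.01, 1000);  // ~4.3e-5 per 1s sample
            double secondsToObserve = 1 / pAllClear;      // ~23163 s
            System.out.printf("p = %.2e, ~%.1f hours%n",
                pAllClear, secondsToObserve / 3600);      // ~6.4 hours
        }
    }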
>> 
>> Sorry I don't understand where you are getting those numbers from.
>> Instead of trying to find loophole after loophole, could you try to think
>> how a given loophole could be improved/solved?
>> 
>>> Hmm... I honestly think it will be useful to know the APIs due to the
>>> following reasons.
>> 
>> Please propose something. I don't think it's needed.
>> 
>>> - For the use-case mentioned in FLIP-309 motivation section, would the
>> APIs
>>> of this alternative approach be more or less usable?
>> 
>> Everything that you originally wanted to achieve in FLIP-309, you could do
>> as well in my proposal.
>> Vide my many mentions of the "hacky solution".
>> 
>>> - Can these APIs reliably address the extra use-case (e.g. allow
>>> checkpointing interval to change dynamically even during the unbounded
>>> phase) as it claims?
>> 
>> I don't see why not.
>> 
>>> - Can these APIs be decoupled from the APIs currently proposed in
>> FLIP-309?
>> 
>> Yes
>> 
>>> For example, if the APIs of this alternative approach can be decoupled
>> from
>>> the APIs currently proposed in FLIP-309, then it might be reasonable to
>>> work on this extra use-case with a more advanced/complicated design
>>> separately in a followup work.
>> 
>> As I voiced my concerns previously, the current design of FLIP-309 would
>> clog the public API and in the long run confuse the users. IMO It's
>> addressing the
>> problem in the wrong place.
>> 
>>> Hmm.. do you mean we can do the following:
>>> - Have all source operators emit a metric named "processingBacklog".
>>> - Add a job-level config that specifies "the checkpointing interval to be
>>> used when any source is processing backlog".
>>> - The JM collects the "processingBacklog" periodically from all source
>>> operators and uses the newly added config value as appropriate.
>> 
>> Yes.
>> 
>>> The challenge with this approach is that we need to define the semantics
>> of
>>> this "processingBacklog" metric and have all source operators
>>> implement this m

[jira] [Created] (FLINK-32437) Determine and set correct maxParallelism for operator chains

2023-06-26 Thread Stefan Richter (Jira)
Stefan Richter created FLINK-32437:
--

 Summary: Determine and set correct maxParallelism for operator 
chains
 Key: FLINK-32437
 URL: https://issues.apache.org/jira/browse/FLINK-32437
 Project: Flink
  Issue Type: Bug
  Components: API / DataStream
Reporter: Stefan Richter
Assignee: Stefan Richter
 Fix For: 1.19.0


The current code in {{StreamingJobGraphGenerator}} does not properly determine 
and set the maxParallelism of operator chains. We should set the maxParallelism 
of a chain to the minimum of the maxParallelism values among the operators in 
the chain.
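A minimal sketch of that rule, assuming each chained operator exposes its own
maxParallelism (names are hypothetical; the real change belongs in
StreamingJobGraphGenerator):

    import java.util.List;

    class ChainMaxParallelismSketch {
        /** The chain's max parallelism is the minimum across its operators. */
        static int forChain(List<Integer> operatorMaxParallelisms) {
            return operatorMaxParallelisms.stream()
                .mapToInt(Integer::intValue)
                .min()
                .orElseThrow(() -> new IllegalArgumentException("empty chain"));
        }
    }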



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Flexible Custom Commit Policy with Parameter Passing

2023-06-26 Thread Dunn Bangui
Hi yuxia,

Thank you for reviewing the code.

Specifically, when users set
*'sink.partition-commit.policy.class.parameters=''*, the 'parameters' will
not be null but will be an empty list.
In this case, it may still invoke the constructor with empty arguments.

To address this, maybe we can use the condition *'parameters != null &&
!parameters.isEmpty()'* to ensure that the 'parameters' list is not empty.
What are your thoughts on this? A sketch of the factory logic I have in mind is below. : )
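A minimal sketch of the factory-side instantiation under discussion (class and
method names are assumptions for illustration; the actual change lives in the
linked PR):

    import java.util.List;

    class PolicyInstantiationSketch {
        static Object createPolicy(String className, List<String> parameters) throws Exception {
            Class<?> clazz = Class.forName(className);
            if (parameters != null && !parameters.isEmpty()) {
                // Pass user-supplied parameters to a (String...) constructor.
                return clazz.getConstructor(String[].class)
                    .newInstance((Object) parameters.toArray(new String[0]));
            }
            // Fall back to the no-arg constructor for backward compatibility.
            return clazz.getConstructor().newInstance();
        }
    }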

Best regards,
Bangui Dunn

yuxia wrote on Sun, Jun 25, 2023 at 11:07:

> Hi, Bangui Dunn
> Review done.
> But please remember we should reach consensus about it in the discussion
> before we can merge it.
> Let's keep the discussion for a while to see any further feedback.
>
> Best regards,
> Yuxia
>
> - Original Message -
> From: "Dunn Bangui" 
> To: "dev" 
> Sent: Sunday, June 25, 2023, 10:34:50 AM
> Subject: Re: [DISCUSS] Flexible Custom Commit Policy with Parameter Passing
>
> Hi, Yuxia.
>
> Could you please provide me with an update on the review process? I am
> eager to move forward and would appreciate any guidance or feedback you can
> provide.
>
> Best regards, and belated Dragon Boat Festival blessings : )
> Bangui Dunn
>
> Gary Ccc wrote on Wed, Jun 21, 2023 at 14:44:
>
> > Hi, Yuxia.
> >
> > Thank you for your suggestion! I agree with your point and have made the
> > necessary modification.
> > If you have any additional feedback or further suggestions, please feel
> > free to let me know. : )
> >
> > Best regards,
> > Bangui Dunn
> >
> >> yuxia wrote on Wed, Jun 21, 2023 at 10:56:
> >
> >> Correct what I said in the previous email:
> >>
> >> "But will it better to make it asList ? Something
> >> like:`.stringType().asList()`."  =>  "But will it better to make it no
> >> default value? Something like:`.stringType().asList().noDefaultValue`."
> >>
> >> Best regards,
> >> Yuxia
> >>
> >> - Original Message -
> >> From: "yuxia" 
> >> To: "dev" 
> >> Sent: Wednesday, June 21, 2023, 10:25:47 AM
> >> Subject: Re: [DISCUSS] Flexible Custom Commit Policy with Parameter Passing
> >>
> >> Hi, Bangui Dunn.
> >> Thanks for reaching out to us.
> >> Generally +1 for the configuration. But would it be better to make it
> >> asList? Something like:
> >> `.stringType().asList()`.
> >>
> >> Best regards,
> >> Yuxia
> >>
> >> - Original Message -
> >> From: "Gary Ccc" 
> >> To: "dev" 
> >> Sent: Tuesday, June 20, 2023, 5:44:10 PM
> >> Subject: [DISCUSS] Flexible Custom Commit Policy with Parameter Passing
> >>
> >> To the Apache Flink Community Members,
> >> I hope this email finds you well. I am writing to discuss a potential
> >> improvement for the implementation of a custom commit policy in Flink.
> >>
> >> Background:
> >> We have encountered challenges in utilizing a custom commit policy due
> to
> >> the inability to pass parameters.
> >> This limitation restricts our ability to add additional functionality to
> >> the commit policy, such as monitoring the files associated with each
> >> commit.
> >>
> >> Purpose:
> >> The purpose of this improvement is to allow the passing of parameters to
> >> the custom PartitionCommitPolicy. By enabling parameter passing, users
> can
> >> extend the functionality of their custom commit policy.
> >>
> >> Example:
> >> Suppose we have a custom commit policy called "MyPolicy" that requires
> >> parameters such as "key" and "url" for proper functionality.
> >> Currently, it is not possible to pass these parameters when using a
> custom
> >> commit policy.
> >> However, by introducing a new option,
> >> SINK_PARTITION_COMMIT_POLICY_CLASS_PARAMETERS, users can pass
> >> parameters when using a custom commit policy, for example:
> >>
> >> 'sink.partition-commit.policy.kind'='custom',
> >> 'sink.partition-commit.policy.class'='MyPolicy',
> >> 'sink.partition-commit.policy.class.parameters'='key;url'
> >>
> >> Effect:
> >> By adding a new PartitionCommitPolicyFactory constructor, backward
> >> compatibility for existing user programs is ensured.
> >>
> >> Code PR:
> >> To support this improvement, I have submitted a pull request with the
> >> necessary code changes.
> >> You can find the details of the pull request at the following link:
> >> https://github.com/apache/flink/pull/22831/files
> >>
> >> Best regards,
> >> Bangui Dunn
> >>
> >
>


Re: [DISCUSS] Graduate the FileSink to @PublicEvolving

2023-06-26 Thread Jing Ge
Hi,

@Galen @Yuxia

Your points are valid. Speaking of removing deprecated APIs, I have the same
concern. As a matter of fact, I have been raising it in the discussion
thread of the API deprecation process[1]. This is another example showing that
we should consider more factors than just the migration period, thanks for
the hint! I will add one more update to that thread with a reference to
this thread.

In a nutshell, this thread is focusing on the graduation process. Your
valid concerns should be taken care of by the deprecation process.
Please don't hesitate to share your thoughts in that thread.


Best regards,
Jing

[1] https://lists.apache.org/thread/vmhzv8fcw2b33pqxp43486owrxbkd5x9


On Sun, Jun 25, 2023 at 3:48 AM yuxia  wrote:

> Thanks Jing for bringing this up for discussion.
> I agree it's not a blocker for graduating the FileSink to @PublicEvolving,
> since the Sink interface, which is the root cause, is already marked
> @PublicEvolving. But I do share Galen's concern. At the least, it should be
> a blocker for removing StreamingFileSink.
> Btw, migrating to Sink seems to be a real headache; we may need to pay more
> attention to this ticket and try to fix it.
>
> Best regards,
> Yuxia
>
> - Original Message -
> From: "Galen Warren" 
> To: "dev" 
> Sent: Friday, June 23, 2023, 7:47:24 PM
> Subject: Re: [DISCUSS] Graduate the FileSink to @PublicEvolving
>
> Thanks Jing. I can only offer my perspective on this, others may view it
> differently.
>
> If FileSink is subject to data loss in the "stop-on-savepoint then restart"
> scenario, that makes it unusable for me, and presumably for anyone who uses
> it in a long-running streaming application and who cannot tolerate data
> loss. I still use the (deprecated!) StreamingFileSink for this reason.
>
> The bigger picture here is that StreamingFileSink is deprecated and will
> presumably ultimately be removed, to be replaced with FileSink. Graduating
> the status of FileSink seems to be a step along that path; I'm concerned
> about continuing down that path with such a critical issue present.
> Ultimately, my concern is that FileSink will graduate fully and that
> StreamingFileSink will be removed and that there will be no remaining
> option to reliably stop/start streaming jobs that write to files without
> incurring the risk of data loss.
>
> I'm sure I'd feel better about things if there were an ongoing effort to
> address this FileSink issue and/or a commitment that StreamingFileSink
> would not be removed until this issue is addressed.
>
> My two cents -- thanks.
>
>
> On Fri, Jun 23, 2023 at 1:47 AM Jing Ge 
> wrote:
>
> > Hi Galen,
> >
> > Thanks for the hint which is helpful for us to have a clear big picture.
> > Afaiac, this will not be a blocking issue for the graduation. There will
> > always be some (potential) bugs in the implementation. The API has been
> > very stable since 2020. The timing is good to graduate. WDYT?
> > Furthermore, I'd like to have more opinions. All opinions together will
> > help the community build a mature API graduation process.
> >
> > Best regards,
> > Jing
> >
> > On Tue, Jun 20, 2023 at 12:48 PM Galen Warren
> >  wrote:
> >
> > > Is this issue still unresolved?
> > >
> > >
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/FLINK-30238
> > >
> > > Based on prior discussion, I believe this could lead to data loss with
> > > FileSink.
> > >
> > >
> > >
> > > On Tue, Jun 20, 2023, 5:41 AM Jing Ge 
> > wrote:
> > >
> > > > Hi all,
> > > >
> > > > The FileSink has been marked as @Experimental[1] since Oct. 2020.
> > > > According to FLIP-197[2], I would like to propose graduating it
> > > > to @PublicEvolving in the upcoming 1.18 release.
> > > >
> > > > On the other hand, as a related topic, FileSource was marked
> > > > as @PublicEvolving[3] 3 years ago. It deserves a graduation
> discussion
> > > too.
> > > > To keep this discussion lean and efficient, let's focus on FileSink in
> > > > this thread. There will be another discussion thread for the FileSource.
> > > >
> > > > I was wondering if anyone might have any concerns. Looking forward to
> > > > hearing from you.
> > > >
> > > >
> > > > Best regards,
> > > > Jing
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://github.com/apache/flink/blob/4006de973525c5284e9bc8fa6196ab7624189261/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/sink/FileSink.java#L129
> > > > [2]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-197%3A+API+stability+graduation+process
> > > > [3]
> > > >
> > > >
> > >
> >
> https://github.com/apache/flink/blob/4006de973525c5284e9bc8fa6196ab7624189261/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/FileSource.java#L95
> > > >
> > >
> >
>


Re: [DISCUSS] FLIP-321: Introduce an API deprecation process

2023-06-26 Thread Jing Ge
Hi all,

Just want to make sure we are on the same page. There is another example[1]
I was aware of recently that shows why more factors need to be taken care
of than just the migration period. Thanks Galen for your hint.

To put it simply, the concern about API deprecation is not that deprecated
APIs are removed too early (a minimum migration period is required). The major
concern is that APIs stay marked as deprecated for (too) long, much longer
than the migration period discussed in this thread, afaik. Since there is no
clear picture/definition, users don't know when to do the migration (after the
migration period has expired), and Flink developers don't know when to remove
deprecated APIs.

Based on all the information I knew, there are two kinds of obstacles that
will and should block the deprecation process:

1. Lack of functionality in the new APIs. This happens e.g. with the
SourceFunction to FLIP-27 Source migration; users who rely on the missing
functionality cannot migrate to the new APIs.
2. The new APIs have critical bugs. An example can be found at [1]; users
have to stick to the deprecated APIs.

Since FLIP-321 is focusing on the API deprecation process, those blocking
issues deserve attention and should be put into the FLIP. The current FLIP
seems to only focus on migration periods. If we consider those blocking
issues as orthogonal issues that are beyond the scope of this discussion,
does it make sense to change the FLIP title to something like "Introduce
minimum migration periods of API deprecation process"?

Best regards,
Jing

[1] https://lists.apache.org/thread/wxoo7py5pqqlz37l4w8jrq6qdvsdq5wc

On Sun, Jun 25, 2023 at 2:01 PM Jark Wu  wrote:

> I agree with Jingsong and Becket.
>
> Look at the legacy SourceFunction (a small part of the DataStream API):
> it still isn't, and can't be, marked deprecated[1] even now, 2 years after
> the new Source was released, because the new Source still can't fully cover
> the abilities of the legacy API. Considering that the DataStream
> API is the most fundamental and complex API of Flink, I think it is worth
> a longer deprecation period than the general process prescribes, to
> wait for the new API to mature. The above 2 options sound a bit rushed
> for such a widely used API.
>
> I fully understand the concern of maintenance overhead, but it's a bit hard
> for others to estimate maintenance costs without a concrete design and code
> of the new ProcessFunction API. I agree with Becket that maybe we can
> re-evaluate the API deprecation process once we have the new
> ProcessFunction
> API. If the maintenance is indeed huge, I think it is reasonable to have a
> special rule for this case at that time.
>
> Best,
> Jark
>
>
> [1]: https://issues.apache.org/jira/browse/FLINK-28045
>
> On Sun, 25 Jun 2023 at 16:22, Becket Qin  wrote:
>
> > Hi Jingsong,
> >
> > Thanks for the reply. I completely agree with you.
> >
> > The above 2 options are based on the assumption that the community cannot
> > afford to maintain the deprecated DataStream API for long. I'd say we
> > should try everything we can to maintain it for as much time as possible.
> > DataStream API is actually the most used API in Flink by so many users at
> > this point. Removing it any time soon will dramatically hurt our users.
> So
> > ideally we should keep it for at least 2 years after deprecation, if not
> > more.
> >
> > The prohibitively high maintenance overhead is just an assumption.
> > Personally speaking, I don't feel this assumption is necessarily true. We
> > should re-evaluate once we have the new ProcessFunction API in place.
> > Without the code it is hard to tell for sure. I am actually kind of
> > optimistic about the maintenance cost.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> > On Sun, Jun 25, 2023 at 11:30 AM Jingsong Li 
> > wrote:
> >
> > > Thanks Becket and all for your discussion.
> > >
> > > > 1. We say this FLIP is enforced starting release 2.0. For current 1.x
> > > APIs,
> > > we provide a migration period with best effort, while allowing
> exceptions
> > > for immediate removal in 2.0. That means we will still try with best
> > effort
> > > to get the ProcessFuncion API ready and deprecate the DataStream API in
> > > 1.x, but will also be allowed to remove DataStream API in 2.0 if it's
> not
> > > deprecated 2 minor releases before the major version bump.
> > >
> > > > 2. We strictly follow the process in this FLIP, and will quickly bump
> > the
> > > major version from 2.x to 3.0 once the migration period for DataStream
> > API
> > > is reached.
> > >
> > > Sorry, I didn't read the previous detailed discussion because the
> > > discussion list was so long.
> > >
> > > I don't really like either of these options.
> > >
> > > Considering that DataStream is such an important API, can we offer a
> > third
> > > option:
> > >
> > > 3. Maintain the DataStream API throughout 2.X and remove it only in 3.x. But
> > > there's no need to assume that 2.X is a sh

Re: [DISCUSS] Release 2.0 Work Items

2023-06-26 Thread Chesnay Schepler

By and large, I'm quite happy with the list of items.

I'm curious as to why the "Disaggregated State Management" item is 
marked as a must-have; will it require changes that break something? 
What prevents it from being added in 2.1?


We may want to update the Java 17 item to "Make Java 17 the default, 
drop Java 8/11". Maybe even split it into a must-have "Drop Java 8" and 
a nice-to-have "Drop Java 11"?


"Move Calcite rules from Scala to Java": I would hope that this would be 
an entirely internal change, and could thus be an incremental process 
independent of major releases.

What is the actual scale of this item; how much are we actually re-writing?

"Add MetricGroup#getLogicalScope": I'd raise this to a must-have; i 
think I marked it down as nice-to-have only because it depends on 
another item.


The ProcessFunction API item is giving me the most headaches because 
it's very unclear what it actually entails; like is it an entirely 
separate API to DataStream (sounds like it is!) or an extension of 
DataStream. How much will it share the internals with DataStream etc.; 
how does it relate to the Table API (w.r.t. switching APIs / what Table 
API uses underneath).


There are a few items I added as ideas which don't have a priority yet; 
would love to get some feedback on those.


On 21/06/2023 08:41, Xintong Song wrote:

Hi devs,

As previously discussed in [1], we had been collecting work item proposals
for the 2.0 release until June 15th, on the wiki page [2].

- As we have passed the due date, I'd like to kindly remind everyone *not
to add / remove items directly on the wiki page*. If needed, please post
in this thread or reach out to the release managers instead.
- I've reached out to some folks for clarifications about their
proposals. Some of them mentioned that they can not yet tell whether we
should do an item or not, and would need more time / discussions to make
the decision. So I added a new symbol for items whose priorities are `TBD`.

Now it's time to collaboratively decide a minimum set of must-have items.
I've gone through the entire list of proposed items, and found most of them
make a lot of sense. So I think an online sync might not be necessary for
this. I'd like to go with this DISCUSS thread, where everyone can comment
on how they think the list can be improved, followed by a VOTE to formally
make the decision.

Any feedback and opinions, including but not limited to the following
aspects, will be appreciated.

- Important items that are missing from the list
- Concerns regarding the listed items or their priorities

Looking forward to your feedback.

Best,

Xintong


[1]
https://lists.apache.org/list?dev@flink.apache.org:lte=1M:release%202.0%20status%20updates

[2] https://cwiki.apache.org/confluence/display/FLINK/2.0+Release



[jira] [Created] (FLINK-32438) Merge AbstractZooKeeperHaServices and ZooKeeperMultipleComponentLeaderElectionHaServices

2023-06-26 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-32438:
-

 Summary: Merge AbstractZooKeeperHaServices and 
ZooKeeperMultipleComponentLeaderElectionHaServices
 Key: FLINK-32438
 URL: https://issues.apache.org/jira/browse/FLINK-32438
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Matthias Pohl


{{AbstractZooKeeperHaServices}} isn't needed anymore with the legacy ZK leader 
election being gone.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] FLIP-316: Introduce SQL Driver

2023-06-26 Thread Paul Lam
Hi Shengkai,

> * How can we ship the json plan to the JobManager?

The Flink K8s module should be responsible for file distribution. We could
introduce an option like `kubernetes.storage.dir`. Each Flink cluster would
get a dedicated subdirectory, following the pattern
`${kubernetes.storage.dir}/${cluster-id}`.

All resource-related options (e.g. pipeline jars, JSON plans) configured with
the scheme `file://` would be uploaded to the resource directory and
downloaded to the JobManager before the SQL Driver accesses the files under
their original filenames.
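A minimal sketch of that path resolution (the option key and helper are
assumptions from this proposal, not existing Flink API):

    import java.net.URI;
    import org.apache.flink.configuration.Configuration;

    class ResourcePathSketch {
        // Hypothetical option key proposed in this thread.
        static final String STORAGE_DIR_KEY = "kubernetes.storage.dir";

        /** Resolves the per-cluster resource dir: ${kubernetes.storage.dir}/${cluster-id}. */
        static URI clusterResourceDir(Configuration conf, String clusterId) {
            String baseDir = conf.getString(STORAGE_DIR_KEY, "hdfs:///flink/resources");
            return URI.create(baseDir + "/" + clusterId);
        }
    }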


> * Classloading strategy


We could directly specify the SQL Gateway jar as the jar file in
PackagedProgram. It would be treated like a normal user jar, and the SQL
Driver would be loaded into the user classloader. WDYT?
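A sketch of that idea with the existing PackagedProgram builder (the jar path
and entry-point class name are assumptions for illustration):

    import java.io.File;
    import org.apache.flink.client.program.PackagedProgram;

    class DriverProgramSketch {
        static PackagedProgram buildDriverProgram() throws Exception {
            return PackagedProgram.newBuilder()
                // Assumed location of the gateway jar containing the SQL Driver.
                .setJarFile(new File("/opt/flink/opt/flink-sql-gateway.jar"))
                .setEntryPointClassName("org.apache.flink.table.gateway.SqlDriver")
                .setArguments("--json-plan", "plan.json")
                .build();
        }
    }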

> * Option `$internal.sql-gateway.driver.sql-config` is string type
> I think it's better to use Map type here

By Map type configuration, do you mean a nested map that contains all
configurations?

I hope I've explained myself well: it's a file that contains the extra SQL
configurations, which would be shipped to the JobManager.

> * PoC branch

Sure. I’ll let you know once I get the job done.

Best,
Paul Lam

> On Jun 26, 2023, at 14:27, Shengkai Fang  wrote:
> 
> Hi, Paul.
> 
> Thanks for your update. I have a few questions about the new design: 
> 
> * How can we ship the json plan to the JobManager?
> 
> The current design only exposes an option for the URL of the JSON plan. It 
> seems the gateway is responsible for uploading it to external storage. Can we 
> reuse PipelineOptions.JARS to ship it to the remote filesystem? 
> 
> * Classloading strategy
> 
> Currently, the Driver is in the sql-gateway package. This means the Driver is 
> not on the JM's classpath directly, because the sql-gateway jar is in the 
> opt directory rather than the lib directory. It may need to add the external 
> dependencies as Python does[1]. BTW, I think it's better to move the Driver 
> into the flink-table-runtime package, which is much easier to find (sorry for 
> the wrong opinion before).
> 
> * Option `$internal.sql-gateway.driver.sql-config` is string type
> 
> I think it's better to use Map type here
> 
> * PoC branch
> 
> Because this FLIP involves many modules, do you have a PoC branch to verify 
> it does work? 
> 
> Best,
> Shengkai
> 
> [1] 
> https://github.com/apache/flink/blob/master/flink-yarn/src/main/java/org/apache/flink/yarn/YarnClusterDescriptor.java#L940
>  
> 
> Paul Lam wrote on Mon, Jun 19, 2023 at 14:09:
> Hi Shengkai,
> 
> Sorry for my late reply. It took me some time to update the FLIP.
> 
> In the latest FLIP design, SQL Driver is placed in flink-sql-gateway module. 
> PTAL.
> 
> The FLIP does not cover details about the K8s file distribution, but its 
> general usage would
> be very much the same as YARN setups. We could make follow-up discussions in 
> the jira
> tickets.
> 
> Best,
> Paul Lam
> 
>> On Jun 12, 2023, at 15:29, Shengkai Fang wrote:
>> 
>> 
>> > If it’s the case, I’m good with introducing a new module and making SQL 
>> > Driver
>> > an internal class and accepts JSON plans only.
>> 
>> I've thought about this again and again. I think it's better to move the SqlDriver 
>> into the sql-gateway module, because the sql client relies on the sql-gateway 
>> to submit the SQL, and the sql-gateway now has the ability to generate the 
>> ExecNodeGraph. +1 to support accepting JSON plans only.
>> 
>> * Upload configuration through command line parameter
>> 
>> ExecNodeGraph only contains the job's information but it doesn't contain the 
>> checkpoint dir, checkpoint interval, execution mode and so on. So I think we 
>> should also upload the configuration.
>> 
>> * KubernetesClusterDescriptor and KubernetesApplicationClusterEntrypoint 
>> are responsible for the jar upload/download
>> 
>> +1 for the change.
>> 
>> Could you update the FLIP about the current discussion? 
>> 
>> Best,
>> Shengkai
>> 
>> 
>> 
>> 
>> 
>> 
>> Yang Wang wrote on Mon, Jun 12, 2023 at 11:41:
>> Sorry for the late reply. I am in favor of introducing such a built-in
>> resource localization mechanism
>> based on Flink FileSystem. Then FLINK-28915[1] could be the second step
>> which will download
>> the jars and dependencies to the JobManager/TaskManager local directory
>> before working.
>> 
>> The first step could be done in another ticket in Flink. Or some external
>> Flink jobs management system
>> could also take care of this.
>> 
>> [1]. https://issues.apache.org/jira/browse/FLINK-28915 
>> 
>> 
>> Best,
>> Yang
>> 
>> Paul Lam wrote on Fri, Jun 9, 2023 at 17:39:
>> 
>> > Hi Mason,
>> >
>> > I get your point. I'm increasingly feeling the need to introduce a
>> > built-in
>> > file distributi

Re: [VOTE] Release flink-connector-jdbc v3.1.1, release candidate #2

2023-06-26 Thread Leonard Xu
+1 (binding)

- built from source code succeeded
- verified signatures
- verified hashsums 
- checked release notes
- checked that the contents contain jar and pom files in the Apache repo 
- reviewed the web PR 

Best,
Leonard

> On Jun 19, 2023, at 6:57 PM, Danny Cranmer  wrote:
> 
> Thanks for driving this Martijn.
> 
> +1 (binding)
> 
> - Reviewed web PR
> - Jira release notes look good
> - Tag exists in Github
> - Source archive signature/checksum looks good
> - Binary (from Maven) signature/checksum looks good
> - No binaries in the source archive
> - Source archive builds from source and tests pass
> - CI passes [1]
> 
> Non blocking findings:
> - NOTICE files year is 2022 and needs to be updated to 2023
> - pom.xml is referencing Flink 1.17.0 and can be updated to 1.17.1
> - Some unit tests (notably OracleExactlyOnceSinkE2eTest) appear to be
> integration/e2e and are run in the unit test suite
> 
> Thanks,
> Danny
> 
> [1] https://github.com/apache/flink-connector-jdbc/actions/runs/5278297177
> 
> On Thu, Jun 15, 2023 at 12:40 PM Martijn Visser 
> wrote:
> 
>> Hi everyone,
>> Please review and vote on the release candidate #2 for the version 3.1.1,
>> as follows:
>> [ ] +1, Approve the release
>> [ ] -1, Do not approve the release (please provide specific comments)
>> 
>> 
>> The complete staging area is available for your review, which includes:
>> * JIRA release notes [1],
>> * the official Apache source release to be deployed to dist.apache.org
>> [2],
>> which are signed with the key with fingerprint
>> A5F3BCE4CBE993573EC5966A65321B8382B219AF [3],
>> * all artifacts to be deployed to the Maven Central Repository [4],
>> * source code tag v3.1.1-rc2 [5],
>> * website pull request listing the new release [6].
>> 
>> The vote will be open for at least 72 hours. It is adopted by majority
>> approval, with at least 3 PMC affirmative votes.
>> 
>> Thanks,
>> Release Manager
>> 
>> [1]
>> 
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353281
>> [2]
>> https://dist.apache.org/repos/dist/dev/flink/flink-connector-jdbc-3.1.1-rc2
>> [3] https://dist.apache.org/repos/dist/release/flink/KEYS
>> [4] https://repository.apache.org/content/repositories/orgapacheflink-1642
> [5] https://github.com/apache/flink-connector-jdbc/releases/tag/v3.1.1-rc2
>> [6] https://github.com/apache/flink-web/pull/654
>> 



[jira] [Created] (FLINK-32439) Kubernetes operator is silently overwriting the "execution.savepoint.path" config

2023-06-26 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-32439:
--

 Summary: Kubernetes operator is silently overwriting the 
"execution.savepoint.path" config
 Key: FLINK-32439
 URL: https://issues.apache.org/jira/browse/FLINK-32439
 Project: Flink
  Issue Type: Bug
  Components: Kubernetes Operator
Reporter: Robert Metzger


I recently stumbled across the fact that the K8s operator is silently deleting 
/ overwriting the execution.savepoint.path config option.

I understand why this happens, but I wonder if the operator should write a log 
message if the user configured the execution.savepoint.path option.

And/or add a list of "operator-managed" config options to the docs? A sketch of 
the suggested warning is below.

https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java#L155-L159
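A minimal sketch of that warning (the surrounding reconciler wiring is
assumed; SavepointConfigOptions.SAVEPOINT_PATH is the existing option for
"execution.savepoint.path"):

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.runtime.jobgraph.SavepointConfigOptions;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    class SavepointOverwriteWarning {
        private static final Logger LOG =
            LoggerFactory.getLogger(SavepointOverwriteWarning.class);

        static void warnIfUserConfigured(Configuration userConf) {
            if (userConf.contains(SavepointConfigOptions.SAVEPOINT_PATH)) {
                LOG.warn("User-configured '{}' will be overwritten by the operator.",
                    SavepointConfigOptions.SAVEPOINT_PATH.key());
            }
        }
    }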



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[VOTE] Apache Flink ML Release 2.3.0, release candidate #1

2023-06-26 Thread Dong Lin
Hi everyone,

We would like to start voting for the Flink ML 2.3.0 release. This
release primarily
provides the ability to run Flink ML on Flink 1.15, 1.16 and 1.17.

Please review and vote on the release candidate #1 for version 2.3.0 of
Apache Flink ML as follows.

[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

**Testing Guideline**

You can find here [1] a page in the project wiki on instructions for
testing.

To cast a vote, it is not necessary to perform all listed checks, but
please mention which checks you have performed when voting.

**Release Overview**

As an overview, the release consists of the following:
a) Flink ML source release to be deployed to dist.apache.org
b) Flink ML Python source distributions to be deployed to PyPI
c) Maven artifacts to be deployed to the Maven Central Repository

**Staging Areas to Review**

The staging areas containing the above-mentioned artifacts are as follows, for
your review:

- All artifacts for a) and b) can be found in the corresponding dev repository
at dist.apache.org [2], which are signed with the key with fingerprint AFAC
DB09 E6F0 FF28 C93D  64BC BEED 4F6C B9F7 7D0E [3]
- All artifacts for c) can be found at the Apache Nexus Repository [4]

**Other links for your review**

- JIRA release notes [5]
- Source code tag "release-2.3.0-rc1" [6]
- PR to update the website Downloads page to include Flink ML links [7]

**Vote Duration**

The voting time will run for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.


Cheers,
Dong


[1] https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+ML+Release
[2] https://dist.apache.org/repos/dist/dev/flink/flink-ml-2.3.0-rc1/
[3] https://dist.apache.org/repos/dist/release/flink/KEYS
[4] https://repository.apache.org/content/repositories/orgapacheflink-1645/
[5]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353096
[6] https://github.com/apache/flink-ml/releases/tag/release-2.3.0-rc1
[7] https://github.com/apache/flink-web/pull/659


Re: [DISCUSS] FLIP-321: Introduce an API deprecation process

2023-06-26 Thread Xintong Song
>
> Considering DataStream API is the most fundamental and complex API of
> Flink, I think it is worth a longer time than the general process for the
> deprecation period to wait for the new API be mature.
>

This inspires me. In this specific case, rather than asking how long after
deprecation the DataStream API should be removed, it's probably more important
to ask how long it would take for the ProcessFunction API to become mature and
stable after being introduced. According to FLIP-197[1], it takes 4 minor
releases by default to promote an @Experimental API to @Public. And for the
ProcessFunction API, which aims to replace DataStream API as one of the most
fundamental APIs of Flink, I'd expect this to take at least the default time,
or even longer. We probably should wait until we believe the ProcessFunction
API is stable before marking DataStream API as deprecated, rather than doing
so as soon as it's introduced. Assuming we introduce the ProcessFunction API
in 2.0, that means we would need to wait for 6 minor releases (4 for the new
API to become stable, and 2 for the migration period) to remove DataStream
API, which is ~2.5 years (assuming 5 months / minor release). That sounds
acceptable for another major version bump.

To wrap things up, it seems to me, sadly, that either way we cannot avoid the
overhead of maintaining both the DataStream & ProcessFunction APIs for at
least 6 minor releases.

Best,

Xintong


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-197%3A+API+stability+graduation+process



On Mon, Jun 26, 2023 at 5:41 PM Jing Ge  wrote:

> Hi all,
>
> Just want to make sure we are on the same page. There is another example[1]
> I was aware of recently that shows why more factors need to be taken care
> of than just the migration period. Thanks Galen for your hint.
>
> To put it simply, the concern about API deprecation is not that deprecated
> APIs have been removed too early (min migration period is required). The
> major concern is that APIs are marked as deprecated for a (too) long time,
> much longer than the migration period discussed in this thread, afaik.
> Since there is no clear picture/definition, no one knows when to do the
> migration for users(after the migration period has expired) and when to
> remove deprecated APIs for Flink developers.
>
> Based on all the information I knew, there are two kinds of obstacles that
> will and should block the deprecation process:
>
> 1. Lack of functionalities in new APIs. It happens e.g. with the
> SourceFunction to FLIP-27 Source migration. Users who rely on those
> functions can not migrate to new APIs.
> 2. new APIs have critical bugs. An example could be found at [1]. Users
> have to stick to the deprecated APIs.
>
> Since FLIP-321 is focusing on the API deprecation process, those blocking
> issues deserve attention and should be put into the FLIP. The current FLIP
> seems to only focus on migration periods. If we consider those blocking
> issues as orthogonal issues that are beyond the scope of this discussion,
> does it make sense to change the FLIP title to something like "Introduce
> minimum migration periods of API deprecation process"?
>
> Best regards,
> Jing
>
> [1] https://lists.apache.org/thread/wxoo7py5pqqlz37l4w8jrq6qdvsdq5wc
>
> On Sun, Jun 25, 2023 at 2:01 PM Jark Wu  wrote:
>
> > I agree with Jingsong and Becket.
> >
> > Look at the legacy SourceFunction (a small part of DataStream API),
> > the SourceFunction is still not and can't be marked deprecated[1] until
> > now after the new Source was released 2 years ago, because the new Source
> > still can't fully consume the abilities of legacy API. Considering
> > DataStream
> > API is the most fundamental and complex API of Flink, I think it is worth
> > a longer time than the general process for the deprecation period to
> > wait for the new API be mature. The above 2 options sound a bit of rush
> > for such a widely used API.
> >
> > I fully understand the concern of maintenance overhead, but it's a bit
> hard
> > for others to estimate maintenance costs without a concrete design and
> code
> > of the new ProcessFunction API. I agree with Becket that maybe we can
> > re-evaluate the API deprecation process once we have the new
> > ProcessFunction
> > API. If the maintenance is indeed huge, I think it is reasonable to have
> a
> > special rule for this case at that time.
> >
> > Best,
> > Jark
> >
> >
> > [1]: https://issues.apache.org/jira/browse/FLINK-28045
> >
> > On Sun, 25 Jun 2023 at 16:22, Becket Qin  wrote:
> >
> > > Hi Jingsong,
> > >
> > > Thanks for the reply. I completely agree with you.
> > >
> > > The above 2 options are based on the assumption that the community
> cannot
> > > afford to maintain the deprecated DataStream API for long. I'd say we
> > > should try everything we can to maintain it for as much time as
> possible.
> > > DataStream API is actually the most used API in Flink by so many users
> at
> > > this point. Removing it any time soon will dramatically hurt our users.

Re: [DISCUSS] Flink REST API improvements

2023-06-26 Thread David Morávek
Hi Hong,

Thanks for starting the discussion.

seems to be using the cached version of the entire Execution graph (stale
> data), when it could just use the CheckpointStatsCache directly


CheckpointStatsCache is also populated using the "cached execution graph,"
so there is nothing to gain from the "staleness" pov; see
AbstractCheckpointHandler for more details.

Anyone aware of a reason we don’t do this already?
>

The CheckpointStatsCache is populated lazily on the request for a
particular checkpoint (so it might not have a full view); the data
structure used is also slightly different. One more thing: the
CheckpointStatsCache is meant for a different purpose, keeping a particular
checkpoint around while it's being investigated (otherwise, it might
expire); using it for the "overview" would break this.

Configuration for web.refresh-interval controls both dashboard refresh rate
> and ExecutionGraph cache
>

This sounds reasonable as long as it falls back to "web.refresh-interval"
when not defined. For consistency reasons, it should be also named
"rest.cache-timeout"


> Cache-Control on the HTTP headers.
>

In general, I'd be in favor of this ("rest.cache-timeout" would then need
to become "rest.default-cache-timeout"), but I need to see a detailed FLIP
because in my mind this could get quite complicated.
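For illustration, a hedged sketch of what attaching Cache-Control to a REST
response could look like (plain Netty API, on which the Flink REST endpoint is
built; the surrounding handler wiring is assumed):

    import io.netty.handler.codec.http.DefaultFullHttpResponse;
    import io.netty.handler.codec.http.HttpHeaderNames;
    import io.netty.handler.codec.http.HttpResponseStatus;
    import io.netty.handler.codec.http.HttpVersion;

    class CacheControlSketch {
        /** Marks a REST response as cacheable for the given timeout. */
        static DefaultFullHttpResponse withCacheControl(long timeoutSeconds) {
            DefaultFullHttpResponse response =
                new DefaultFullHttpResponse(HttpVersion.HTTP_1_1, HttpResponseStatus.OK);
            response.headers().set(HttpHeaderNames.CACHE_CONTROL, "max-age=" + timeoutSeconds);
            return response;
        }
    }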

Best,
D.

On Fri, Jun 23, 2023 at 6:26 PM Teoh, Hong 
wrote:

> Hi all,
>
> I have been looking at the Flink REST API implementation, and had some
> question on potential improvements. Looking to gather some thoughts:
>
> 1. Only use what is necessary. The GET /checkpoints API seems to be using
> the cached version of the entire Execution graph (stale data), when it
> could just use the CheckpointStatsCache directly. I am thinking of doing
> this refactoring. Anyone aware of a reason we don’t do this already?
> 2. Configuration for web.refresh-interval controls both dashboard refresh
> rate and ExecutionGraph cache. I am thinking of introducing a new
> configuration, rest.cache.timeout
> 3. Cache-Control on the HTTP headers. Seems like we are using caches in
> our REST endpoint. It would be step in the right direction to introduce
> cache-control in our REST API headers, so that we can improve the
> programmatic access of the Flink REST API.
>
>
> Looking forwards to hearing people’s thoughts.
>
> Regards,
> Hong
>
>


Re: [DISCUSS] Release 2.0 Work Items

2023-06-26 Thread Xintong Song
>
> The ProcessFunction API item is giving me the most headaches because it's
> very unclear what it actually entails; like is it an entirely separate API
> to DataStream (sounds like it is!) or an extension of DataStream. How much
> will it share the internals with DataStream etc.; how does it relate to the
> Table API (w.r.t. switching APIs / what Table API uses underneath).
>

I totally understand your confusion. We started planning this after kicking
off the release 2.0, so there's still a lot to be explored and the plan
keeps changing.


   - In the beginning, we planned to do an in-place refactor of the DataStream
   API, until the API migration period was proposed.
   - Then we wanted to make it an entirely separate API from DataStream, and
   listed it as a must-have for release 2.0 so that we can remove DataStream once
   it's ready.
   - However, depending on the outcome of the API compatibility discussion
   [1], we may not be able to remove DataStream in 2.0 anyway, which means we
   might need to re-evaluate the necessity of this item for 2.0.

I'd say we wait a bit longer for the compatibility discussion [1] and
decide the priority for this item afterwards.


Best,

Xintong


[1] https://lists.apache.org/list.html?dev@flink.apache.org


On Mon, Jun 26, 2023 at 6:00 PM Chesnay Schepler  wrote:

> by-and-large I'm quite happy with the list of items.
>
> I'm curious as to why the "Disaggregated State Management" item is marked
> as a must-have; will it require changes that break something? What prevents
> it from being added in 2.1?
>
> We may want to update the Java 17 item to "Make Java 17 the default, drop
> Java 8/11". Maybe even split it into a must-have "Drop Java 8" and a
> nice-to-have "Drop Java 11"?
>
> "Move Calcite rules from Scala to Java": I would hope that this would be
> an entirely internal change, and could thus be an incremental process
> independent of major releases.
> What is the actual scale of this item; how much are we actually re-writing?
>
> "Add MetricGroup#getLogicalScope": I'd raise this to a must-have; i think
> I marked it down as nice-to-have only because it depends on another item.
>
> The ProcessFunction API item is giving me the most headaches because it's
> very unclear what it actually entails; like is it an entirely separate API
> to DataStream (sounds like it is!) or an extension of DataStream. How much
> will it share the internals with DataStream etc.; how does it relate to the
> Table API (w.r.t. switching APIs / what Table API uses underneath).
>
> There are a few items I added as ideas which don't have a priority yet;
> would love to get some feedback on those.
>
> On 21/06/2023 08:41, Xintong Song wrote:
>
> Hi devs,
>
> As previously discussed in [1], we had been collecting work item proposals
> for the 2.0 release until June 15th, on the wiki page [2].
>
>- As we have passed the due date, I'd like to kindly remind everyone *not
>to add / remove items directly on the wiki page*. If needed, please post
>in this thread or reach out to the release managers instead.
>- I've reached out to some folks for clarifications about their
>proposals. Some of them mentioned that they can not yet tell whether we
>should do an item or not, and would need more time / discussions to make
>the decision. So I added a new symbol for items whose priorities are `TBD`.
>
> Now it's time to collaboratively decide a minimum set of must-have items.
> I've gone through the entire list of proposed items, and found most of them
> make a lot of sense. So I think an online sync might not be necessary for
> this. I'd like to go with this DISCUSS thread, where everyone can comment
> on how they think the list can be improved, followed by a VOTE to formally
> make the decision.
>
> Any feedback and opinions, including but not limited to the following
> aspects, will be appreciated.
>
>- Important items that are missing from the list
>- Concerns regarding the listed items or their priorities
>
> Looking forward to your feedback.
>
> Best,
>
> Xintong
>
>
> [1] https://lists.apache.org/list?dev@flink.apache.org:lte=1M:release%202.0%20status%20updates
>
> [2] https://cwiki.apache.org/confluence/display/FLINK/2.0+Release
>
>
>


[jira] [Created] (FLINK-32440) Introduce file merging configuration

2023-06-26 Thread Yanfei Lei (Jira)
Yanfei Lei created FLINK-32440:
--

 Summary: Introduce file merging configuration
 Key: FLINK-32440
 URL: https://issues.apache.org/jira/browse/FLINK-32440
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Checkpointing, Runtime / State Backends
Affects Versions: 1.18.0
Reporter: Yanfei Lei


Introduce file merging configuration and config FileMergingSnapshotManager.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Flink REST API improvements

2023-06-26 Thread Hong Teoh
Thanks David for the feedback!

> CheckpointStatsCache is also populated using the "cached execution graph,"
> so there is nothing to gain from the "staleness" pov; see
> AbstractCheckpointHandler for more details.


You are right about the CheckpointStatisticsCache. Sorry, I was referring to the 
"caching" done in the CheckpointStatsTracker directly; I'm not sure why I 
confusingly said "CheckpointStatsCache". Let me clarify!

1. At the moment, the only thing that the CheckpointingStatisticsHandler 
requires from the ExecutionGraph is the CheckpointStatsSnapshot object.
2. This ExecutionGraph is the object that is cached, and is not “refreshed” 
when the contents change. This means that the CheckpointStatsSnapshot can be up 
to 3s stale.
3. We could overcome this "staleness" by reducing the cache period of the 
ExecutionGraph; however, this same cached object is used by many other handlers. 
[2] This means reducing the cache period would have the following performance impact:
  - Increased RPC messages (from all handlers)
  - Reconstruction of the entire job graph, which can be expensive for large 
graphs
  - An increased Flink dashboard refresh rate (can be overcome by 
separating out the config)
4. Given the above, we could simplify the internals of 
CheckpointingStatisticsHandler to retrieve just an updated copy of the 
CheckpointStatsSnapshot object from the JobMaster directly. Since there is 
caching in the CheckpointStatsTracker [3], we would only incur additional RPC 
messages, and those will be processed quickly because the tracker's cache is 
invalidated only when a new checkpoint is triggered.

One concern I can think of is the increased RPC message volume, but since each 
request will be resolved quickly, this should be ok.

Let me know what you think! 

[1] 
https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/job/checkpoints/CheckpointingStatisticsHandler.java#L104
[2] 
https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/webmonitor/WebMonitorEndpoint.java#L343-L432
[3] 
https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointStatsTracker.java#L117-L141

> This sounds reasonable as long as it falls back to "web.refresh-interval"
> when not defined. For consistency reasons, it should be also named
> "rest.cache-timeout”

Yep, the fallback sounds good, to maintain backwards compatibility. I reckon we 
could just start with `rest.cache-timeout.default` (for future extensibility; 
for example, we could later have timeouts for different caches, such as 
`rest.cache-timeout.execution-graph` or 
`rest.cache-timeout.checkpoint-statistics`).
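A sketch of that fallback wiring using Flink's ConfigOptions (the new key is
the one proposed above; web.refresh-interval is the existing long-valued
option, in milliseconds, that it would fall back to):

    import org.apache.flink.configuration.ConfigOption;
    import org.apache.flink.configuration.ConfigOptions;

    class RestCacheOptionsSketch {
        /** Proposed option; falls back to the existing dashboard refresh interval (ms). */
        static final ConfigOption<Long> CACHE_TIMEOUT =
            ConfigOptions.key("rest.cache-timeout.default")
                .longType()
                .noDefaultValue()
                .withFallbackKeys("web.refresh-interval");
    }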

> In general, I'd be in favor of this ("rest.cache-timeout" would then need
> to become "rest.default-cache-timeout"), but I need to see a detailed FLIP
> because in my mind this could get quite complicated.

I like that it uses the industry standards, but I agree, we need to think 
carefully about the multiple layers of cache we have included in the Flink JM. 
Will take a look at this.

Let me know your thoughts!

Regards,
Hong




> On 26 Jun 2023, at 13:26, David Morávek  wrote:
> 
> Hi Hong,
> 
> Thanks for starting the discussion.
> 
> seems to be using the cached version of the entire Execution graph (stale
>> data), when it could just use the CheckpointStatsCache directly
> 
> 
> CheckpointStatsCache is also populated using the "cached execution graph,"
> so there is nothing to gain from the "staleness" pov; see
> AbstractCheckpointHandler for more details.
> 
> Anyone aware of a reason we don’t do this already?
>> 
> 
> The CheckpointStatsCache is populated lazily on the request for a
> particular checkpoint (so it might not have a full view); the used data
> structure is also slightly different; one more thing is that
> CheckpointStatsCache is meant for different purpose -> keeping a particular
> checkpoint around while it's being investigated. Otherwise, it might
> expire; using it for "overview" would break this.
> 
> Configuration for web.refresh-interval controls both dashboard refresh rate
>> and ExecutionGraph cache
>> 
> 
> This sounds reasonable as long as it falls back to "web.refresh-interval"
> when not defined. For consistency reasons, it should be also named
> "rest.cache-timeout"
> 
> 
>> Cache-Control on the HTTP headers.
>> 
> 
> In general, I'd be in favor of this ("rest.cache-timeout" would then need
> to become "rest.default-cache-timeout"), but I need to see a detailed FLIP
> because in my mind this could get quite complicated.
> 
> Best,
> D.
> 
> On Fri, Jun 23, 2023 at 6:26 PM Teoh, Hong 
> wrote:
> 
>> Hi all,
>> 
>> I have been looking at the Flink REST API implementation, and had some
>> question on potential improvements. Looking to gather some thoughts:
>> 
>> 1. Only use what is necessary. The GET /checkpoints API seems to be using
>> the cached version of the entire Execution graph (stale d

Re: [DISCUSS] Persistent SQL Gateway

2023-06-26 Thread Ferenc Csaky
Hi Jark,

Thank you for pointing out FLIP-295 about catalog persistence; I was not aware 
of the current state. As far as I can see, though, persistent catalogs are 
necessary but not sufficient for achieving a "persistent gateway".

The current implementation ties the job lifecycle to the SQL Gateway session, 
so if the session gets closed, all jobs are cancelled. Addressing that would be 
the next step, I think. Is there any work or thought regarding this aspect? We 
are definitely willing to help out on this front.

Cheers,
F


--- Original Message ---
On Sunday, June 25th, 2023 at 06:23, Jark Wu  wrote:


>
>
> Hi Ferenc,
>
> Making the SQL Gateway an easy-to-use platform infrastructure for Flink SQL
> is one of the important roadmap items [1].
>
> The persistence ability of the SQL Gateway is a major work item in the 1.18
> release. One persistence demand is that the registered catalogs are currently
> kept in memory and lost when the Gateway restarts. There is an accepted FLIP
> (FLIP-295)[2] targeting this issue, which will let the Gateway persist the
> registered catalog information into files or databases.
>
> I'm not sure whether this is something you are looking for?
>
> Best,
> Jark
>
>
> [1]: https://flink.apache.org/roadmap/#a-unified-sql-platform
> [2]:
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-295%3A+Support+lazy+initialization+of+catalogs+and+persistence+of+catalog+configurations
>
> On Fri, 23 Jun 2023 at 00:25, Ferenc Csaky ferenc.cs...@pm.me.invalid
>
> wrote:
>
> > Hello devs,
> >
> > I would like to open a discussion about persistence possibilities for the
> > SQL Gateway. At Cloudera, we are happy to see the work already done on this
> > project and looking for ways to utilize it on our platform as well, but
> > currently it lacks some features that would be essential in our case, where
> > we could help out.
> >
> > I am not sure if any thought went into gateway persistence specifics
> > already, and this feature could be implemented in fundamentally different
> > ways, so I think the first step could be to agree on the basics.
> >
> > First, in my opinion, persistence should be an optional feature of the
> > gateway, that can be enabled if desired. There can be a lot of
> > implementation details, but there can be some major directions to follow:
> >
> > - Utilize Hive catalog: The Hive catalog can already be used to have
> > persistent meta-objects, so the crucial thing that would be missing in
> > this case is other catalogs. Personally, I would not pursue this option,
> > because in my opinion it would limit the usability of this feature too much.
> > - Serialize the session as is: Saving the whole session (or its context)
> > [1] as is to durable storage, so it can be kept and picked up again.
> > - Serialize the required elements (catalogs, tables, functions, etc.), not
> > necessarily as a whole: The main point here would be to serialize a
> > different object, so the persistent data will not be that sensitive to
> > changes of the session (or its context). There can be numerous factors
> > here, like try to keep the model close to the session itself, so the
> > boilerplate required for the mapping can be kept to minimal, or focus on
> > saving what is actually necessary, making the persistent storage more
> > portable.
> >
> > WDYT?
> >
> > Cheers,
> > F
> >
> > [1]
> > https://github.com/apache/flink/blob/master/flink-table/flink-sql-gateway/src/main/java/org/apache/flink/table/gateway/service/session/Session.java


[jira] [Created] (FLINK-32441) DefaultSchedulerTest#testTriggerCheckpointAndCompletedAfterStore fails with timeout on AZP

2023-06-26 Thread Sergey Nuyanzin (Jira)
Sergey Nuyanzin created FLINK-32441:
---

 Summary: 
DefaultSchedulerTest#testTriggerCheckpointAndCompletedAfterStore fails with 
timeout on AZP
 Key: FLINK-32441
 URL: https://issues.apache.org/jira/browse/FLINK-32441
 Project: Flink
  Issue Type: Bug
  Components: API / Core, Tests
Affects Versions: 1.18.0
Reporter: Sergey Nuyanzin


This build 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=50461&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=24c3384f-1bcb-57b3-224f-51bf973bbee8&l=9274

fails with timeout on 
{{DefaultSchedulerTest#testTriggerCheckpointAndCompletedAfterStore}}





--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32442) DownloadPipelineArtifact fails on AZP

2023-06-26 Thread Sergey Nuyanzin (Jira)
Sergey Nuyanzin created FLINK-32442:
---

 Summary: DownloadPipelineArtifact fails on AZP
 Key: FLINK-32442
 URL: https://issues.apache.org/jira/browse/FLINK-32442
 Project: Flink
  Issue Type: Bug
  Components: Test Infrastructure
Affects Versions: 1.18.0
Reporter: Sergey Nuyanzin


DownloadPipelineArtifact fails on AZP

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=50309&view=logs&j=aa18c3f6-13b8-5f58-86bb-c1cffb239496&t=34dbf679-0f1d-54d2-de92-a83b268b346a&l=11



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] FLIP-316: Introduce SQL Driver

2023-06-26 Thread Jing Ge
Hi Paul,

Thanks for driving it and thank you all for the informative discussion! The
FLIP is in good shape now. As described in the FLIP, SQL Driver will be
mainly used to run Flink SQLs in two scenarios: 1. SQL client/gateway in
application mode and 2. external system integration. Would you like to add
one section to describe (ideally with a script/code example) how to use it in
these two scenarios from the users' perspective?

NIT: the pictures have a transparent background when readers click on them. It
would be great if you could replace them with pictures with a white background.

Best regards,
Jing

On Mon, Jun 26, 2023 at 1:31 PM Paul Lam  wrote:

> Hi Shengkai,
>
> > * How can we ship the json plan to the JobManager?
>
> The Flink K8s module should be responsible for file distribution. We could
> introduce
> an option like `kubernetes.storage.dir`. For each flink cluster, there
> would be a
> dedicated subdirectory, with the pattern like
> `${kubernetes.storage.dir}/${cluster-id}`.
>
> All resource-related options (e.g. pipeline jars, JSON plans) that are
> configured with
> scheme `file://`  would be uploaded to the resource directory
> and downloaded to the
> jobmanager, before SQL Driver accesses the files with the original
> filenames.
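>
> As a rough sketch, the per-cluster directory resolution could look like the
> following (the "kubernetes.storage.dir" option is the proposal above, not an
> existing Flink option):
>
>     // Hypothetical option proposed in this thread.
>     ConfigOption<String> KUBERNETES_STORAGE_DIR =
>         ConfigOptions.key("kubernetes.storage.dir").stringType().noDefaultValue();
>
>     String clusterId = flinkConfig.get(KubernetesConfigOptions.CLUSTER_ID);
>     // Dedicated subdirectory per cluster: ${kubernetes.storage.dir}/${cluster-id}
>     String resourceDir = flinkConfig.get(KUBERNETES_STORAGE_DIR) + "/" + clusterId;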
>
>
> > * Classloading strategy
>
>
> We could directly specify the SQL Gateway jar as the jar file in
> PackagedProgram.
> It would be treated like a normal user jar and the SQL Driver is loaded
> into the user
> classloader. WDYT?
>
> > * Option `$internal.sql-gateway.driver.sql-config` is string type
> > I think it's better to use Map type here
>
> By Map type configuration, do you mean a nested map that contains all
> configurations?
>
> I hope I've explained myself well, it’s a file that contains the extra SQL
> configurations, which would be shipped to the jobmanager.
>
> > * PoC branch
>
> Sure. I’ll let you know once I get the job done.
>
> Best,
> Paul Lam
>
> > On Jun 26, 2023, at 14:27, Shengkai Fang  wrote:
> >
> > Hi, Paul.
> >
> > Thanks for your update. I have a few questions about the new design:
> >
> > * How can we ship the json plan to the JobManager?
> >
> > The current design only exposes an option about the URL of the json
> > plan. It seems the gateway is responsible for uploading it to external storage.
> > Can we reuse PipelineOptions.JARS to ship it to the remote filesystem?
> >
> > * Classloading strategy
> >
> > Currently, the Driver is in the sql-gateway package. It means the Driver
> > is not in the JM's classpath directly, because the sql-gateway jar is now
> > in the opt directory rather than the lib directory. It may need to add the
> > external dependencies as Python does[1]. BTW, I think it's better to move
> > the Driver into the flink-table-runtime package, which is much easier to
> > find (sorry for the wrong opinion before).
> >
> > * Option `$internal.sql-gateway.driver.sql-config` is string type
> >
> > I think it's better to use Map type here
> >
> > * PoC branch
> >
> > Because this FLIP involves many modules, do you have a PoC branch to
> verify it does work?
> >
> > Best,
> > Shengkai
> >
> > [1]
> https://github.com/apache/flink/blob/master/flink-yarn/src/main/java/org/apache/flink/yarn/YarnClusterDescriptor.java#L940
> > Paul Lam  wrote on Mon, Jun 19, 2023 at 14:09:
> > Hi Shengkai,
> >
> > Sorry for my late reply. It took me some time to update the FLIP.
> >
> > In the latest FLIP design, SQL Driver is placed in flink-sql-gateway
> > module. PTAL.
> >
> > The FLIP does not cover details about the K8s file distribution, but its
> > general usage would be very much the same as in YARN setups. We could have
> > follow-up discussions in the Jira tickets.
> >
> > Best,
> > Paul Lam
> >
> >> On Jun 12, 2023, at 15:29, Shengkai Fang  wrote:
> >>
> >>
> >> > If that's the case, I'm good with introducing a new module and making
> >> > SQL Driver an internal class that accepts JSON plans only.
> >>
> >> I rethought this again and again. I think it's better to move the
> >> SqlDriver into the sql-gateway module, because the sql client relies on
> >> the sql-gateway to submit the SQL and the sql-gateway has the ability to
> >> generate the ExecNodeGraph now. +1 to support accepting JSON plans only.
> >>
> >> * Upload configuration through command line parameter
> >>
> >> ExecNodeGraph only contains the job's information, but it doesn't
> >> contain the checkpoint dir, checkpoint interval, execution mode and so on.
> >> So I think we should also upload the configuration.
> >>
> >> * KubernetesClusterDescriptor and KubernetesApplicationClusterEntrypoint
> >> are responsible for the jar upload/download
> >>
> >> +1 for the change.
> >>
> >> Could you update the FLIP about the current discussion?
> >>
> >> Best,
> >> Shengkai
> >>
> >>
> >>
> >>
> >>
> >>
> >> Yang Wang  wrote on Mon, Jun 12, 2023 at 11:41:
> >> Sorry for the

Re: [VOTE] Release flink-connector-jdbc v3.1.1, release candidate #2

2023-06-26 Thread Sergey Nuyanzin
+1 (non-binding)

- verified hashes
- verified signatures
- built from sources
- checked release notes
- reviewed web PR

Non-blocking finding:
  it seems the release date in the web PR should be changed

On Mon, Jun 26, 2023 at 1:34 PM Leonard Xu  wrote:

> +1 (binding)
>
> - built from source code succeeded
> - verified signatures
> - verified hashsums
> - checked release notes
> - checked the contents contains jar and pom files in apache repo
> - reviewed the web PR
>
> Best,
> Leonard
>
> > On Jun 19, 2023, at 6:57 PM, Danny Cranmer 
> wrote:
> >
> > Thanks for driving this Martijn.
> >
> > +1 (binding)
> >
> > - Reviewed web PR
> > - Jira release notes look good
> > - Tag exists in Github
> > - Source archive signature/checksum looks good
> > - Binary (from Maven) signature/checksum looks good
> > - No binaries in the source archive
> > - Source archive builds from source and tests pass
> > - CI passes [1]
> >
> > Non blocking findings:
> > - NOTICE files year is 2022 and needs to be updated to 2023
> > - pom.xml is referencing Flink 1.17.0 and can be updated to 1.17.1
> > - Some unit tests (notably OracleExactlyOnceSinkE2eTest) appear to be
> > integration/e2e and are run in the unit test suite
> >
> > Thanks,
> > Danny
> >
> > [1]
> https://github.com/apache/flink-connector-jdbc/actions/runs/5278297177
> >
> > On Thu, Jun 15, 2023 at 12:40 PM Martijn Visser <
> martijnvis...@apache.org>
> > wrote:
> >
> >> Hi everyone,
> >> Please review and vote on the release candidate #2 for the version
> 3.1.1,
> >> as follows:
> >> [ ] +1, Approve the release
> >> [ ] -1, Do not approve the release (please provide specific comments)
> >>
> >>
> >> The complete staging area is available for your review, which includes:
> >> * JIRA release notes [1],
> >> * the official Apache source release to be deployed to dist.apache.org
> >> [2],
> >> which are signed with the key with fingerprint
> >> A5F3BCE4CBE993573EC5966A65321B8382B219AF [3],
> >> * all artifacts to be deployed to the Maven Central Repository [4],
> >> * source code tag v3.1.1-rc2 [5],
> >> * website pull request listing the new release [6].
> >>
> >> The vote will be open for at least 72 hours. It is adopted by majority
> >> approval, with at least 3 PMC affirmative votes.
> >>
> >> Thanks,
> >> Release Manager
> >>
> >> [1]
> >>
> >>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353281
> >> [2]
> >>
> https://dist.apache.org/repos/dist/dev/flink/flink-connector-jdbc-3.1.1-rc2
> >> [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> >> [4]
> https://repository.apache.org/content/repositories/orgapacheflink-1642
> >> [5] https://github.com/apache/flink-connector-jdbc/releases/tag/v3.1.1-rc2
> >> [6] https://github.com/apache/flink-web/pull/654
> >>
>
>

-- 
Best regards,
Sergey


Re: [DISCUSS] FLIP-321: Introduce an API deprecation process

2023-06-26 Thread Becket Qin
Hi Xintong, Jark and Jing,

Thanks for the reply. Yes, we can only mark the DataStream API as
@Deprecated after the ProcessFunction API is fully functional and mature.

It is a fair point that the condition of marking a @Public API as
deprecated should also be a part of this FLIP. I just added that to the
FLIP wiki. This is probably more of a clarification on the existing
convention, rather than a change.

It looks like we are on the same page now for this FLIP. If so, I'll start
a VOTE thread in two days.

Thanks,

Jiangjie (Becket) Qin

On Mon, Jun 26, 2023 at 8:09 PM Xintong Song  wrote:

> >
> > Considering DataStream API is the most fundamental and complex API of
> > Flink, I think it is worth a longer time than the general process for the
> > deprecation period to wait for the new API to be mature.
> >
>
> This inspires me. In this specific case, compared to how long after
> deprecation the DataStream API should be removed, it's probably more
> important to answer the question of how long it would take the
> ProcessFunction API to become mature and stable after being introduced.
> According to FLIP-197[1], it requires 4
> minor releases by default to promote an @Experimental API to @Public. And
> for ProcessFunction API, which aims to replace DataStream API as one of the
> most fundamental API of Flink, I'd expect this to take at least the default
> time, or even longer. And we probably should wait until we believe
> ProcessFunction API is stable to mark DataStream API as deprecated, rather
> than as soon as it's introduced. Assuming we introduce the ProcessFunction
> API in 2.0, that means we would need to wait for 6 minor releases (4 for
> the new API to become stable, and 2 for the migration period) to remove
> DataStream API, which is ~2.5 years (assuming 5 months / minor release),
> which sounds acceptable for another major version bump.
>
> To wrap things up, it seems to me, sadly, that anyway we cannot avoid the
> overhead for maintaining both DataStream & ProcessFunction APIs for at
> least 6 minor releases.
>
> Best,
>
> Xintong
>
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-197%3A+API+stability+graduation+process
>
>
>
> On Mon, Jun 26, 2023 at 5:41 PM Jing Ge 
> wrote:
>
> > Hi all,
> >
> > Just want to make sure we are on the same page. There is another example[1]
> > I became aware of recently that shows why more factors need to be taken
> > care of than just the migration period. Thanks Galen for your hint.
> >
> > To put it simply, the concern about API deprecation is not that deprecated
> > APIs have been removed too early (a min migration period is required). The
> > major concern is that APIs are marked as deprecated for a (too) long time,
> > much longer than the migration period discussed in this thread, afaik.
> > Since there is no clear picture/definition, no one knows when to do the
> > migration for users (after the migration period has expired) and when to
> > remove deprecated APIs for Flink developers.
> >
> > Based on all the information I knew, there are two kinds of obstacles
> that
> > will and should block the deprecation process:
> >
> > 1. Lack of functionalities in new APIs. It happens e.g. with the
> > SourceFunction to FLIP-27 Source migration. Users who rely on those
> > functions can not migrate to new APIs.
> > 2. new APIs have critical bugs. An example could be found at [1]. Users
> > have to stick to the deprecated APIs.
> >
> > Since FLIP-321 is focusing on the API deprecation process, those blocking
> > issues deserve attention and should be put into the FLIP. The current FLIP
> > seems to only focus on migration periods. If we consider those blocking
> > issues as orthogonal issues that are beyond the scope of this discussion,
> > does it make sense to change the FLIP title to something like "Introduce
> > minimum migration periods of API deprecation process"?
> >
> > Best regards,
> > Jing
> >
> > [1] https://lists.apache.org/thread/wxoo7py5pqqlz37l4w8jrq6qdvsdq5wc
> >
> > On Sun, Jun 25, 2023 at 2:01 PM Jark Wu  wrote:
> >
> > > I agree with Jingsong and Becket.
> > >
> > > Look at the legacy SourceFunction (a small part of the DataStream API):
> > > the SourceFunction still is not and can't be marked deprecated[1] even
> > > now, after the new Source was released 2 years ago, because the new
> > > Source still can't fully cover the abilities of the legacy API.
> > > Considering DataStream API is the most fundamental and complex API of
> > > Flink, I think it is worth a longer time than the general process for
> > > the deprecation period to wait for the new API to be mature. The above 2
> > > options sound a bit rushed for such a widely used API.
> > >
> > > I fully understand the concern of maintenance overhead, but it's a bit
> > > hard for others to estimate maintenance costs without a concrete design
> > > and code of the new ProcessFunction API. I agree with Becket that maybe
> > > we can re-evaluate the API deprecation process once we have t

Re: [DISCUSS] Persistent SQL Gateway

2023-06-26 Thread Jark Wu
Hi Ferenc,

But the job lifecycle is not tied to the SQL Gateway session.
Even if the session is closed, the running jobs are not affected.

Best,
Jark




On Tue, 27 Jun 2023 at 04:14, Ferenc Csaky 
wrote:

> Hi Jark,
>
> Thank you for pointing out FLIP-295 about catalog persistence, I was not
> aware of the current state. Although as far as I can see, persistent
> catalogs are necessary, but not sufficient to achieve a "persistent gateway".
>
> The current implementation ties the job lifecycle to the SQL gateway
> session, so if it gets closed, it will cancel all the jobs. So that would
> be the next step I think. Any work or thought regarding this aspect? We are
> definitely willing to help out on this front.
>
> Cheers,
> F
>
>
> --- Original Message ---
> On Sunday, June 25th, 2023 at 06:23, Jark Wu  wrote:
>
>
> >
> >
> > Hi Ferenc,
> >
> > Making SQL Gateway to be an easy-to-use platform infrastructure of Flink
> > SQL
> > is one of the important roadmaps [1].
> >
> > The persistence ability of the SQL Gateway is a major work item in the
> > 1.18 release. One of the persistence demands is that the registered
> > catalogs are currently kept in memory and lost when the Gateway restarts.
> > There is an accepted FLIP (FLIP-295)[2] targeting this issue, which will
> > let the Gateway persist the registered catalog information into files or
> > databases.
> >
> > I'm not sure whether this is something you are looking for?
> >
> > Best,
> > Jark
> >
> >
> > [1]: https://flink.apache.org/roadmap/#a-unified-sql-platform
> > [2]:
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-295%3A+Support+lazy+initialization+of+catalogs+and+persistence+of+catalog+configurations
> >
> > On Fri, 23 Jun 2023 at 00:25, Ferenc Csaky ferenc.cs...@pm.me.invalid
> >
> > wrote:
> >
> > > Hello devs,
> > >
> > > I would like to open a discussion about persistence possibilities for the
> > > SQL Gateway. At Cloudera, we are happy to see the work already done on
> this
> > > project and looking for ways to utilize it on our platform as well, but
> > > currently it lacks some features that would be essential in our case,
> where
> > > we could help out.
> > >
> > > I am not sure if any thought went into gateway persistence specifics
> > > already, and this feature could be implemented in fundamentally different
> > > ways, so I think the first step could be to agree on the basics.
> > >
> > > First, in my opinion, persistence should be an optional feature of the
> > > gateway, that can be enabled if desired. There can be a lot of
> > > implementation details, but there can be some major directions to
> follow:
> > >
> > > - Utilize Hive catalog: The Hive catalog can already be used to have
> > > persistent meta-objects, so the crucial thing that would be missing in
> > > this case is other catalogs. Personally, I would not pursue this
> option,
> > > because in my opinion it would limit the usability of this feature too
> much.
> > > - Serialize the session as is: Saving the whole session (or its
> context)
> > > [1] as is to durable storage, so it can be kept and picked up again.
> > > - Serialize the required elements (catalogs, tables, functions, etc.),
> not
> > > necessarily as a whole: The main point here would be to serialize a
> > > different object, so the persistent data will not be that sensitive to
> > > changes of the session (or its context). There can be numerous factors
> > > here, like try to keep the model close to the session itself, so the
> > > boilerplate required for the mapping can be kept to minimal, or focus
> on
> > > saving what is actually necessary, making the persistent storage more
> > > portable.
> > >
> > > WDYT?
> > >
> > > Cheers,
> > > F
> > >
> > > [1]
> > >
> https://github.com/apache/flink/blob/master/flink-table/flink-sql-gateway/src/main/java/org/apache/flink/table/gateway/service/session/Session.java
>


Re: [DISCUSS] Flexible Custom Commit Policy with Parameter Passing

2023-06-26 Thread yuxia
I did another round of review and left comments on your PR.

But let's keep discussing the custom commit policy with parameters itself in
this thread, instead of talking about the PR.

Best regards,
Yuxia

- Original Message -
From: "Dunn Bangui" 
To: "dev" 
Sent: Monday, June 26, 2023, 5:32:17 PM
Subject: Re: [DISCUSS] Flexible Custom Commit Policy with Parameter Passing

Hi yuxia,

Thank you for reviewing the code.

Specifically, when users set
*'sink.partition-commit.policy.class.parameters=''*, the 'parameters' will
not be null but will be an empty list.
In this case, it may still invoke the constructor with empty arguments.

To address this, maybe we can use the condition *'parameters != null &&
!parameters.isEmpty()'* to ensure that the 'parameters' list is not empty.
What are your thoughts on this matter? I would appreciate your thoughts. : )
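A minimal sketch of what that guard could look like in the factory (the
reflective construction below is illustrative; the helper name and the
(String...) constructor contract are assumptions, not the exact PR code):

    import java.util.List;
    import org.apache.flink.connector.file.table.PartitionCommitPolicy;

    static PartitionCommitPolicy instantiate(
            Class<? extends PartitionCommitPolicy> clazz, List<String> parameters)
            throws Exception {
        if (parameters != null && !parameters.isEmpty()) {
            // Hand the configured values to the parameterized (String...) constructor.
            return clazz
                    .getConstructor(String[].class)
                    .newInstance((Object) parameters.toArray(new String[0]));
        }
        // Backward compatible: fall back to the no-arg constructor.
        return clazz.getDeclaredConstructor().newInstance();
    }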

Best regards,
Bangui Dunn

yuxia  wrote on Sun, Jun 25, 2023 at 11:07:

> Hi, Bangui Dunn
> Review done.
> But please remember we should reach consensus about it in the discussion
> before we can merge it.
> Let's keep the discussion open for a while to see if there is any further
> feedback.
>
> Best regards,
> Yuxia
>
> - Original Message -
> From: "Dunn Bangui" 
> To: "dev" 
> Sent: Sunday, June 25, 2023, 10:34:50 AM
> Subject: Re: [DISCUSS] Flexible Custom Commit Policy with Parameter Passing
>
> Hi, Yuxia.
>
> Could you please provide me with an update on the review process? I am
> eager to move forward and would appreciate any guidance or feedback you can
> provide.
>
> Best regards, late Dragon Boat Festival blessing : )
> Bangui Dunn
>
Gary Ccc  wrote on Wed, Jun 21, 2023 at 14:44:
>
> > Hi, Yuxia.
> >
> > Thank you for your suggestion! I agree with your point and have made the
> > necessary modification.
> > If you have any additional feedback or further suggestions, please feel
> > free to let me know. : )
> >
> > Best regards,
> > Bangui Dunn
> >
> > yuxia  wrote on Wed, Jun 21, 2023 at 10:56:
> >
> >> Correct what I said in the previous email:
> >>
> >> "But will it better to make it asList ? Something
> >> like:`.stringType().asList()`."  =>  "But will it better to make it no
> >> default value? Something like:`.stringType().asList().noDefaultValue`."
> >>
> >> Best regards,
> >> Yuxia
> >>
> >> - Original Message -
> >> From: "yuxia" 
> >> To: "dev" 
> >> Sent: Wednesday, June 21, 2023, 10:25:47 AM
> >> Subject: Re: [DISCUSS] Flexible Custom Commit Policy with Parameter Passing
> >>
> >> Hi, Bangui Dunn.
> >> Thanks for reaching out to us.
> >> Generally +1 for the configuration. But would it be better to make it a
> >> list? Something like:
> >> `.stringType().asList()`.
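> >>
> >> Folding in the noDefaultValue correction above, the declaration could look
> >> like this (a sketch, not the final PR code):
> >>
> >>     public static final ConfigOption<List<String>> POLICY_CLASS_PARAMETERS =
> >>         ConfigOptions.key("sink.partition-commit.policy.class.parameters")
> >>             .stringType()
> >>             .asList()
> >>             .noDefaultValue()
> >>             .withDescription("Parameters passed to the custom commit policy.");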
> >>
> >> Best regards,
> >> Yuxia
> >>
> >> - Original Message -
> >> From: "Gary Ccc" 
> >> To: "dev" 
> >> Sent: Tuesday, June 20, 2023, 5:44:10 PM
> >> Subject: [DISCUSS] Flexible Custom Commit Policy with Parameter Passing
> >>
> >> To the Apache Flink Community Members,
> >> I hope this email finds you well. I am writing to discuss a potential
> >> improvement for the implementation of a custom commit policy in Flink.
> >>
> >> Background:
> >> We have encountered challenges in utilizing a custom commit policy due
> to
> >> the inability to pass parameters.
> >> This limitation restricts our ability to add additional functionality to
> >> the commit policy, such as monitoring the files associated with each
> >> commit.
> >>
> >> Purpose:
> >> The purpose of this improvement is to allow the passing of parameters to
> >> the custom PartitionCommitPolicy. By enabling parameter passing, users
> can
> >> extend the functionality of their custom commit policy.
> >>
> >> Example:
> >> Suppose we have a custom commit policy called "MyPolicy" that requires
> >> parameters such as "key" and "url" for proper functionality.
> >> Currently, it is not possible to pass these parameters when using a
> custom
> >> commit policy.
> >> However, by introducing the concept of
> >> SINK_PARTITION_COMMIT_POLICY_CLASS_PARAMETERS, users can now pass
> >> parameters in the following way:
> >>
> >> By adding SINK_PARTITION_COMMIT_POLICY_CLASS_PARAMETERS, you can pass
> >> parameters when using a custom commit policy, for example
> >> 'sink.partition-commit.policy.kind'='custom',
> >> 'sink.partition-commit.policy.class'='MyPolicy',
> >> 'sink.partition-commit.policy.class.parameters'='key;url'
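> >>
> >> For illustration, the policy receiving those values might look roughly like
> >> this (a sketch; the one-String-argument-per-parameter constructor contract
> >> is an assumption based on this proposal):
> >>
> >>     import org.apache.flink.connector.file.table.PartitionCommitPolicy;
> >>
> >>     public class MyPolicy implements PartitionCommitPolicy {
> >>         private final String key;
> >>         private final String url;
> >>
> >>         // Assumed: the factory passes 'key;url' as two String arguments.
> >>         public MyPolicy(String key, String url) {
> >>             this.key = key;
> >>             this.url = url;
> >>         }
> >>
> >>         @Override
> >>         public void commit(Context context) throws Exception {
> >>             // e.g. report the committed partition to a monitor behind `url`
> >>         }
> >>     }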
> >>
> >> Effect:
> >> A new PartitionCommitPolicyFactory constructor is added to ensure backward
> >> compatibility for existing user programs.
> >>
> >> Code PR:
> >> To support this improvement, I have submitted a pull request with the
> >> necessary code changes.
> >> You can find the details of the pull request at the following link:
> >> https://github.com/apache/flink/pull/22831/files
> >>
> >> Best regards,
> >> Bangui Dunn
> >>
> >
>


Re:Re: [DISCUSS] FLIP-303: Support REPLACE TABLE AS SELECT statement

2023-06-26 Thread Mang Zhang
Hi yuxia,


+1 for this new feature.
In particular, the CREATE OR REPLACE TABLE syntax is more usable and faster for 
users.




--

Best regards,
Mang Zhang





At 2023-06-26 09:46:40, "yuxia"  wrote:
>Hi, folks.
>To save the time of reviewers, I would like to summarize the main changes of 
>this FLIP[1]. The FLIP just introduces the REPLACE TABLE AS SELECT statement, 
>which is very similar to the CREATE TABLE AS SELECT statement, and a syntax 
>CREATE OR REPLACE TABLE AS to wrap both. This FLIP tries to complete this 
>family of statements. 
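>
>For readers new to the syntax, a minimal sketch of the two statements via the 
>Table API (table and column names are made up for illustration):
>
>  // e.g. on a TableEnvironment `tEnv`:
>  tEnv.executeSql("REPLACE TABLE t AS SELECT id, name FROM src");
>  tEnv.executeSql("CREATE OR REPLACE TABLE t AS SELECT id, name FROM src");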
>
>The changes are as follows:
>1: Add enum REPLACE_TABLE_AS, CREATE_OR_REPLACE_TABLE_AS in StagingPurpose 
>which is proposed in FLIP-305[2].
>
>2: Change the configuration from `table.ctas.atomicity-enabled` proposed in 
>FLIP-305[2] to `table.rtas-ctas.atomicity-enabled`, so that it takes effect not 
>only for CREATE TABLE AS, but also for REPLACE TABLE AS and CREATE OR REPLACE 
>TABLE AS. The main reason is that these statements are almost the same and 
>belong to the same statement family, and I would not like to introduce a new, 
>different configuration which actually does the same thing. Also, IIRC, the 
>offline discussion about FLIP-218[1] also wanted to introduce 
>`table.rtas-ctas.atomicity-enabled`, but as FLIP-218 only supports CTAS, it was 
>not suitable to introduce a configuration implying RTAS, which was not 
>supported then. So, we changed the configuration to 
>`table.ctas.atomicity-enabled`. Since CTAS is now supported, I think it's 
>reasonable to revisit this and introduce `table.rtas-ctas.atomicity-enabled` 
>to unify them in this FLIP for supporting the REPLACE TABLE AS statement.
>
>
>Again, look forward to your feedback.
>
>[1] 
>https://cwiki.apache.org/confluence/display/FLINK/FLIP-303%3A+Support+REPLACE+TABLE+AS+SELECT+statement
> 
>[2] 
>https://cwiki.apache.org/confluence/display/FLINK/FLIP-305%3A+Support+atomic+for+CREATE+TABLE+AS+SELECT%28CTAS%29+statement
> 
>[3] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=199541185 
>
>Best regards,
>Yuxia
>
>- 原始邮件 -
>发件人: "yuxia" 
>收件人: "dev" 
>发送时间: 星期四, 2023年 6 月 15日 下午 7:58:27
>主题: [DISCUSS] FLIP-303: Support REPLACE TABLE AS SELECT statement
>
>Hi, devs. 
>As FLIP-218[1] & FLIP-305[2], which add support for the CREATE TABLE AS 
>SELECT statement to Flink, have been accepted, 
>I would like to start a discussion about FLIP-303: Support REPLACE TABLE AS 
>SELECT statement[3] to complete this family of statements. 
>With the REPLACE TABLE AS SELECT statement, users won't need to drop the table 
>first and then use CREATE TABLE AS SELECT. Since the statement is very similar 
>to the CREATE TABLE AS statement, the design closely follows FLIP-218[1] & 
>FLIP-305[2], apart from some parts specific to the REPLACE TABLE AS SELECT 
>statement. 
>Just a kind reminder: to understand this FLIP better, you may need to read 
>FLIP-218[1] & FLIP-305[2] to get more context. 
>
>Look forward to your feedback. 
>
>[1]: 
>https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=199541185 
>[2]: 
>https://cwiki.apache.org/confluence/display/FLINK/FLIP-305%3A+Support+atomic+for+CREATE+TABLE+AS+SELECT%28CTAS%29+statement
> 
>[3]: 
>https://cwiki.apache.org/confluence/display/FLINK/FLIP-303%3A+Support+REPLACE+TABLE+AS+SELECT+statement
> 
>
>:) I just noticed I missed "[DISCUSS]" in the title of the previous email [4], 
>so I am sending it again here with the correct email title. Please ignore the 
>previous email and discuss in this thread. 
>Sorry for the noise. 
>
>[4]: https://lists.apache.org/thread/jy39xwxn1o2035y5411xynwtbyfgg76t 
>
>
>Best regards, 
>Yuxia


[jira] [Created] (FLINK-32443) Translate "State Processor API" page into Chinese

2023-06-26 Thread Yanfei Lei (Jira)
Yanfei Lei created FLINK-32443:
--

 Summary: Translate "State Processor API" page into Chinese
 Key: FLINK-32443
 URL: https://issues.apache.org/jira/browse/FLINK-32443
 Project: Flink
  Issue Type: Improvement
  Components: API / State Processor, chinese-translation, Documentation
Affects Versions: 1.18.0
Reporter: Yanfei Lei


The page URL is 
[https://nightlies.apache.org/flink/flink-docs-release-1.17/zh/docs/libs/state_processor_api/]

The markdown file is located in docs/content.zh/docs/libs/state_processor_api.md



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32444) Enable object reuse for Flink SQL jobs by default

2023-06-26 Thread Jark Wu (Jira)
Jark Wu created FLINK-32444:
---

 Summary: Enable object reuse for Flink SQL jobs by default
 Key: FLINK-32444
 URL: https://issues.apache.org/jira/browse/FLINK-32444
 Project: Flink
  Issue Type: New Feature
  Components: Table SQL / API
Reporter: Jark Wu
 Fix For: 1.18.0


Currently, object reuse is not enabled by default for Flink streaming jobs, but 
is enabled by default for Flink batch jobs. That is inconsistent with 
stream-batch unification. Besides, SQL operators are safe to run with object 
reuse, and enabling it is a great performance improvement for SQL jobs. 

We should also be careful with the Table-DataStream conversion case 
(StreamTableEnvironment), for which it is not safe to enable object reuse by 
default. Maybe we can just enable it for the SQL Client/Gateway and 
TableEnvironment. 
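For reference, this is how object reuse is enabled explicitly today (a sketch; 
the ticket is about flipping the default for SQL jobs, not about a new API):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.getConfig().enableObjectReuse();
    // or, equivalently, in the configuration: pipeline.object-reuse: true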



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] FLIP-303: Support REPLACE TABLE AS SELECT statement

2023-06-26 Thread yuxia
Hi, all. 
Thanks for the feedback. 

If there are no other questions or concerns for the FLIP[1], I'd like to start 
the vote tomorrow (6.28). 

[1] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-303%3A+Support+REPLACE+TABLE+AS+SELECT+statement
 

Best regards, 
Yuxia 


发件人: "zhangmang1"  
收件人: "dev" , luoyu...@alumni.sjtu.edu.cn 
发送时间: 星期二, 2023年 6 月 27日 下午 12:03:35 
主题: Re:Re: [DISCUSS] FLIP-303: Support REPLACE TABLE AS SELECT statement 

Hi yuxia, 

+1 for this new feature. 
In particular, the CREATE OR REPLACE TABLE syntax is more usable and faster for 
users. 





-- 
Best regards, 
Mang Zhang 




At 2023-06-26 09:46:40, "yuxia"  wrote:
>Hi, folks.
>To save the time of reviewers, I would like to summarize the main changes of 
>this FLIP[1]. The FLIP just introduces the REPLACE TABLE AS SELECT statement, 
>which is very similar to the CREATE TABLE AS SELECT statement, and a syntax 
>CREATE OR REPLACE TABLE AS to wrap both. This FLIP tries to complete this 
>family of statements. 
>
>The changes are as follows:
>1: Add enum REPLACE_TABLE_AS, CREATE_OR_REPLACE_TABLE_AS in StagingPurpose 
>which is proposed in FLIP-305[2].
>
>2: Change the configuration from `table.ctas.atomicity-enabled` proposed in 
>FLIP-305[2] to `table.rtas-ctas.atomicity-enabled`, so that it takes effect not 
>only for CREATE TABLE AS, but also for REPLACE TABLE AS and CREATE OR REPLACE 
>TABLE AS. The main reason is that these statements are almost the same and 
>belong to the same statement family, and I would not like to introduce a new, 
>different configuration which actually does the same thing. Also, IIRC, the 
>offline discussion about FLIP-218[1] also wanted to introduce 
>`table.rtas-ctas.atomicity-enabled`, but as FLIP-218 only supports CTAS, it was 
>not suitable to introduce a configuration implying RTAS, which was not 
>supported then. So, we changed the configuration to 
>`table.ctas.atomicity-enabled`. Since CTAS is now supported, I think it's 
>reasonable to revisit this and introduce `table.rtas-ctas.atomicity-enabled` 
>to unify them in this FLIP for supporting the REPLACE TABLE AS statement.
>
>
>Again, look forward to your feedback.
>
>[1] 
>https://cwiki.apache.org/confluence/display/FLINK/FLIP-303%3A+Support+REPLACE+TABLE+AS+SELECT+statement
> 
>[2] 
>https://cwiki.apache.org/confluence/display/FLINK/FLIP-305%3A+Support+atomic+for+CREATE+TABLE+AS+SELECT%28CTAS%29+statement
> 
>[3] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=199541185 
>
>Best regards,
>Yuxia
>
>- Original Message -
>From: "yuxia" 
>To: "dev" 
>Sent: Thursday, June 15, 2023, 7:58:27 PM
>Subject: [DISCUSS] FLIP-303: Support REPLACE TABLE AS SELECT statement
>
>Hi, devs. 
>As FLIP-218[1] & FLIP-305[2], which add support for the CREATE TABLE AS 
>SELECT statement to Flink, have been accepted, 
>I would like to start a discussion about FLIP-303: Support REPLACE TABLE AS 
>SELECT statement[3] to complete this family of statements. 
>With the REPLACE TABLE AS SELECT statement, users won't need to drop the table 
>first and then use CREATE TABLE AS SELECT. Since the statement is very similar 
>to the CREATE TABLE AS statement, the design closely follows FLIP-218[1] & 
>FLIP-305[2], apart from some parts specific to the REPLACE TABLE AS SELECT 
>statement. 
>Just a kind reminder: to understand this FLIP better, you may need to read 
>FLIP-218[1] & FLIP-305[2] to get more context. 
>
>Look forward to your feedback. 
>
>[1]: 
>https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=199541185 
>[2]: 
>https://cwiki.apache.org/confluence/display/FLINK/FLIP-305%3A+Support+atomic+for+CREATE+TABLE+AS+SELECT%28CTAS%29+statement
> 
>[3]: 
>https://cwiki.apache.org/confluence/display/FLINK/FLIP-303%3A+Support+REPLACE+TABLE+AS+SELECT+statement
> 
>
>:) I just noticed I missed "[DISCUSS]" in the title of the previous email [4], 
>so I am sending it again here with the correct email title. Please ignore the 
>previous email and discuss in this thread. 
>Sorry for the noise. 
>
>[4]: https://lists.apache.org/thread/jy39xwxn1o2035y5411xynwtbyfgg76t 
>
>
>Best regards, 
>Yuxia 



Re: [DISCUSS] FLIP-309: Enable operators to trigger checkpoints dynamically

2023-06-26 Thread Dong Lin
Thank you Leonard for the review!

Hi Piotr, do you have any comments on the latest proposal?

I am wondering if it is OK to start the voting thread this week.

On Mon, Jun 26, 2023 at 4:10 PM Leonard Xu  wrote:

> Thanks Dong for driving this FLIP forward!
>
> Introducing the `backlog status` concept for Flink jobs makes sense to me for
> the following reasons:
>
> From a concept/API design perspective, it's more general and natural than the
> above proposals, as it can be used in HybridSource for bounded records, CDC
> sources for the historical snapshot, and general sources like KafkaSource for
> historical messages.
>
> From the use cases/requirements side, I've seen many users manually set a
> larger checkpoint interval during backfilling and then a shorter checkpoint
> interval for real-time processing in their production environments, as a
> Flink application optimization. Now the Flink framework can apply this
> optimization without requiring the user to reset the checkpoint interval and
> restart the job multiple times.
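>
> As a sketch of that manual optimization (the first option below is an existing
> Flink option; the "during-backlog" variant is hypothetical, roughly what this
> FLIP could enable):
>
>     Configuration conf = new Configuration();
>     // what users flip manually today between backfill and real-time phases:
>     conf.setString("execution.checkpointing.interval", "30 s");
>     // hypothetical companion option once backlog status exists:
>     conf.setString("execution.checkpointing.interval-during-backlog", "10 min");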
>
> Following the support for a larger checkpoint interval for jobs under backlog
> status in the current FLIP, we can explore supporting larger
> parallelism/memory/CPU for jobs under backlog status in the future.
>
> In short, the updated FLIP looks good to me.
>
>
> Best,
> Leonard
>
>
> > On Jun 22, 2023, at 12:07 PM, Dong Lin  wrote:
> >
> > Hi Piotr,
> >
> > Thanks again for proposing the isProcessingBacklog concept.
> >
> > After discussing with Becket Qin and thinking about this more, I agree it
> > is a better idea to add a top-level concept to all source operators to
> > address the target use-case.
> >
> > The main reason that changed my mind is that isProcessingBacklog can be
> > described as an inherent/natural attribute of every source instance, and
> > its semantics do not need to depend on any specific checkpointing policy.
> > Also, we can hardcode the isProcessingBacklog behavior for the sources we
> > have considered so far (e.g. HybridSource and MySQL CDC source) without
> > asking users to explicitly configure the per-source behavior, which indeed
> > provides a better user experience.
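> >
> > As a sketch of what that hardcoding could look like inside a source's split
> > enumerator (the setter name is my assumption about the eventual API, not
> > something fixed in this thread):
> >
> >     // e.g. when a HybridSource switches from the bounded (backfill) source
> >     // to the unbounded (real-time) one:
> >     enumeratorContext.setIsProcessingBacklog(true);   // backfill phase
> >     enumeratorContext.setIsProcessingBacklog(false);  // caught up to real-time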
> >
> > I have updated the FLIP based on the latest suggestions. The latest FLIP no
> > longer introduces a per-source config that can be used by end-users. While
> > I agree with you that CheckpointTrigger can be a useful feature to address
> > additional use-cases, I am not sure it is necessary for the use-case
> > targeted by FLIP-309. Maybe we can introduce CheckpointTrigger separately
> > in another FLIP?
> >
> > Can you help take another look at the updated FLIP?
> >
> > Best,
> > Dong
> >
> >
> >
> > On Fri, Jun 16, 2023 at 11:59 PM Piotr Nowojski 
> > wrote:
> >
> >> Hi Dong,
> >>
> >>> Suppose there are 1000 subtasks and each subtask has a 1% chance of being
> >>> "backpressured" at a given time (due to random traffic spikes). Then at
> >>> any given time, the chance of the job
> >>> being considered not-backpressured = (1-0.01)^1000. Since we evaluate the
> >>> backpressure metric once a second, the estimated time for the job
> >>> to be considered not-backpressured is roughly 1 / ((1-0.01)^1000) = 23163
> >>> sec = 6.4 hours.
> >>>
> >>> This means that the job will effectively always use the longer
> >>> checkpointing interval. It looks like a real concern, right?
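> >>>
> >>> (A quick numeric check of the estimate above, as a Java sketch:)
> >>>
> >>>     double pAllClear = Math.pow(1 - 0.01, 1000); // ~= 4.3e-5
> >>>     double seconds = 1.0 / pAllClear;            // ~= 23163 s ~= 6.4 h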
> >>
> >> Sorry I don't understand where you are getting those numbers from.
> >> Instead of trying to find loophole after loophole, could you try to think
> >> how a given loophole could be improved/solved?
> >>
> >>> Hmm... I honestly think it will be useful to know the APIs due to the
> >>> following reasons.
> >>
> >> Please propose something. I don't think it's needed.
> >>
> >>> - For the use-case mentioned in the FLIP-309 motivation section, would
> >>> the APIs of this alternative approach be more or less usable?
> >>
> >> Everything that you originally wanted to achieve in FLIP-309, you could do
> >> as well in my proposal.
> >> Vide my many mentions of the "hacky solution".
> >>
> >>> - Can these APIs reliably address the extra use-case (e.g. allow
> >>> checkpointing interval to change dynamically even during the unbounded
> >>> phase) as it claims?
> >>
> >> I don't see why not.
> >>
> >>> - Can these APIs be decoupled from the APIs currently proposed in
> >>> FLIP-309?
> >>
> >> Yes
> >>
> >>> For example, if the APIs of this alternative approach can be decoupled from
> >>> the APIs currently proposed in FLIP-309, then it might be reasonable to
> >>> work on this extra use-case with a more advanced/complicated design
> >>> separately in a followup work.
> >>
> >> As I voiced my concerns previously, the current design of FLIP-309 would
> >> clog the public API and in the long run confuse the users. IMO it's
> >> addressing the problem in the wrong place.
> >>
> >>> Hmm.. do you mean we can do the following:
> >>> - Have all source operators emit a metric named "processingBacklog".
> >>> - Add a job-level config that sp