Thanks for the write-up, Yanfei. It sounds reasonable to me, too. Moving
this into a dedicated discussion thread with the wiki article as the base
for that discussion makes sense. There is already a performance benchmark
article which could be extended [1]. We're working on documentation about
Flink release management [2] and what's expected from the release managers.
We're planning to finalize this documentation with 1.17. This one could be
extended later on as well if we decide to make monitoring the performance
tests a responsibility of the release managers.

Thanks for volunteering to help us with the Flink 1.17 release. Feel free
to join the weekly calls every Tuesday at 9am CET. We will see whether
that's a reasonable approach to address performance regressions.

Best,
Matthias

[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=115511847
[2]
https://cwiki.apache.org/confluence/display/FLINK/Flink+Release+Management

On Tue, Jan 17, 2023 at 2:53 PM Martijn Visser <martijnvis...@apache.org>
wrote:

> Hi Yanfei,
>
> Thanks for the proposal! Like Yuan mentioned, let's start a new discussion
> thread to get a clean discussion of your proposal, but it already sounds
> good to me.
>
> Best regards,
>
> Martijn
>
> On Tue, Jan 17, 2023 at 10:41 AM Yuan Mei <yuanmei.w...@gmail.com> wrote:
>
> > Hey Yanfei,
> >
> > Thanks so much for the efforts driving the whole process. It's great to
> > see that the performance benchmarks are indeed useful to help find
> > regressions.
> >
> > 1. Regarding the procedure of how to use and understand the notification
> > reported from the slack channel #flink-dev-benchmarks, the instructions
> > read reasonably to me, and we can iterate over it gradually. Once you've
> > done the wiki change, please ping me and I can help review it.
> >
> > 2. It also sounds reasonable to me to incorporate the
> > performance-watching procedure into the release managers' daily/weekly
> > monitoring. But since it involves a change to the standard release
> > routine, we need to discuss and vote on the change.
> >
> > My suggestion is to start a new discussion thread for the instructions
> > and the proposed change so that more people are aware of the proposal
> > and can join the discussion (this is an announcement thread :-)).
> >
> >
> > Best
> > Yuan
> >
> > On Mon, Jan 16, 2023 at 4:52 PM Qingsheng Ren <renqs...@gmail.com>
> wrote:
> >
> > > Thanks for making this detailed guide, Yanfei! This is quite helpful
> > > for release managers to monitor and manage performance regressions.
> > >
> > > I think it would be great to also document the threshold for alerts
> > > sent to the Slack channel, and the related formulas used in the test,
> > > either in the wiki page or in the README of the flink-benchmarks repo.
> > > This could help other maintainers interpret the results.
> > >
> > > We can also add this to the release managers' daily monitoring, similar
> > > to CI instabilities. We can start operating with the process proposed
> > > by Yanfei and complete it gradually once we find something to add.
> > >
> > > Best regards,
> > > Qingsheng
> > >
> > > On Mon, Jan 16, 2023 at 12:08 PM Yanfei Lei <fredia...@gmail.com>
> wrote:
> > >
> > > > Hi devs,
> > > >
> > > > Flink benchmarks are periodically executed on
> > > > http://codespeed.dak8s.net:8080 to monitor Flink performance. In
> > > > late Oct'22, a new slack channel #flink-dev-benchmarks was created for
> > > > notifications of performance regressions. It helped us find 2 build
> > > > failures[1,2] and 5 performance regressions[3,4,5,6,7] in the past 3
> > > > months, which is very valuable for ensuring the quality of the code.
> > > > I am checking the slack notifications once a week now. If more
> > > > people join the monitoring, we can check once a day in the future
> > > > to find regressions in a timely manner.
> > > >
> > > > Based on input from some contributors and my own experience, I have
> > > > drafted a document on how to handle performance regressions. The
> > > > following is just a draft, which can be iterated on and improved
> > > > later.
> > > >
> > > > When a benchmark regression is detected, the following steps help
> > > > to deal with it:
> > > >
> > > > 1. Create a Jira ticket (one per group of related benchmarks). Set
> > > > the affects and fix versions to the current Flink version,
> > > > component=Benchmarks, type=Bug.
> > > >
> > > > 2. Post the ticket in the slack channel (replying in a thread).
> > > >
> > > > 3. Verify that the regression is real and investigate the cause. Take
> > > > FLINK-30623[5] as an example:
> > > >
> > > >     3.1 Inspect the timeline by following the link from the
> > > > notification:
> > > > http://codespeed.dak8s.net:8000/timeline/#/?exe=1&ben=checkpointSingleInput.UNALIGNED&extr=on&quarts=on&equid=off&env=2&revs=200
> > > > A suspicious commit range can be read off the figure. In this
> > > > example, the suspicious range is 13ef498172b...fb272D2cdebf.
> > > >
> > > >     3.2 Narrow down the commit range via git log. You can either
> > > > locate a specific commit directly based on experience, or compare the
> > > > benchmark results of each commit in this range. If the regression is
> > > > real, a responsible commit will be found. See the instructions for
> > > > using benchmark-request; you can also try to benchmark locally. The
> > > > http://codespeed.dak8s.net:8080 benchmarking infrastructure is hosted
> > > > using resources provided by Ververica(Alibaba) and maintained by PMCs
> > > > and Ververica. Please contact one of the Apache Flink PMC members to
> > > > get access. For example, two benchmark requests were submitted to
> > > > verify whether FLINK-30533 caused the regression.
> > > >
> > > > > Before FLINK-30533:
> > > > http://codespeed.dak8s.net:8080/job/flink-benchmark-request/177
> > > > >
> > > > > - checkpointSingleInput.UNALIGNED: 333.635178(+-8.169488)
> > > > >
> > > > > - checkpointSingleInput.UNALIGNED_1: 213.837107(+-7.282883)
> > > > >
> > > > > After FLINK-30533:
> > > > http://codespeed.dak8s.net:8080/job/flink-benchmark-request/178
> > > > >
> > > > > - checkpointSingleInput.UNALIGNED: 61.536982(+-3.581509)
> > > > >
> > > > > - checkpointSingleInput.UNALIGNED_1: 38.207438(+-2.937051)
> > > >
> > > >     3.3 Changes in flink-benchmarks[8] may also cause a regression,
> > > > so don't forget to check whether flink-benchmarks has changed
> > > > recently.
> > > >
> > > >     3.4 If a regression cannot be reproduced reliably, for example
> > > > because it was caused by an error in the results or by issues with the
> > > > physical machines (like FLINK-18614[9]), the regression is not real.
> > > >
> > > > 4. If the regression is real, post the benchmark results under the
> > > > Jira ticket and ping the authors of the commit (or relevant
> > > > developers) to investigate. Otherwise, set the resolution of the Jira
> > > > ticket to "Not a bug", post the conclusion and close the ticket.
> > > >
> > > > 5. If a regression is not fixed within a week of confirming that one
> > > > commit is the root cause of the regression, contact the release
> > > > manager to revert it (after confirming that reverting the changes
> > > > resolves the issue using benchmark-request[10]).
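
The decision rules in steps 3.4, 4 and 5 could be sketched roughly as follows. This is only an illustration of the draft's wording, not an existing tool: the class, the field names and the returned strings are hypothetical, and the "no action needed" branch for already fixed regressions is an assumption that the draft does not spell out.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Step 5 of the draft: escalate if not fixed within a week of confirmation.
REVERT_DEADLINE = timedelta(weeks=1)


@dataclass
class RegressionTicket:
    """Hypothetical record for one Jira ticket about a benchmark regression."""
    key: str                                        # e.g. "FLINK-30623"
    reproducible: bool                              # outcome of step 3
    root_cause_confirmed_on: Optional[date] = None  # day the commit was confirmed
    fixed: bool = False


def next_action(ticket: RegressionTicket, today: date) -> str:
    """Return the next step suggested by the draft procedure."""
    if not ticket.reproducible:
        # Steps 3.4 and 4: the regression is not real.
        return 'resolve as "Not a bug"'
    if ticket.fixed:
        # Assumption: nothing is left to do once a fix has landed.
        return "no action needed"
    if ticket.root_cause_confirmed_on is None:
        # Step 4: share results and involve the commit authors.
        return "post results and ping commit authors"
    if today - ticket.root_cause_confirmed_on > REVERT_DEADLINE:
        # Step 5: more than a week has passed without a fix.
        return "ask release manager to revert"
    return "wait for fix"
```

For instance, a reproducible ticket whose root cause was confirmed ten days ago and which is still unfixed would yield "ask release manager to revert".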
> > > >
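
To make the comparison in step 3.2 concrete, here is a minimal sketch that turns the two quoted benchmark-request results into relative changes. The function names are made up for illustration, and the noise check (change larger than the sum of both reported error margins) is a simple assumption of mine, not the rule used by the regression script. It also assumes that a higher score is better for these benchmarks, which matches how the thread treats the drop.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change of the score, e.g. -0.5 means the score halved."""
    return (after - before) / before


def exceeds_noise(before: float, before_err: float,
                  after: float, after_err: float) -> bool:
    """Assumed heuristic: the change is larger than both error margins combined."""
    return abs(after - before) > before_err + after_err


# (score, error) before and after FLINK-30533, copied from step 3.2.
results = {
    "checkpointSingleInput.UNALIGNED": ((333.635178, 8.169488), (61.536982, 3.581509)),
    "checkpointSingleInput.UNALIGNED_1": ((213.837107, 7.282883), (38.207438, 2.937051)),
}

for name, ((before, before_err), (after, after_err)) in results.items():
    change = relative_change(before, after)
    real = exceeds_noise(before, before_err, after, after_err)
    print(f"{name}: {change:+.1%} (outside error margins: {real})")
```

Both scores drop by roughly 80%, far outside the reported error margins, which is why the regression was considered real in this example.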
> > > > If the above process is considered acceptable, I can draft a version
> > > > and put it in the community wiki[10]. @Matthias had proposed to
> > > > incorporate performance regression monitoring into the release
> > > > management, and to have the regression tests monitored regularly by
> > > > release managers or volunteers. I'm glad to be one of the volunteers.
> > > >
> > > > Hope to hear your advice and opinions!
> > > >
> > > > [1] https://issues.apache.org/jira/browse/FLINK-29883
> > > > [2] https://issues.apache.org/jira/browse/FLINK-30015
> > > > [3] https://issues.apache.org/jira/browse/FLINK-29886
> > > > [4] https://issues.apache.org/jira/browse/FLINK-30181
> > > > [5] https://issues.apache.org/jira/browse/FLINK-30623
> > > > [6] https://issues.apache.org/jira/browse/FLINK-30624
> > > > [7] https://issues.apache.org/jira/browse/FLINK-30625
> > > > [8] https://github.com/apache/flink-benchmarks
> > > > [9] https://issues.apache.org/jira/browse/FLINK-18614
> > > > [10] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=115511847
> > > >
> > > > Best regards,
> > > > Yanfei
> > > > Ververica(Alibaba)
> > > >
> > > > On Thu, Jan 12, 2023 at 5:46 PM Yanfei Lei <fredia...@gmail.com> wrote:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > Thanks for the reminder.
> > > > >
> > > > > @Matthias
> > > > >
> > > > > any updates on the performance tests? ...or more specifically, any
> > > > updates
> > > > > on the script for alerting on performance regressions?
> > > > >
> > > > >
> > > > > I created a PR for FLINK-27571[1] but it's still under review.
> > > > > Would you like to help take a look?
> > > > >
> > > > > FLINK-27571 only covers the new benchmarks. For the existing
> > > > > benchmarks, the information is stored in codespeed's database,
> > > > > which can't be updated by URL request, so I also logged into the
> > > > > Jenkins master and modified codespeed's database. Currently "less
> > > > > is better" is displayed correctly on the timeline[2].
> > > > >
> > > > >
> > > > > Does it make sense to formalize/document the process?
> > > > >
> > > > > Certainly, I'm preparing a draft to share my experience of finding
> > > > > commits that caused regressions. Originally, I wanted to wait for
> > > > > FLINK-27571 to be merged before starting a discussion. I will share
> > > > > a draft of the document later.
> > > > >
> > > > > This slack channel can only provide notifications of regressions
> > > > > and some experience on how to locate them, but we also need people
> > > > > to take action after a regression happens. So far it is mainly a
> > > > > few volunteers who do these things, like for FLINK-30015[3] and
> > > > > FLINK-30623[4]; many thanks for Martijn's contribution.
> > > > >
> > > > > As for whether to add these responsibilities to the release
> > > > > manager role, I think we need to hear other people's opinions.
> > > > >
> > > > > @Martijn
> > > > >
> > > > > Thanks for creating these tickets. For FLINK-30623 and
> > > > > FLINK-30624[5], @Hangxiang and I have located the corresponding
> > > > > commit and pinged the corresponding submitter. Regressions may not
> > > > > be avoidable; I totally agree that this work needs to be formalized
> > > > > as soon as possible so that regressions get fixed.
> > > > >
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/FLINK-27571
> > > > >
> > > > > [2] http://codespeed.dak8s.net:8000/timeline/#/?ben=createScheduler.BATCH&extr=on&quarts=on&equid=off&env=2&revs=200&exe=1,3,5,6,8,9
> > > > >
> > > > > [3] https://issues.apache.org/jira/browse/FLINK-30015
> > > > >
> > > > > [4] https://issues.apache.org/jira/browse/FLINK-30623
> > > > >
> > > > > [5] https://issues.apache.org/jira/browse/FLINK-30624
> > > > >
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Yanfei
> > > > >
> > > > >
> > > > > On Wed, Jan 11, 2023 at 1:11 AM Martijn Visser <martijnvis...@apache.org> wrote:
> > > > >>
> > > > >> Hi all,
> > > > >>
> > > > >> Related to Matthias' email, I've checked the notifications in the
> > > > >> Slack channel and noticed three major benchmark regressions. In the
> > > > >> end, I've decided to create Jira tickets for them [1] [2] [3], but I
> > > > >> do agree that this work needs to be formalized as soon as possible
> > > > >> to avoid regressions. It would also be great to include a process on
> > > > >> how these regressions will be fixed, because I have no idea who to
> > > > >> ping/notify that these regressions have occurred.
> > > > >>
> > > > >> Best regards,
> > > > >>
> > > > >> Martijn
> > > > >>
> > > > >> [1] https://issues.apache.org/jira/browse/FLINK-30623
> > > > >> [2] https://issues.apache.org/jira/browse/FLINK-30624
> > > > >> [3] https://issues.apache.org/jira/browse/FLINK-30625
> > > > >>
> > > > >> On Tue, Jan 10, 2023 at 1:56 PM Matthias Pohl
> > > > >> <matthias.p...@aiven.io.invalid> wrote:
> > > > >>
> > > > >> > Hi Yanfei,
> > > > >> > any updates on the performance tests? ...or more specifically, any
> > > > >> > updates on the script for alerting on performance regressions?
> > > > >> >
> > > > >> > Does it make sense to formalize/document the process? Currently,
> > > > >> > the release management doesn't do anything in terms of performance
> > > > >> > test monitoring. Therefore, performance regressions are not
> > > > >> > necessarily identified actively (in contrast to CI instabilities).
> > > > >> > Or is this covered by the PMC? It would be interesting to know
> > > > >> > whether there's someone to reach out to who's monitoring the
> > > > >> > regression tests regularly. Would it make sense for this person to
> > > > >> > join the release calls?
> > > > >> >
> > > > >> > Or shall we work on formalizing/documenting the process and
> > > > >> > integrating this responsibility into what the release manager(s)
> > > > >> > are in charge of? My concern with that approach is that
> > > > >> > contributors might be less willing to volunteer in the release
> > > > >> > management if we collect everything in one role. Alternatively, we
> > > > >> > could split the release manager role up into sub-roles that
> > > > >> > contributors can volunteer for in a release (e.g. CI monitoring,
> > > > >> > performance test monitoring, Jira maintenance, ... just coming up
> > > > >> > with random tasks here).
> > > > >> >
> > > > >> > Alternatively, we could leave everything as is and just respond if
> > > > >> > there's some complaint. I'm curious about your (and others')
> > > > >> > opinions.
> > > > >> >
> > > > >> > Matthias
> > > > >> >
> > > > >> > On Tue, Nov 29, 2022 at 2:13 PM Yanfei Lei <fredia...@gmail.com> wrote:
> > > > >> >
> > > > >> > > Hi Martijn,
> > > > >> > >
> > > > >> > > Thanks for bringing this up.
> > > > >> > >
> > > > >> > > In the past two months, this channel has helped us find many
> > > > >> > > benchmark failure issues, like FLINK-29883[1], FLINK-29886[2],
> > > > >> > > FLINK-30015[3] and FLINK-30181[4]. I have also investigated
> > > > >> > > several of the frequently reported regressions and replied under
> > > > >> > > the notifications in the slack channel (copied here):
> > > > >> > >
> > > > >> > >    1. serializerHeavyString
> > > > >> > >    <http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=serializerHeavyString&extr=on&quarts=on&equid=off&env=2&revs=200>:
> > > > >> > >    It has been unstable for a long time, see FLINK-27165[5] for
> > > > >> > >    possible reasons.
> > > > >> > >    2. Regressions are detected by a simple script which may have
> > > > >> > >    false positives and false negatives. Especially for benchmarks
> > > > >> > >    with small absolute values, small value changes cause large
> > > > >> > >    percentage changes, see [6] for details.
> > > > >> > >
> > > > >> > >      Maybe slidingWindow
> > > > >> > >      <http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=slidingWindow&extr=on&quarts=on&equid=off&env=2&revs=200>
> > > > >> > >      (value~=600), stateBackends.ROCKS
> > > > >> > >      <http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=stateBackends.ROCKS&extr=on&quarts=on&equid=off&env=2&revs=200>
> > > > >> > >      (value~=260) and serializerHeavyString
> > > > >> > >      <http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=serializerHeavyString&extr=on&quarts=on&equid=off&env=2&revs=200>
> > > > >> > >      (value~=170) are not true regressions.
> > > > >> > >
> > > > >> > >    3. For deployAllTasks.STREAMING
> > > > >> > >    <http://codespeed.dak8s.net:8000/timeline/#/?exe=8&ben=deployAllTasks.STREAMING&extr=on&quarts=on&equid=off&env=2&revs=200>,
> > > > >> > >    the benchmark result is how much time it takes to deploy a
> > > > >> > >    job, so the lower the value, the better the performance, see
> > > > >> > >    [7] for details. FLINK-27571[8] would fix this problem.
> > > > >> > >
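
The point above about small absolute values can be illustrated with a short sketch. The baselines are the approximate scores quoted in the message; the fluctuation of 20 score units is a hypothetical number picked only for illustration, and the code does not reproduce what regression_report.py actually computes.

```python
def percent_change(baseline: float, value: float) -> float:
    """Change of value relative to baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# Approximate scores mentioned in the message above.
baselines = {
    "slidingWindow": 600.0,
    "stateBackends.ROCKS": 260.0,
    "serializerHeavyString": 170.0,
}

# Hypothetical run-to-run fluctuation, identical for all three benchmarks.
FLUCTUATION = 20.0

for name, baseline in baselines.items():
    change = percent_change(baseline, baseline - FLUCTUATION)
    print(f"{name}: {change:.1f}%")
```

The same absolute fluctuation is a much larger percentage for the benchmark with the smallest score, so a fixed percentage threshold is more likely to flag it even when nothing has really changed.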
> > > > >> > >
> > > > >> > > As mentioned before, regressions are detected by a simple script
> > > > >> > > that is not very stable; FLINK-29825[9] was created to improve the
> > > > >> > > benchmark's stability. I planned to invite more volunteers to
> > > > >> > > monitor it once the regression checking became more stable, but
> > > > >> > > I've been stuck with something else lately, sorry for the late
> > > > >> > > response. Any suggestions on handling benchmark
> > > > >> > > regressions/failures are welcome.
> > > > >> > >
> > > > >> > > [1] https://issues.apache.org/jira/browse/FLINK-29883
> > > > >> > >
> > > > >> > > [2] https://issues.apache.org/jira/browse/FLINK-29886
> > > > >> > >
> > > > >> > > [3] https://issues.apache.org/jira/browse/FLINK-30015
> > > > >> > >
> > > > >> > > [4] https://issues.apache.org/jira/browse/FLINK-30181
> > > > >> > >
> > > > >> > > [5] https://issues.apache.org/jira/browse/FLINK-27165
> > > > >> > >
> > > > >> > > [6] https://github.com/apache/flink-benchmarks/blob/master/regression_report.py#L132-L136
> > > > >> > >
> > > > >> > > [7] https://github.com/apache/flink-benchmarks/blob/master/src/main/java/org/apache/flink/scheduler/benchmark/deploying/DeployingTasksInStreamingJobBenchmarkExecutor.java#L58
> > > > >> > >
> > > > >> > > [8] https://issues.apache.org/jira/browse/FLINK-27571
> > > > >> > >
> > > > >> > > [9] https://issues.apache.org/jira/browse/FLINK-29825
> > > > >> > >
> > > > >> > >
> > > > >> > > Best,
> > > > >> > >
> > > > >> > > Yanfei
> > > > >> > >
> > > > >> > > On Tue, Nov 29, 2022 at 3:54 PM Martijn Visser <martijnvis...@apache.org> wrote:
> > > > >> > >
> > > > >> > > > Hi,
> > > > >> > > >
> > > > >> > > > Is there any update to be expected on the benchmark? I see
> > > > >> > > > results of the benchmark being posted to Slack, but it appears
> > > > >> > > > that it's not being monitored and no follow-up actions are being
> > > > >> > > > taken. I think it's currently lacking a process on how to
> > > > >> > > > interpret the results and what action should be taken and by
> > > > >> > > > whom.
> > > > >> > > >
> > > > >> > > > Best regards,
> > > > >> > > >
> > > > >> > > > Martijn
> > > > >> > > >
> > > > >> > > > On Thu, Nov 3, 2022 at 12:22 PM Jing Ge <j...@ververica.com> wrote:
> > > > >> > > >
> > > > >> > > > > Thanks yanfei for driving this!
> > > > >> > > > >
> > > > >> > > > > Looking forward to further discussion w.r.t. the workflow.
> > > > >> > > > >
> > > > >> > > > > Best regards,
> > > > >> > > > > Jing
> > > > >> > > > >
> > > > >> > > > > On Mon, Oct 31, 2022 at 6:04 PM Mason Chen <mas.chen6...@gmail.com> wrote:
> > > > >> > > > >
> > > > >> > > > > > +1, thanks for driving this!
> > > > >> > > > > >
> > > > >> > > > > > On a side note, can we also ensure that a performance summary
> > > > >> > > > > > report for Flink major version upgrades is in the release
> > > > >> > > > > > notes, once this infrastructure becomes mature? From the user
> > > > >> > > > > > perspective, it would be nice to know what the expected (or
> > > > >> > > > > > unexpected) regressions in a major version upgrade are. I've
> > > > >> > > > > > seen the community do something like this before (e.g. the
> > > > >> > > > > > major rocksdb version bump in 1.14?) and it was quite valuable
> > > > >> > > > > > to know that upfront!
> > > > >> > > > > >
> > > > >> > > > > > Best,
> > > > >> > > > > > Mason
> > > > >> > > > > >
> > > > >> > > > > > On Fri, Oct 28, 2022 at 1:46 AM weijie guo <guoweijieres...@gmail.com> wrote:
> > > > >> > > > > >
> > > > >> > > > > > > Thanks Yanfei for driving this.
> > > > >> > > > > > >
> > > > >> > > > > > > It allows us to easily find performance regressions.
> > > > >> > > > > > > Especially since I have recently made some improvements to the
> > > > >> > > > > > > scheduling-related parts, your work is very important to
> > > > >> > > > > > > ensure that these changes do not cause unexpected problems.
> > > > >> > > > > > >
> > > > >> > > > > > > Best regards,
> > > > >> > > > > > >
> > > > >> > > > > > > Weijie
> > > > >> > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > > On Fri, Oct 28, 2022 at 4:03 PM Congxian Qiu <qcx978132...@gmail.com> wrote:
> > > > >> > > > > > >
> > > > >> > > > > > > > Thanks for driving this and making the performance monitoring
> > > > >> > > > > > > > public; this helps us notice and resolve performance problems
> > > > >> > > > > > > > quickly.
> > > > >> > > > > > > >
> > > > >> > > > > > > > Looking forward to the workflow and detailed descriptions of
> > > > >> > > > > > > > flink-dev-benchmarks.
> > > > >> > > > > > > >
> > > > >> > > > > > > > Best,
> > > > >> > > > > > > > Congxian
> > > > >> > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > > > On Thu, Oct 27, 2022 at 12:41 PM Yun Tang <myas...@live.com> wrote:
> > > > >> > > > > > > >
> > > > >> > > > > > > > > Thanks, Yanfei, for driving this to monitor the performance in
> > > > >> > > > > > > > > the Apache Flink Slack channel.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Looking forward to the workflow and detailed descriptions of
> > > > >> > > > > > > > > flink-dev-benchmarks.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Best
> > > > >> > > > > > > > > Yun Tang
> > > > >> > > > > > > > > ________________________________
> > > > >> > > > > > > > > From: Hangxiang Yu <master...@gmail.com>
> > > > >> > > > > > > > > Sent: Thursday, October 27, 2022 10:59
> > > > >> > > > > > > > > To: dev@flink.apache.org <dev@flink.apache.org>
> > > > >> > > > > > > > > Subject: Re: [ANNOUNCE] Performance Daily Monitoring Moved from
> > > > >> > > > > > > > > Ververica to Apache Flink Slack Channel
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Hi, Yanfei.
> > > > >> > > > > > > > > Thanks for driving this.
> > > > >> > > > > > > > > It could help us to detect and resolve regression problems
> > > > >> > > > > > > > > quickly and officially.
> > > > >> > > > > > > > > I'd like to join as a maintainer.
> > > > >> > > > > > > > > Looking forward to the workflow.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > On Wed, Oct 26, 2022 at 5:18 PM Yuan Mei <yuanmei.w...@gmail.com> wrote:
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > > Thanks, Yanfei, for driving this and making the performance
> > > > >> > > > > > > > > > monitoring publicly available.
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Looking forward to seeing the workflow, and more details as
> > > > >> > > > > > > > > > Martijn mentioned.
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Best
> > > > >> > > > > > > > > > Yuan
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > On Wed, Oct 26, 2022 at 2:59 PM Martijn Visser <martijnvis...@apache.org> wrote:
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > > Hi Yanfei Lei,
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > Thanks for setting this up! It would be interesting to also
> > > > >> > > > > > > > > > > know which aspects of Flink are monitored for "performance".
> > > > >> > > > > > > > > > > I'm assuming there are specific pieces of functionality that
> > > > >> > > > > > > > > > > are performance tested, but it would be great if this would be
> > > > >> > > > > > > > > > > written down somewhere (next to a procedure on how to detect a
> > > > >> > > > > > > > > > > regression and what the next steps should be).
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > Best regards,
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > Martijn
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > On Wed, Oct 26, 2022 at 8:21 AM Zakelly Lan <zakelly....@gmail.com> wrote:
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > > Hi yanfei,
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > Thanks for driving this! It's a great help.
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > I would like to join as a maintainer.
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > Best,
> > > > >> > > > > > > > > > > > Zakelly
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > On Wed, Oct 26, 2022 at 11:32 AM yanfei lei <fredia...@gmail.com> wrote:
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > Hi everyone,
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > As discussed earlier, we plan to create a benchmark channel in Apache
> > > > >> > > > > > > > > > > > > Flink slack[1], but the plan was shelved for a while[2]. So I went on
> > > > >> > > > > > > > > > > > > with this work, and created the #flink-dev-benchmarks channel for
> > > > >> > > > > > > > > > > > > performance regression notifications.
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > We have a regression report script[3] that runs daily, and a
> > > > >> > > > > > > > > > > > > notification is sent to the slack channel when the last few benchmark
> > > > >> > > > > > > > > > > > > results are significantly worse than the baseline.
> > > > >> > > > > > > > > > > > > Note that regressions are detected by a simple script which may have
> > > > >> > > > > > > > > > > > > false positives and false negatives. Also, all benchmarks are executed
> > > > >> > > > > > > > > > > > > on one physical machine[4] which is provided by Ververica(Alibaba)[5],
> > > > >> > > > > > > > > > > > > so hardware issues may affect performance, like "[FLINK-18614
> > > > >> > > > > > > > > > > > > <https://issues.apache.org/jira/browse/FLINK-18614>] Performance
> > > > >> > > > > > > > > > > > > regression 2020.07.13"[6].
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > After the migration, we need a procedure to watch over the entire
> > > > >> > > > > > > > > > > > > performance of Flink code together. For example, if a regression
> > > > >> > > > > > > > > > > > > occurs, investigating the cause and resolving the problem are needed.
> > > > >> > > > > > > > > > > > > In the past, this procedure was maintained internally within
> > > > >> > > > > > > > > > > > > Ververica, but we think making the procedure public would benefit all.
> > > > >> > > > > > > > > > > > > I volunteer to serve as one of the initial maintainers, and would be
> > > > >> > > > > > > > > > > > > glad if more contributors can join me. I'll also prepare some
> > > > >> > > > > > > > > > > > > guidelines to help others get familiar with the workflow. I will start
> > > > >> > > > > > > > > > > > > a new thread to discuss the workflow soon.
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > [1] https://www.mail-archive.com/dev@flink.apache.org/msg58666.html
> > > > >> > > > > > > > > > > > > [2] https://issues.apache.org/jira/browse/FLINK-28468
> > > > >> > > > > > > > > > > > > [3] https://github.com/apache/flink-benchmarks/blob/master/regression_report.py
> > > > >> > > > > > > > > > > > > [4] http://codespeed.dak8s.net:8080
> > > > >> > > > > > > > > > > > > [5] https://lists.apache.org/thread/jzljp4233799vwwqnr0vc9wgqs0xj1ro
> > > > >> > > > > > > > > > > > > [6] https://issues.apache.org/jira/browse/FLINK-18614
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > --
> > > > >> > > > > > > > > Best,
> > > > >> > > > > > > > > Hangxiang.
> > > > >> > > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >
> > > > >
> > > >
> > > >
> > > > On Thu, Jan 12, 2023 at 5:46 PM Yanfei Lei <fredia...@gmail.com> wrote:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > Thanks for the reminder.
> > > > >
> > > > > @Matthias
> > > > >
> > > > > any updates on the performance tests? ...or more specifically, any
> > > > > updates on the script for alerting on performance regressions?
> > > > >
> > > > >
> > > > > I created a PR for FLINK-27571[1], but it's still under review.
> > > > > Would you like to help take a look?
> > > > >
> > > > > FLINK-27571 only covers the new benchmarks. The information for the
> > > > > existing benchmarks is stored in codespeed's database, which can't be
> > > > > updated by a URL request, so I also logged into the Jenkins master
> > > > > and modified codespeed's database. "Less is better" is now displayed
> > > > > correctly on the timeline[2].
> > > > >
> > > > >
> > > > > Does it make sense to formalize/document the process?
> > > > >
> > > > > Certainly. I'm preparing a draft to share my experience of finding
> > > > > the commits that caused regressions. Originally, I wanted to wait for
> > > > > FLINK-27571 to be merged before starting a discussion; I will share
> > > > > a draft of the document later.
> > > > >
> > > > >
> > > > > This Slack channel can only provide notice of a regression and some
> > > > > experience on how to locate it, but we also need people to take
> > > > > action after a regression happens. So far it is mainly a few
> > > > > volunteers who do this, as with FLINK-30015[3] and FLINK-30623[4];
> > > > > many thanks for Martijn's contribution.
> > > > >
> > > > > As for whether to add these responsibilities to the release manager
> > > > > role, I think we need to hear other people's opinions.
> > > > >
> > > > > @Martijn
> > > > >
> > > > > Thanks for creating these tickets. For FLINK-30623 and
> > > > > FLINK-30624[5], @Hangxiang and I have located the corresponding
> > > > > commit and pinged the corresponding submitter. Regressions cannot
> > > > > always be avoided, and I totally agree that this work needs to be
> > > > > formalized as soon as possible so that regressions get fixed.
> > > > >
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/FLINK-27571
> > > > >
> > > > > [2]
> > > > > http://codespeed.dak8s.net:8000/timeline/#/?ben=createScheduler.BATCH&extr=on&quarts=on&equid=off&env=2&revs=200&exe=1,3,5,6,8,9
> > > > >
> > > > > [3] https://issues.apache.org/jira/browse/FLINK-30015
> > > > >
> > > > > [4] https://issues.apache.org/jira/browse/FLINK-30623
> > > > >
> > > > > [5] https://issues.apache.org/jira/browse/FLINK-30624
> > > > >
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Yanfei
> > > > >
> > > > >
> > > > > On Wed, Jan 11, 2023 at 1:11 AM Martijn Visser <martijnvis...@apache.org> wrote:
> > > > >>
> > > > >> Hi all,
> > > > >>
> > > > >> Related to Matthias' email, I've checked the notifications in the
> > > > >> Slack channel and noticed three major benchmark regressions. In the
> > > > >> end, I've decided to create Jira tickets for them [1] [2] [3], but I
> > > > >> do agree that this work needs to be formalized as soon as possible
> > > > >> to avoid regressions. It would also be great to include a process on
> > > > >> how these regressions will be fixed, because I have no idea who to
> > > > >> ping/notify that these regressions have occurred.
> > > > >>
> > > > >> Best regards,
> > > > >>
> > > > >> Martijn
> > > > >>
> > > > >> [1] https://issues.apache.org/jira/browse/FLINK-30623
> > > > >> [2] https://issues.apache.org/jira/browse/FLINK-30624
> > > > >> [3] https://issues.apache.org/jira/browse/FLINK-30625
> > > > >>
> > > > >> On Tue, Jan 10, 2023 at 1:56 PM Matthias Pohl
> > > > >> <matthias.p...@aiven.io.invalid> wrote:
> > > > >>
> > > > >> > Hi Yanfei,
> > > > >> > any updates on the performance tests? ...or more specifically, any
> > > > >> > updates on the script for alerting on performance regressions?
> > > > >> >
> > > > >> > Does it make sense to formalize/document the process? Currently,
> > > > >> > the release management doesn't do anything in terms of performance
> > > > >> > test monitoring. Therefore, performance regressions are not
> > > > >> > necessarily identified actively (in contrast to CI instabilities).
> > > > >> > Or is this covered by the PMC? It would be interesting to know
> > > > >> > whether there's someone to reach out to who's monitoring the
> > > > >> > regression tests regularly. Would it make sense for this person to
> > > > >> > join the release calls?
> > > > >> >
> > > > >> > Or shall we work on formalizing/documenting the process and
> > > > >> > integrating this responsibility into what the release manager(s)
> > > > >> > are in charge of? My concern with that approach is that
> > > > >> > contributors might be less willing to volunteer in the release
> > > > >> > management if we collect everything in one role. Alternatively, we
> > > > >> > could split the release manager role up into sub-roles that
> > > > >> > contributors can volunteer for in a release (e.g. CI monitoring,
> > > > >> > performance test monitoring, Jira maintenance, ... just coming up
> > > > >> > with random tasks here).
> > > > >> >
> > > > >> > Alternatively, we could leave everything as is and just respond if
> > > > >> > there's some complaint. I'm curious about your (and others')
> > > > >> > opinions.
> > > > >> >
> > > > >> > Matthias
> > > > >> >
> > > > >> > On Tue, Nov 29, 2022 at 2:13 PM Yanfei Lei <fredia...@gmail.com> wrote:
> > > > >> >
> > > > >> > > Hi Martijn,
> > > > >> > >
> > > > >> > > Thanks for bringing this up.
> > > > >> > >
> > > > >> > > In the past two months, this channel has helped us find many
> > > > >> > > benchmark failures, like FLINK-29883[1], FLINK-29886[2],
> > > > >> > > FLINK-30015[3] and FLINK-30181[4]. I have also investigated
> > > > >> > > several of the frequently reported regressions and replied under
> > > > >> > > the notifications in the Slack channel (copied here):
> > > > >> > >
> > > > >> > >    1. serializerHeavyString
> > > > >> > >    <http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=serializerHeavyString&extr=on&quarts=on&equid=off&env=2&revs=200>:
> > > > >> > >    It has been unstable for a long time; see FLINK-27165[5] for
> > > > >> > >    possible reasons.
> > > > >> > >    2. Regressions are detected by a simple script which may have
> > > > >> > >    false positives and false negatives. This is especially true
> > > > >> > >    for benchmarks with small absolute values, where small value
> > > > >> > >    changes cause large percentage changes; see [6] for details.
> > > > >> > >    slidingWindow
> > > > >> > >    <http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=slidingWindow&extr=on&quarts=on&equid=off&env=2&revs=200>
> > > > >> > >    (value~=600), stateBackends.ROCKS
> > > > >> > >    <http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=stateBackends.ROCKS&extr=on&quarts=on&equid=off&env=2&revs=200>
> > > > >> > >    (value~=260) and serializerHeavyString
> > > > >> > >    <http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=serializerHeavyString&extr=on&quarts=on&equid=off&env=2&revs=200>
> > > > >> > >    (value~=170) may not be true regressions.
> > > > >> > >    3. For deployAllTasks.STREAMING
> > > > >> > >    <http://codespeed.dak8s.net:8000/timeline/#/?exe=8&ben=deployAllTasks.STREAMING&extr=on&quarts=on&equid=off&env=2&revs=200>,
> > > > >> > >    the benchmark result is the time it takes to deploy a job, so
> > > > >> > >    a lower value means better performance; see [7] for details.
> > > > >> > >    FLINK-27571[8] would fix this problem.
> > > > >> > >
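
A minimal sketch of the two effects described in the list above: the same absolute change is a much larger percentage for a benchmark with a small score, and a time-based benchmark needs its sign flipped because less is better. The function names, the 5% threshold, and the score of 6000 are illustrative assumptions; they are not taken from regression_report.py.

```python
def relative_change(baseline: float, current: float,
                    less_is_better: bool = False) -> float:
    """Signed relative change; a negative result means 'worse'."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    change = (current - baseline) / abs(baseline)
    # For time-like metrics a larger value is worse, so flip the sign.
    return -change if less_is_better else change


def is_regression(baseline: float, current: float,
                  threshold: float = 0.05,
                  less_is_better: bool = False) -> bool:
    """Flag a regression when the score got worse by more than `threshold`."""
    return relative_change(baseline, current, less_is_better) < -threshold


# The same absolute drop of 15 units, on a small and on a large score.
small_score = relative_change(170.0, 155.0)    # roughly -8.8%
large_score = relative_change(6000.0, 5985.0)  # -0.25%

# A deployment-time benchmark: going from 100 to 120 is worse, not better.
deploy_slower = is_regression(100.0, 120.0, less_is_better=True)
```

With a fixed percentage threshold, the small-score benchmark is flagged while the large-score one is not, although both moved by the same absolute amount.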
> > > > >> > >
> > > > >> > > As mentioned before, regressions are detected by a simple script
> > > > >> > > that is not very stable; FLINK-29825[9] was created to improve
> > > > >> > > the benchmark's stability. I planned to invite more volunteers
> > > > >> > > to monitor it after the regression check became more stable, but
> > > > >> > > I've been stuck with something else lately. Sorry for the late
> > > > >> > > response. Any suggestions on handling benchmark
> > > > >> > > regressions/failures are welcome.
> > > > >> > >
> > > > >> > > [1] https://issues.apache.org/jira/browse/FLINK-29883
> > > > >> > >
> > > > >> > > [2] https://issues.apache.org/jira/browse/FLINK-29886
> > > > >> > >
> > > > >> > > [3] https://issues.apache.org/jira/browse/FLINK-30015
> > > > >> > >
> > > > >> > > [4] https://issues.apache.org/jira/browse/FLINK-30181
> > > > >> > >
> > > > >> > > [5] https://issues.apache.org/jira/browse/FLINK-27165
> > > > >> > >
> > > > >> > > [6]
> > > > >> > > https://github.com/apache/flink-benchmarks/blob/master/regression_report.py#L132-L136
> > > > >> > >
> > > > >> > > [7]
> > > > >> > > https://github.com/apache/flink-benchmarks/blob/master/src/main/java/org/apache/flink/scheduler/benchmark/deploying/DeployingTasksInStreamingJobBenchmarkExecutor.java#L58
> > > > >> > >
> > > > >> > > [8] https://issues.apache.org/jira/browse/FLINK-27571
> > > > >> > >
> > > > >> > > [9] https://issues.apache.org/jira/browse/FLINK-29825
> > > > >> > >
> > > > >> > >
> > > > >> > > Best,
> > > > >> > >
> > > > >> > > Yanfei
> > > > >> > >
> > > > >> > > On Tue, Nov 29, 2022 at 3:54 PM Martijn Visser <martijnvis...@apache.org> wrote:
> > > > >> > >
> > > > >> > > > Hi,
> > > > >> > > >
> > > > >> > > > Is there any update to be expected on the benchmark? I see
> > > > >> > > > results of the benchmark being posted to Slack, but it appears
> > > > >> > > > that it's not being monitored and no follow-up actions are
> > > > >> > > > being taken. I think it's currently lacking a process on how
> > > > >> > > > to interpret the results and what action should be taken and
> > > > >> > > > by whom.
> > > > >> > > >
> > > > >> > > > Best regards,
> > > > >> > > >
> > > > >> > > > Martijn
> > > > >> > > >
> > > > >> > > > On Thu, Nov 3, 2022 at 12:22 PM Jing Ge <j...@ververica.com> wrote:
> > > > >> > > >
> > > > >> > > > > Thanks Yanfei for driving this!
> > > > >> > > > >
> > > > >> > > > > Looking forward to further discussion w.r.t. the workflow.
> > > > >> > > > >
> > > > >> > > > > Best regards,
> > > > >> > > > > Jing
> > > > >> > > > >
> > > > >> > > > > On Mon, Oct 31, 2022 at 6:04 PM Mason Chen <mas.chen6...@gmail.com> wrote:
> > > > >> > > > >
> > > > >> > > > > > +1, thanks for driving this!
> > > > >> > > > > >
> > > > >> > > > > > On a side note, can we also ensure that a performance
> > > > >> > > > > > summary report for Flink major version upgrades is in the
> > > > >> > > > > > release notes, once this infrastructure becomes mature?
> > > > >> > > > > > From the user perspective, it would be nice to know what
> > > > >> > > > > > the expected (or unexpected) regressions in a major
> > > > >> > > > > > version upgrade are. I've seen the community do something
> > > > >> > > > > > like this before (e.g. the major RocksDB version bump in
> > > > >> > > > > > 1.14?) and it was quite valuable to know that upfront!
> > > > >> > > > > >
> > > > >> > > > > > Best,
> > > > >> > > > > > Mason
> > > > >> > > > > >
> > > > >> > > > > > On Fri, Oct 28, 2022 at 1:46 AM weijie guo <guoweijieres...@gmail.com> wrote:
> > > > >> > > > > >
> > > > >> > > > > > > Thanks Yanfei for driving this.
> > > > >> > > > > > >
> > > > >> > > > > > > It allows us to easily find performance regressions.
> > > > >> > > > > > > Especially since I have recently made some improvements
> > > > >> > > > > > > to the scheduling-related parts, your work is very
> > > > >> > > > > > > important to ensure that these changes do not cause
> > > > >> > > > > > > unexpected problems.
> > > > >> > > > > > >
> > > > >> > > > > > > Best regards,
> > > > >> > > > > > >
> > > > >> > > > > > > Weijie
> > > > >> > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > > On Fri, Oct 28, 2022 at 4:03 PM Congxian Qiu <qcx978132...@gmail.com> wrote:
> > > > >> > > > > > >
> > > > >> > > > > > > > Thanks for driving this and making the performance
> > > > >> > > > > > > > monitoring public; this helps us notice and resolve
> > > > >> > > > > > > > performance problems quickly.
> > > > >> > > > > > > >
> > > > >> > > > > > > > Looking forward to the workflow and detailed
> > > > >> > > > > > > > descriptions of flink-dev-benchmarks.
> > > > >> > > > > > > >
> > > > >> > > > > > > > Best,
> > > > >> > > > > > > > Congxian
> > > > >> > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > > > On Thu, Oct 27, 2022 at 12:41 PM Yun Tang <myas...@live.com> wrote:
> > > > >> > > > > > > >
> > > > >> > > > > > > > > Thanks, Yanfei, for driving this to monitor the
> > > > >> > > > > > > > > performance in the Apache Flink Slack channel.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Looking forward to the workflow and detailed
> > > > >> > > > > > > > > descriptions of flink-dev-benchmarks.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Best
> > > > >> > > > > > > > > Yun Tang
> > > > >> > > > > > > > > ________________________________
> > > > >> > > > > > > > > From: Hangxiang Yu <master...@gmail.com>
> > > > >> > > > > > > > > Sent: Thursday, October 27, 2022 10:59
> > > > >> > > > > > > > > To: dev@flink.apache.org <dev@flink.apache.org>
> > > > >> > > > > > > > > Subject: Re: [ANNOUNCE] Performance Daily Monitoring
> > > > >> > > > > > > > > Moved from Ververica to Apache Flink Slack Channel
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Hi, Yanfei.
> > > > >> > > > > > > > > Thanks for driving this.
> > > > >> > > > > > > > > It could help us to detect and resolve regression
> > > > >> > > > > > > > > problems quickly and officially.
> > > > >> > > > > > > > > I'd like to join as a maintainer.
> > > > >> > > > > > > > > Looking forward to the workflow.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > On Wed, Oct 26, 2022 at 5:18 PM Yuan Mei <yuanmei.w...@gmail.com> wrote:
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > > Thanks, Yanfei, for driving this and making the
> > > > >> > > > > > > > > > performance monitoring publicly available.
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Looking forward to seeing the workflow, and more
> > > > >> > > > > > > > > > details as Martijn mentioned.
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Best
> > > > >> > > > > > > > > > Yuan
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > On Wed, Oct 26, 2022 at 2:59 PM Martijn Visser <martijnvis...@apache.org> wrote:
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > > Hi Yanfei Lei,
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > Thanks for setting this up! It would be
> > > > >> > > > > > > > > > > interesting to also know which aspects of Flink
> > > > >> > > > > > > > > > > are monitored for "performance". I'm assuming
> > > > >> > > > > > > > > > > there are specific pieces of functionality that
> > > > >> > > > > > > > > > > are performance tested, but it would be great if
> > > > >> > > > > > > > > > > this would be written down somewhere (next to a
> > > > >> > > > > > > > > > > procedure on how to detect a regression and what
> > > > >> > > > > > > > > > > the next steps should be).
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > Best regards,
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > Martijn
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > On Wed, Oct 26, 2022 at 8:21 AM Zakelly Lan <zakelly....@gmail.com> wrote:
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > > Hi yanfei,
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > Thanks for driving this! It's a great help.
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > I would like to join as a maintainer.
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > Best,
> > > > >> > > > > > > > > > > > Zakelly
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > On Wed, Oct 26, 2022 at 11:32 AM yanfei lei <fredia...@gmail.com> wrote:
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > Hi everyone,
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > As discussed earlier, we planned to create a benchmark
> > > > >> > > > > > > > > > > > > channel in the Apache Flink Slack[1], but the plan was
> > > > >> > > > > > > > > > > > > shelved for a while[2]. So I went on with this work and
> > > > >> > > > > > > > > > > > > created the #flink-dev-benchmarks channel for
> > > > >> > > > > > > > > > > > > performance regression notifications.
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > We have a regression report script[3] that runs daily,
> > > > >> > > > > > > > > > > > > and a notification is sent to the Slack channel when
> > > > >> > > > > > > > > > > > > the last few benchmark results are significantly worse
> > > > >> > > > > > > > > > > > > than the baseline.
> > > > >> > > > > > > > > > > > > Note that regressions are detected by a simple script
> > > > >> > > > > > > > > > > > > which may have false positives and false negatives.
> > > > >> > > > > > > > > > > > > Also, all benchmarks are executed on one physical
> > > > >> > > > > > > > > > > > > machine[4], which is provided by Ververica
> > > > >> > > > > > > > > > > > > (Alibaba)[5], so hardware issues may affect
> > > > >> > > > > > > > > > > > > performance, as in "[FLINK-18614] Performance
> > > > >> > > > > > > > > > > > > regression 2020.07.13"[6].
> > > > >> > > > > > > > > > > > >
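
The check described above (the last few results compared against a baseline) could look roughly like the following sketch. It assumes higher scores are better and compares medians over hypothetical window sizes; the real regression_report.py may use different statistics, windows, and thresholds.

```python
from statistics import median


def detect_regression(results, recent=3, baseline=20, threshold=0.05):
    """Return True if the recent results look worse than the baseline.

    `results` is an oldest-first list of scores where higher is better.
    The window sizes and the 5% threshold are illustrative assumptions.
    """
    if len(results) < recent + 1:
        return False  # not enough history to compare against
    recent_part = results[-recent:]
    baseline_part = results[-(recent + baseline):-recent]
    base = median(baseline_part)
    if base == 0:
        return False  # relative change is undefined for a zero baseline
    # Medians make the check less sensitive to a single noisy run.
    return (median(recent_part) - base) / base < -threshold
```

Using the median of the recent window means one bad run does not trigger a notification on its own, at the cost of reporting a real regression a run or two later.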
> > > > >> > > > > > > > > > > > > After the migration, we need a procedure to watch over
> > > > >> > > > > > > > > > > > > the overall performance of the Flink code together.
> > > > >> > > > > > > > > > > > > For example, if a regression occurs, someone needs to
> > > > >> > > > > > > > > > > > > investigate the cause and resolve the problem. In the
> > > > >> > > > > > > > > > > > > past, this procedure was maintained internally within
> > > > >> > > > > > > > > > > > > Ververica, but we think making the procedure public
> > > > >> > > > > > > > > > > > > would benefit all. I volunteer to serve as one of the
> > > > >> > > > > > > > > > > > > initial maintainers, and would be glad if more
> > > > >> > > > > > > > > > > > > contributors can join me. I'd also prepare some
> > > > >> > > > > > > > > > > > > guidelines to help others get familiar with the
> > > > >> > > > > > > > > > > > > workflow. I will start a new thread to discuss the
> > > > >> > > > > > > > > > > > > workflow soon.
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > [1] https://www.mail-archive.com/dev@flink.apache.org/msg58666.html
> > > > >> > > > > > > > > > > > > [2] https://issues.apache.org/jira/browse/FLINK-28468
> > > > >> > > > > > > > > > > > > [3] https://github.com/apache/flink-benchmarks/blob/master/regression_report.py
> > > > >> > > > > > > > > > > > > [4] http://codespeed.dak8s.net:8080
> > > > >> > > > > > > > > > > > > [5] https://lists.apache.org/thread/jzljp4233799vwwqnr0vc9wgqs0xj1ro
> > > > >> > > > > > > > > > > > > [6] https://issues.apache.org/jira/browse/FLINK-18614
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > --
> > > > >> > > > > > > > > Best,
> > > > >> > > > > > > > > Hangxiang.
> > > > >> > > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Best,
> > > > Yanfei
> > > >
> > > >
> > >
> >
>
