+1
I don't have anything substantive to add, but I want to express how pleased I am to see this conversation happening.
David

On Thu, Feb 2, 2023 at 5:09 AM Martijn Visser <martijnvis...@apache.org> wrote:
> Hi all,
>
> +1 for the overall proposal. My feedback matches what Matthias has already provided earlier.
>
> Best regards,
>
> Martijn
>
> On Tue, Jan 31, 2023 at 19:01 Matthias Pohl <matthias.p...@aiven.io.invalid> wrote:
> >
> > Thanks for the effort you put into this discussion, Yanfei.
> >
> > For me, the regression tests are similar to test instabilities. In this sense, I agree with what Piotr and Jing said about it:
> > - A regression needs to be identified and fixed as soon as possible, to avoid the change affecting other contributions (e.g. hiding other regressions) and making it harder to revert.
> > - Contributors should be made aware of regressions that are caused by their commits during daily development. They should be enabled to react proactively to instabilities/regressions caused by their contributions. The available Slack channel and mailing lists are good tools for enabling contributors to be proactive.
> >
> > My experience is that contributors are quite proactive when it comes to resolving test instabilities on master. Still, a dedicated role for watching over CI is necessary to identify issues that slipped through. The same applies to performance tests. Even so, I could imagine that for performance tests the proactiveness of contributors is lower, because quite a few changes simply do not affect performance. One idea to raise awareness might be to mention the performance tests in the PR template (maybe next to some of the yes/no questions, where a "yes" might indicate that performance is affected).
> >
> > About making monitoring of performance tests part of the release management: right now, watching CI for test instabilities and pinging contributors on issues is already a task for release managers. Extending this responsibility to also cover the regression tests seems natural.
> >
> > Yanfei's write-up is a good proposal for general performance test guidelines. It will help contributors and release managers alike. We could integrate it into the release testing documentation [1]. I'm wondering whether we would want to add a new Jira label to group these kinds of Jira issues, analogously to test-instabilities [2].
> >
> > That said, I want to repeat the idea of organizing the release management documentation in a way that we have not only general release managers with a bunch of different tasks, but dedicated roles within the release management. Roles I could imagine based on our experience from past releases are:
> > - CI monitoring: watching the #builds Slack channel for test instabilities
> > - Regression test monitoring: watching #flink-dev-benchmarks for regressions
> > - Coordination: announcements, documentation, organizational work (e.g. the release call)
> >
> > Having dedicated roles might help find volunteers for the release management. Volunteers could sign up for a dedicated role, and each role would provide a clear set of tasks/responsibilities. This doesn't restrict us in any way because multiple roles can also be fulfilled by a single person; it's just about cutting the tasks into smaller chunks to make them less overwhelming.
> >
> > Matthias
> >
> > [1] https://cwiki.apache.org/confluence/display/FLINK/Release+Management+and+Feature+Plan
> > [2] https://cwiki.apache.org/confluence/display/FLINK/Flink+Jira+Process
> >
> > On Tue, Jan 31, 2023 at 5:16 PM Jing Ge <j...@ververica.com.invalid> wrote:
> > >
> > > Hi Yanfei,
> > >
> > > Thanks for your proposal and for driving it.
> > >
> > > I really like the document on how to deal with performance regressions; it will coach more developers to be able to work on them. I would suggest making more developers aware of performance regressions during daily development, not only release managers. A document like the one you drafted is the key to getting them involved. It would be great if the document could be extended to cover all types of regressions, case by case. It would be a one-time effort with a long-lasting benefit.
> > >
> > > Best regards,
> > > Jing
> > >
> > > On Tue, Jan 31, 2023 at 12:10 PM Piotr Nowojski <pnowoj...@apache.org> wrote:
> > > >
> > > > Hi,
> > > >
> > > > @Dong, I think the drawback of your proposal is that it wouldn't detect a visible performance regression that sits within the benchmark noise, while this should be doable with a large enough number of samples.
> > > >
> > > > @Yanfei, gathering medians would have basically the same problem of how to deal with two consecutive performance changes. Another issue is that it would be great to have an algorithm where you can enter your desired confidence interval for the detections as an input. This way we might be able to handle all benchmarks generically, regardless of how noisy they are. In your proposal, if you set a 5% threshold on the median comparison, the noisiest benchmarks might still give too many false positives, while for less noisy benchmarks we would be unnecessarily ignoring small regressions.
> > > >
> > > > I was also informed about some tooling created exactly for detecting performance regressions from benchmark results:
> > > >
> > > > > fork of Hunter - a perf change detection tool, originally from DataStax:
> > > > > Blog post - https://medium.com/building-the-open-data-stack/detecting-performance-regressions-with-datastax-hunter-c22dc444aea4
> > > > > Paper - https://arxiv.org/pdf/2301.03034.pdf
> > > > > Our fork - https://github.com/gerrrr/hunter
> > > >
> > > > The algorithm that's used underneath, "E-divisive Means", sounds promising.
> > > >
> > > > Best,
> > > > Piotrek
> > > >
> > > > On Tue, Jan 31, 2023 at 09:56 Yanfei Lei <fredia...@gmail.com> wrote:
> > > > >
> > > > > Thanks for all the feedback and suggestions.
> > > > >
> > > > > @Piotr:
> > > > > >> I was setting the priority to a blocker and I would propose to add this to the instructions and general convention.
> > > > >
> > > > > Thanks for sharing your experience, I will add this to the document. And your suggestion of leveraging the two-sample Kolmogorov-Smirnov test to detect regressions in FLINK-29825[1] is a great idea; I'm curious about two things:
> > > > >
> > > > > (1) It's not easy to construct the first distribution: "all latest samples, from the latest one (1st one), until N'th latest sample". If some regressions have occurred before, they will "distort" the first distribution, possibly leading to false positives; meanwhile, an optimization can also cause such "distortion", which possibly leads to false negatives (this is relatively acceptable). Maybe we can filter out outliers to avoid this problem.
> > > > >
> > > > > (2) Also, the Kolmogorov-Smirnov statistic is an absolute value, so it alone cannot tell whether a detected change is an improvement or a regression.
> > > > > My plan is/was to use the error to adjust each threshold: each benchmark run has a "Score Error (99.9%)" value in the result[2], so let threshold = 0.75 * (max error of the N latest samples) and keep the rest of the logic the same as in the existing detection script.
> > > > >
> > > > > @Dong Lin:
> > > > > The current detection script[3] is similar to the average-throughput-deviation method you described. We use the median of the last 100 days as a baseline and compare it with the median of the last 20 days, so this script may delay regression alerts. Detecting by deviation is a good approach; each benchmark run already has a "Score Error (99.9%)" value in the result, so maybe we can also factor the "Score Error (99.9%)" into the deviation.
> > > > >
> > > > > In addition, because regressions/improvements do occur, the median of the baseline is itself unstable. I think RobustScaler[4] from Flink ML is very suitable here; maybe RobustScaler could be used to calculate the median of the baseline.
> > > > >
> > > > > The problem faced by this automatic detection tool is very similar to novelty and outlier detection, so maybe some methods from scikit-learn[5] could be considered.
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/FLINK-29825?focusedCommentId=17679077&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17679077
> > > > > [2] http://codespeed.dak8s.net:8080/job/flink-benchmark-request/lastSuccessfulBuild/artifact/jmh-result.csv/*view*/
> > > > > [3] https://github.com/apache/flink-benchmarks/blob/master/regression_report.py
> > > > > [4] https://nightlies.apache.org/flink/flink-ml-docs-master/docs/operators/feature/robustscaler/
> > > > > [5] https://scikit-learn.org/stable/modules/outlier_detection.html
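[A rough sketch of the error-adjusted threshold idea above: compare the median of a recent window against the median of the older runs, and derive the alert threshold from the JMH "Score Error (99.9%)" values. The 20-run window, the 0.75 factor, the benchmark name, and all numbers are illustrative assumptions; this is not the logic of the existing regression_report.py script:]

    import statistics

    def median_regression_check(history, score_errors, window=20, factor=0.75):
        """history: {benchmark: scores, oldest to newest}.
        score_errors: {benchmark: "Score Error (99.9%)" values of the latest runs}.
        Flags a benchmark when the median of the last `window` runs falls below
        the median of the older runs by more than factor * max(score error)."""
        alerts = []
        for name, scores in history.items():
            baseline = statistics.median(scores[:-window])
            recent = statistics.median(scores[-window:])
            threshold = factor * max(score_errors[name])
            if baseline - recent > threshold:
                alerts.append((name, baseline, recent, threshold))
        return alerts

    # Hypothetical benchmark whose score drops ~6% over its last 20 runs.
    history = {"tumblingWindow": [500 + i % 5 for i in range(80)] + [470 + i % 5 for i in range(20)]}
    score_errors = {"tumblingWindow": [12.0, 15.0, 11.0]}
    print(median_regression_check(history, score_errors))   # one alert for "tumblingWindow"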
> > > > >
> > > > > On Tue, Jan 31, 2023 at 10:23 Dong Lin <lindon...@gmail.com> wrote:
> > > > > >
> > > > > > Hi Piotr,
> > > > > >
> > > > > > Yes, the challenge of developing such an automatic tool is indeed to handle noise and achieve a balance between false positives and false negatives. It is great to know that we already have scripts that can access historical benchmark data and generate alerts.
> > > > > >
> > > > > > There are some heuristics we can use to considerably reduce false-positive alerts. For example, we can measure the average-throughput-deviation of a benchmark over 5 runs and suppress alerts only for benchmarks whose deviation is too high (i.e. the benchmark is too noisy and requires tuning or removal). And we can compare the deviation-from-last-commit with the average-throughput-deviation of the benchmark and report an error if these two values differ too much (i.e. the performance regression is observably higher than the benchmark's typical noise).
> > > > > >
> > > > > > I implemented this benchmark framework <https://github.com/tensorflow/benchmarks/tree/master/perfzero> for TensorFlow three years ago and re-designed a regression detection algorithm internally at Google using the heuristics described above. I recall that the script worked reasonably well and most of the regression alerts were actionable. Maybe I can take a shot at implementing a similar algorithm for the Flink benchmarks sometime later and see how it works.
> > > > > >
> > > > > > Anyway, the development of the algorithm is probably not the focus of this topic. +1 for incorporating performance regression monitoring into the routine process.
> > > > > >
> > > > > > Best,
> > > > > > Dong
> > > > > >
> > > > > > On Mon, Jan 30, 2023 at 10:56 PM Piotr Nowojski <pnowoj...@apache.org> wrote:
> > > > > > >
> > > > > > > Hi Dong,
> > > > > > >
> > > > > > > The main issue with an automatic tool at the moment is that some benchmarks are quite noisy and performance regressions are often within the noise of a given benchmark. Our existing tooling cannot handle those cases. Until we address this issue, I think it will have to remain a manual process. There is a ticket mentioned by Yuan [1] where I have written a comment and a proposal on how to improve the automatic performance regression detection.
> > > > > > >
> > > > > > > Best,
> > > > > > > Piotrek
> > > > > > >
> > > > > > > [1] https://issues.apache.org/jira/browse/FLINK-29825?focusedCommentId=17679077&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17679077
> > > > > > >
> > > > > > > On Mon, Jan 30, 2023 at 15:31 Dong Lin <lindon...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Hi Yanfei,
> > > > > > > >
> > > > > > > > Thanks for driving the benchmark monitoring effort! The Google doc and the community wiki look pretty good.
> > > > > > > >
> > > > > > > > According to Yuan's comment, it seems that we currently watch the benchmark results manually to detect regressions. Have we considered automating this process by e.g. exporting the nightly benchmark results to a database and using scripts to detect regressions based on pre-defined rules?
> > > > > > > >
> > > > > > > > This approach is probably more scalable and accurate in the long term. And I had a good experience working with such a regression detection tool in my past job.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Dong
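[A rough sketch of one such pre-defined rule, following the noise-aware deviation heuristic Dong describes above: skip benchmarks that are historically too noisy, and alert only when the drop caused by the last commit clearly exceeds the benchmark's typical noise. The noise limit, the factor of 2, and all sample numbers are illustrative assumptions:]

    import statistics

    def noise_aware_check(history_means, last_commit_scores, noise_limit=0.10, factor=2.0):
        """history_means: mean scores of previous runs of one benchmark.
        last_commit_scores: scores measured for the current commit (e.g. 5 runs)."""
        hist_mean = statistics.mean(history_means)
        rel_noise = statistics.stdev(history_means) / hist_mean
        if rel_noise > noise_limit:
            return "too-noisy"                      # candidate for tuning or removal
        drop = (hist_mean - statistics.mean(last_commit_scores)) / hist_mean
        if drop > factor * rel_noise:
            return "regression"                     # drop clearly above typical noise
        return "ok"

    # Made-up numbers: a stable benchmark that becomes ~7% slower after a commit.
    history = [1000, 1010, 995, 1005, 990]          # ops/ms of previous runs
    current = [930, 940, 925, 935, 938]
    print(noise_aware_check(history, current))      # prints "regression"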
> > > > > > > >
> > > > > > > > On Thu, Jan 19, 2023 at 4:02 PM Yanfei Lei <fredia...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi devs,
> > > > > > > > >
> > > > > > > > > I'd like to start a discussion about incorporating performance regression monitoring into the routine process. Flink benchmarks are periodically executed on http://codespeed.dak8s.net:8080 to monitor Flink performance. In late Oct'22, a new Slack channel #flink-dev-benchmarks was created for notifications of performance regressions. It helped us find 2 build failures[1,2] and 5 performance regressions[3,4,5,6,7] in the past 3 months, which is very valuable for ensuring the quality of the code.
> > > > > > > > >
> > > > > > > > > Some release managers (cc @Matthias, @Martijn, @Qingsheng) are proposing to incorporate performance regression monitoring into the release management. I think this makes sense for performance stability (just like CI stability): almost every release has some tickets about performance optimizations, and performance monitoring can effectively prevent performance regressions and track the performance improvements of each release. So I am starting this discussion to pick everyone's brain for suggestions.
> > > > > > > > >
> > > > > > > > > In the past, I checked the Slack notifications once a week, and I have summarized a draft[8] (https://docs.google.com/document/d/1jTTJHoCTf8_LAjviyAY3Fi7p-tYtl_zw7rJKV4V6T_c/edit?usp=sharing) on how to deal with performance regressions, based on some contributors' experience and my own. If the above proposal is considered acceptable, I'd like to put it in the community wiki[9].
> > > > > > > > >
> > > > > > > > > Looking forward to your feedback!
> > > > > > > > >
> > > > > > > > > [1] https://issues.apache.org/jira/browse/FLINK-29883
> > > > > > > > > [2] https://issues.apache.org/jira/browse/FLINK-30015
> > > > > > > > > [3] https://issues.apache.org/jira/browse/FLINK-29886
> > > > > > > > > [4] https://issues.apache.org/jira/browse/FLINK-30181
> > > > > > > > > [5] https://issues.apache.org/jira/browse/FLINK-30623
> > > > > > > > > [6] https://issues.apache.org/jira/browse/FLINK-30624
> > > > > > > > > [7] https://issues.apache.org/jira/browse/FLINK-30625
> > > > > > > > > [8] https://docs.google.com/document/d/1jTTJHoCTf8_LAjviyAY3Fi7p-tYtl_zw7rJKV4V6T_c/edit?usp=sharing
> > > > > > > > > [9] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=115511847
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Yanfei
> > > > >
> > > > > --
> > > > > Best,
> > > > > Yanfei