Hi all,

+1 for the overall proposal. My feedback matches what Matthias has already provided earlier.
Best regards,
Martijn

On Tue, Jan 31, 2023 at 19:01 Matthias Pohl <matthias.p...@aiven.io.invalid> wrote:
>
> Thanks for the effort you put into this discussion, Yanfei.
>
> For me, the regression tests are similar to test instabilities. In this sense, I agree with what Piotr and Jing said about it:
> - A regression needs to be identified and fixed as soon as possible, to avoid the change affecting other contributions (e.g. hiding other regressions) and becoming harder to revert.
> - Contributors should be made aware of regressions caused by their commits during daily development. They should be enabled to proactively react to instabilities/regressions caused by their contributions. The available Slack channel and ML lists are good tools for enabling contributors to be proactive.
>
> My experience is that contributors are quite proactive when it comes to resolving test instabilities on master. Still, a dedicated role for watching over CI is necessary to identify issues that slipped through. The same applies to performance tests. Even so, I could imagine that for performance tests the proactiveness of contributors is lower, because quite a few changes simply do not affect performance. One idea to raise awareness might be to mention the performance tests in the PR template (maybe next to some of the yes/no questions, where a "yes" might indicate that performance is affected).
>
> About making the monitoring of performance tests part of the release management: right now, watching CI for test instabilities and pinging contributors on issues is already a task for release managers. Extending this responsibility to also cover the regression tests seems natural.
>
> Yanfei's write-up is a good proposal for general performance test guidelines. It will help contributors and release managers alike. We could integrate it into the release testing documentation [1]. I'm wondering whether we would want to add a new Jira label to group these kinds of Jira issues, analogously to test-instabilities [2].
>
> That said, I want to repeat the idea of organizing the release management documentation in a way that we have not only general release managers with a bunch of different tasks but also dedicated roles within the release management. Roles I could imagine, based on our experience from past releases, are:
> - CI monitoring: watching the #builds Slack channel for test instabilities
> - Regression test monitoring: watching #flink-dev-benchmarks for regressions
> - Coordination: announcements, documentation, organizational work (e.g. the release call)
>
> Having dedicated roles might help find volunteers for the release management. Volunteers could sign up for a dedicated role, and each role would provide a clear set of tasks/responsibilities. The roles don't restrict us in any way, because multiple roles can also be fulfilled by a single person. It's just about cutting the tasks into smaller chunks to make them less overwhelming.
>
> Matthias
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/Release+Management+and+Feature+Plan
> [2] https://cwiki.apache.org/confluence/display/FLINK/Flink+Jira+Process
>
> On Tue, Jan 31, 2023 at 5:16 PM Jing Ge <j...@ververica.com.invalid> wrote:
>
> > Hi Yanfei,
> >
> > Thanks for your proposal and for driving this effort.
> >
> > I really like the document on how to deal with performance regressions. It will coach more developers to be able to work on them.
> > I would suggest making more developers aware of performance regressions during daily development, not only release managers. A document like the one you drafted is the key to getting them involved. It would be great if the document could be extended to cover all types of regressions, case by case. It would be a one-time effort with a long-term benefit.
> >
> > Best regards,
> > Jing
> >
> > On Tue, Jan 31, 2023 at 12:10 PM Piotr Nowojski <pnowoj...@apache.org> wrote:
> >
> > > Hi,
> > >
> > > @Dong, I think the drawback of your proposal is that it wouldn't detect a visible performance regression that lies within the benchmark noise, though this should be doable with a large enough number of samples.
> > >
> > > @Yanfei, gathering medians would have basically the same problem with how to deal with two consecutive performance changes. Another issue is that I think it would be great to have an algorithm where you can enter your desired confidence interval for the detections as the input. This way we might be able to handle all benchmarks generically, regardless of how noisy they are. In your proposal, if you set the threshold on the median comparison to 5%, the noisiest benchmarks might still produce too many false positives, while for the less noisy benchmarks we would be unnecessarily ignoring small regressions.
> > >
> > > I was also informed about some tooling created exactly for detecting performance regressions from benchmark results:
> > >
> > > > fork of Hunter - a perf change detection tool, originally from DataStax:
> > > > Blog post - https://medium.com/building-the-open-data-stack/detecting-performance-regressions-with-datastax-hunter-c22dc444aea4
> > > > Paper - https://arxiv.org/pdf/2301.03034.pdf
> > > > Our fork - https://github.com/gerrrr/hunter
> > >
> > > The algorithm used underneath, "E-divisive Means", sounds promising.
> > >
> > > Best,
> > > Piotrek
> > >
> > > On Tue, Jan 31, 2023 at 09:56 Yanfei Lei <fredia...@gmail.com> wrote:
> > >
> > > > Thanks for all the feedback and suggestions.
> > > >
> > > > @Piotr:
> > > > >> I was setting the priority to a blocker and I would propose to add this to the instructions and general convention.
> > > >
> > > > Thanks for sharing your experience, I will add this to the document. And your suggestion in FLINK-29825 [1] of leveraging a two-sample Kolmogorov-Smirnov test to detect regressions is a great idea. I'm curious about two things:
> > > >
> > > > (1) It's not easy to construct the first distribution ("all latest samples, from the latest one (1st one) until the N'th latest sample"). If some regressions occurred earlier, they will "distort" the first distribution, possibly leading to false positives; meanwhile, optimizations can also cause such "distortion", possibly leading to false negatives, which is relatively acceptable. Maybe we can filter out outliers to avoid this problem.
> > > >
> > > > (2) Also, the Kolmogorov-Smirnov statistic is an absolute value, so on its own it cannot tell whether a change is an improvement or a regression.
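> > > > For illustration, here is a minimal sketch of how such a check could look (the window sizes, significance level, and names below are placeholder assumptions, not a settled design), using scipy's two-sample KS test plus a median comparison to recover the direction of a change:
> > > >
> > > > import numpy as np
> > > > from scipy.stats import ks_2samp
> > > >
> > > > def classify_change(scores, recent_n=20, baseline_n=100, alpha=0.001):
> > > >     """scores: per-run benchmark scores, oldest first (hypothetical input)."""
> > > >     recent = np.asarray(scores[-recent_n:])
> > > >     baseline = np.asarray(scores[-(recent_n + baseline_n):-recent_n])
> > > >     statistic, p_value = ks_2samp(recent, baseline)
> > > >     if p_value >= alpha:
> > > >         return "no significant change"
> > > >     # The KS statistic itself is signless, so compare medians to tell
> > > >     # an improvement (higher score) from a regression (lower score).
> > > >     if np.median(recent) < np.median(baseline):
> > > >         return "regression"
> > > >     return "improvement"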
> > > >
> > > > My plan is/was to use the error to adjust each threshold: each benchmark run has a "Score Error (99.9%)" value in the result [2], so let threshold = 0.75 * (max error of the N latest samples), and keep the rest of the logic the same as in the existing detection script.
> > > >
> > > > @Dong Lin:
> > > > The current detection script [3] is similar to the average-throughput-deviation method you described. We use the median of the last 100 days as a baseline and compare it against the median of the last 20 days, so this script may delay the regression alerts. Detecting by deviation is a good approach; currently each benchmark run has a "Score Error (99.9%)" value in the result, so maybe we can also factor that error into the deviation.
> > > >
> > > > In addition, because regressions/improvements do occur, the median of the baseline is also unstable. I think the RobustScaler [4] of Flink ML is very suitable here; maybe it can be used to calculate the median of the baseline. The problem faced by this automatic detection tool is very similar to novelty and outlier detection problems, so maybe some methods in scikit-learn [5] could be considered.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/FLINK-29825?focusedCommentId=17679077&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17679077
> > > > [2] http://codespeed.dak8s.net:8080/job/flink-benchmark-request/lastSuccessfulBuild/artifact/jmh-result.csv/*view*/
> > > > [3] https://github.com/apache/flink-benchmarks/blob/master/regression_report.py
> > > > [4] https://nightlies.apache.org/flink/flink-ml-docs-master/docs/operators/feature/robustscaler/
> > > > [5] https://scikit-learn.org/stable/modules/outlier_detection.html
> > > >
> > > > On Tue, Jan 31, 2023 at 10:23 Dong Lin <lindon...@gmail.com> wrote:
> > > > >
> > > > > Hi Piotr,
> > > > >
> > > > > Yes, the challenge of developing such an automatic tool is indeed to handle noise and achieve a balance between false positives and false negatives. It is great to know that we already have scripts that can access historical benchmark data and generate alerts.
> > > > >
> > > > > There are some heuristics we can use to considerably reduce false-positive alerts. For example, we can measure a benchmark's average throughput deviation over 5 runs and suppress alerts for benchmarks whose deviation is too high (i.e. the benchmark is too noisy and requires tuning or removal). And we can compare the deviation-from-last-commit with the average throughput deviation of the benchmark and report an error if the two values differ too much (i.e. the performance regression is observably larger than the benchmark's typical noise).
> > > > >
> > > > > I implemented this benchmark framework <https://github.com/tensorflow/benchmarks/tree/master/perfzero> for TensorFlow three years ago and re-designed a regression detection algorithm internally at Google using the heuristics described above. I recall that the script worked reasonably well and most of the regression alerts were actionable.
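> > > > > As a rough sketch, the heuristic could look like this (the thresholds below are illustrative assumptions, not the values from that internal tool):
> > > > >
> > > > > import statistics
> > > > >
> > > > > def check_commit(history, latest, max_rel_noise=0.10, sigma=3.0):
> > > > >     """history: mean throughputs of recent commits; latest: the new commit's mean."""
> > > > >     mean = statistics.mean(history)
> > > > >     stdev = statistics.stdev(history)
> > > > >     if stdev / mean > max_rel_noise:
> > > > >         # Too noisy to alert on; flag the benchmark for tuning or removal.
> > > > >         return "too noisy"
> > > > >     # Alert only if the drop is observably larger than the typical noise.
> > > > >     if (mean - latest) > sigma * stdev:
> > > > >         return "regression"
> > > > >     return "ok"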
> > > > >
> > > > > Maybe I can take a shot at implementing a similar algorithm for the Flink benchmarks sometime later and see how it works.
> > > > >
> > > > > Anyway, the development of the algorithm is probably not the focus of this topic. +1 for incorporating performance regression monitoring into the routine process.
> > > > >
> > > > > Best,
> > > > > Dong
> > > > >
> > > > > On Mon, Jan 30, 2023 at 10:56 PM Piotr Nowojski <pnowoj...@apache.org> wrote:
> > > > >
> > > > > > Hi Dong,
> > > > > >
> > > > > > The main issue with an automatic tool at the moment is that some benchmarks are quite noisy and performance regressions are often within the noise of a given benchmark. Our existing tooling cannot handle those cases. Until we address this issue, I think it will have to remain a manual process. There is a ticket mentioned by Yuan [1] where I have written a comment and a proposal on how to improve the automatic performance regression detection.
> > > > > >
> > > > > > Best,
> > > > > > Piotrek
> > > > > >
> > > > > > [1] https://issues.apache.org/jira/browse/FLINK-29825?focusedCommentId=17679077&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17679077
> > > > > >
> > > > > > On Mon, Jan 30, 2023 at 15:31 Dong Lin <lindon...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Yanfei,
> > > > > > >
> > > > > > > Thanks for driving the benchmark monitoring effort! The Google doc and the community wiki look pretty good.
> > > > > > >
> > > > > > > According to Yuan's comment, it seems that we currently watch the benchmark results manually to detect regressions. Have we considered automating this process by, e.g., exporting the nightly benchmark results to a database and using scripts to detect regressions based on pre-defined rules?
> > > > > > >
> > > > > > > This approach is probably more scalable and accurate in the long term. And I had a good experience working with such a regression detection tool in my past job.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Dong
> > > > > > >
> > > > > > > On Thu, Jan 19, 2023 at 4:02 PM Yanfei Lei <fredia...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi devs,
> > > > > > > >
> > > > > > > > I'd like to start a discussion about incorporating performance regression monitoring into the routine process. Flink benchmarks are periodically executed on http://codespeed.dak8s.net:8080 to monitor Flink performance. In late Oct '22, a new Slack channel #flink-dev-benchmarks was created for notifications of performance regressions. It helped us find 2 build failures [1,2] and 5 performance regressions [3,4,5,6,7] in the past 3 months, which is very valuable for ensuring the quality of the code.
> > > > > > > > Some release managers (cc @Matthias, @Martijn, @Qingsheng) have proposed incorporating performance regression monitoring into the release management. I think this makes sense for performance stability (like CI stability): since almost every release has some tickets about performance optimizations, performance monitoring can effectively prevent performance regressions and track the performance improvements of each release. So I'm starting this discussion to pick everyone's brain for suggestions.
> > > > > > > >
> > > > > > > > In the past, I checked the Slack notifications once a week, and I have summarized a draft [8] (https://docs.google.com/document/d/1jTTJHoCTf8_LAjviyAY3Fi7p-tYtl_zw7rJKV4V6T_c/edit?usp=sharing) on how to deal with performance regressions, based on some contributors' experience and my own. If the above proposal is considered acceptable, I'd like to put it into the community wiki [9].
> > > > > > > >
> > > > > > > > Looking forward to your feedback!
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/FLINK-29883
> > > > > > > > [2] https://issues.apache.org/jira/browse/FLINK-30015
> > > > > > > > [3] https://issues.apache.org/jira/browse/FLINK-29886
> > > > > > > > [4] https://issues.apache.org/jira/browse/FLINK-30181
> > > > > > > > [5] https://issues.apache.org/jira/browse/FLINK-30623
> > > > > > > > [6] https://issues.apache.org/jira/browse/FLINK-30624
> > > > > > > > [7] https://issues.apache.org/jira/browse/FLINK-30625
> > > > > > > > [8] https://docs.google.com/document/d/1jTTJHoCTf8_LAjviyAY3Fi7p-tYtl_zw7rJKV4V6T_c/edit?usp=sharing
> > > > > > > > [9] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=115511847
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Yanfei
> > > >
> > > > --
> > > > Best,
> > > > Yanfei