Thanks for the effort you put into this discussion, Yanfei.

For me, performance regressions are similar to test instabilities. In this sense, I agree with what Piotr and Jing said about it:
- A regression needs to be identified and fixed as soon as possible to avoid the change affecting other contributions (e.g. hiding other regressions) and making it harder to revert.
- Contributors should be made aware of regressions caused by their commits during daily development. They should be enabled to proactively react to instabilities/regressions introduced by their contributions. The available Slack channel and mailing lists are good tools to enable contributors to be proactive.
My experience is that contributors are quite proactive when it comes to resolving test instabilities on master. Still, a dedicated role for watching over CI is necessary to identify issues that slipped through. The same applies to performance tests, although I could imagine that contributors are less proactive there because quite a few changes simply do not affect performance. One idea to raise awareness might be to mention the performance tests in the PR template (maybe next to some of the yes/no questions, where a yes might indicate that performance is affected).

About making the monitoring of performance tests part of the release management: Right now, watching CI for test instabilities and pinging contributors on issues is already a task for release managers. Extending this responsibility to also check the regression tests seems natural. Yanfei's write-up is a good proposal for general performance test guidelines. It will help contributors and release managers alike. We could integrate it into the release testing documentation [1]. I'm wondering whether we would want to add a new Jira label to group these kinds of Jira issues, analogously to test instabilities [2].

That said, I want to repeat the idea of organizing the release management documentation in a way that we have not only general release managers with a bunch of different tasks but also dedicated roles within the release management. Roles I could imagine based on our experience from past releases are:
- CI monitoring: watching the #builds Slack channel for test instabilities
- Regression test monitoring: watching #flink-dev-benchmarks for regressions
- Coordination: announcements, documentation, organizational work (e.g. the release call)

Having dedicated roles might make it easier to find volunteers for the release management. Volunteers could sign up for a dedicated role, and each role would provide a clear set of tasks/responsibilities. This doesn't restrict us in any way because multiple roles can also be fulfilled by a single person. It's just about cutting the tasks into smaller chunks to make them less overwhelming.

Matthias

[1] https://cwiki.apache.org/confluence/display/FLINK/Release+Management+and+Feature+Plan
[2] https://cwiki.apache.org/confluence/display/FLINK/Flink+Jira+Process

On Tue, Jan 31, 2023 at 5:16 PM Jing Ge <j...@ververica.com.invalid> wrote:

> Hi Yanfei,
>
> Thanks for your proposal and the effort of driving it.
>
> I really like the document on how to deal with performance regressions. It will coach more developers to be able to work with them. I would suggest making more developers aware of performance regressions during daily development, not only release managers. A document like the one you drafted is the key to getting them involved. It would be great if the document could be extended to cover all types of regressions, case by case. It would be a one-time effort with a long-term benefit.
>
> Best regards,
> Jing
>
> On Tue, Jan 31, 2023 at 12:10 PM Piotr Nowojski <pnowoj...@apache.org> wrote:
>
> > Hi,
> >
> > @Dong, I think the drawback of your proposal is that it wouldn't detect whether there is a visible performance regression within the benchmark noise, while this should be doable with a large enough number of samples.
> >
> > @Yanfei, Gathering medians would have basically the same problems with how to deal with two consecutive performance changes.
> > Another issue is that I think it would be great to have an algorithm where you can enter your desired confidence interval for the detections as an input. This way we might be able to handle all benchmarks generically, regardless of how noisy they are. In your proposal, if you set the threshold to 5% on the median comparison, the noisiest benchmarks might still give too many false positives, while for less noisy benchmarks we would be unnecessarily ignoring small regressions.
> >
> > I was also informed about some tooling created exactly for detecting performance regressions from benchmark results:
> >
> > > Fork of Hunter - a perf change detection tool, originally from DataStax:
> > > Blog post - https://medium.com/building-the-open-data-stack/detecting-performance-regressions-with-datastax-hunter-c22dc444aea4
> > > Paper - https://arxiv.org/pdf/2301.03034.pdf
> > > Our fork - https://github.com/gerrrr/hunter
> >
> > The algorithm used underneath, "E-divisive Means", sounds promising.
> >
> > Best,
> > Piotrek
> >
> > On Tue, Jan 31, 2023 at 09:56 Yanfei Lei <fredia...@gmail.com> wrote:
> >
> > > Thanks for all the feedback and suggestions.
> > >
> > > @Piotr:
> > > >> I was setting the priority to a blocker and I would propose to add this to the instructions and general convention.
> > >
> > > Thanks for sharing your experience, I will add this to the document. And your suggestion of leveraging a two-sample Kolmogorov-Smirnov test to detect regressions in FLINK-29825[1] is a great idea. I'm curious about two things:
> > >
> > > (1) It's not easy to construct the first distribution: "all latest samples, from the latest one (1st one), until N'th latest sample". If some regressions have occurred before, they will "distort" the first distribution, possibly leading to false positives; meanwhile, an optimization can also cause "distortion", which possibly leads to false negatives (that is relatively acceptable). Maybe we can filter out outliers to avoid this problem.
> > >
> > > (2) Also, the Kolmogorov-Smirnov statistic is an absolute value, so it cannot tell whether a change is an improvement or a regression.
> > >
> > > My plan is/was to use the error to adjust each threshold: each benchmark run has a "Score Error (99.9%)" value in the result[2], so let threshold = 0.75 * (max error of the N latest samples), with the other logic the same as in the existing detection script.
> > >
> > > @Dong Lin:
> > > The current detection script[3] is similar to the average-throughput-deviation method you described. We use the median of the last 100 days as the baseline and then compare it with the median of the last 20 days, so this script may delay regression alerts. Detecting by deviation is a good approach; currently each benchmark run has a "Score Error (99.9%)" value in the result, so maybe we can also factor the "Score Error (99.9%)" into the deviation.
> > >
> > > In addition, because regressions/improvements do occur, the median of the baseline is also unstable. I think the RobustScaler[4] of Flink ML is very suitable here; maybe RobustScaler can be used to calculate the median of the baseline.
> > > The problem faced by this automatic detection tool is very similar to novelty and outlier detection problems, so maybe some methods from Sklearn[5] can be considered.
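> > > For illustration only, a rough, untested sketch of how the two-sample comparison could look (assuming scipy is available; the function and variable names here are made up, this is not the existing script):
> > >
> > >     # Sketch: compare a baseline window of per-run scores with the most
> > >     # recent window. The KS test flags a distribution change, and the
> > >     # medians decide whether it looks like a regression or an improvement.
> > >     from statistics import median
> > >     from scipy.stats import ks_2samp
> > >
> > >     def classify_change(baseline_scores, recent_scores, alpha=0.01):
> > >         result = ks_2samp(baseline_scores, recent_scores)
> > >         if result.pvalue >= alpha:
> > >             return "no significant change"
> > >         if median(recent_scores) < median(baseline_scores):
> > >             return "possible regression"
> > >         return "possible improvement"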
> > >
> > > [1] https://issues.apache.org/jira/browse/FLINK-29825?focusedCommentId=17679077&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17679077
> > > [2] http://codespeed.dak8s.net:8080/job/flink-benchmark-request/lastSuccessfulBuild/artifact/jmh-result.csv/*view*/
> > > [3] https://github.com/apache/flink-benchmarks/blob/master/regression_report.py
> > > [4] https://nightlies.apache.org/flink/flink-ml-docs-master/docs/operators/feature/robustscaler/
> > > [5] https://scikit-learn.org/stable/modules/outlier_detection.html
> > >
> > > On Tue, Jan 31, 2023 at 10:23 Dong Lin <lindon...@gmail.com> wrote:
> > > >
> > > > Hi Piotr,
> > > >
> > > > Yes, the challenge of developing such an automatic tool is indeed to handle noise and achieve a balance between false positives and false negatives. It is great to know that we already have scripts that can access historical benchmark data and generate alerts.
> > > >
> > > > There are some heuristics we can use to considerably reduce false positive alerts. For example, we can measure the average-throughput-deviation of a benchmark over 5 runs and only suppress alerts for benchmarks whose deviation is too high (i.e. the benchmark is too noisy and requires tuning or removal). And we can compare the deviation-from-last-commit with the average-throughput-deviation of this benchmark and report an error if these two values differ too much (i.e. the performance regression is observably higher than the benchmark's typical noise).
> > > >
> > > > I implemented this benchmark framework <https://github.com/tensorflow/benchmarks/tree/master/perfzero> for TensorFlow three years ago and re-designed a regression detection algorithm internally at Google using the heuristics described above. I recall that the script worked reasonably well and most of the regression alerts were actionable. Maybe I can take a shot at implementing a similar algorithm for the Flink benchmarks sometime later and see how it works.
> > > >
> > > > Anyway, the development of the algorithm is probably not the focus of this topic. +1 for incorporating performance regression monitoring into the routine process.
> > > >
> > > > Best,
> > > > Dong
> > > >
> > > > On Mon, Jan 30, 2023 at 10:56 PM Piotr Nowojski <pnowoj...@apache.org> wrote:
> > > >
> > > > > Hi Dong,
> > > > >
> > > > > The main issue with an automatic tool at the moment is that some benchmarks are quite noisy and performance regressions are often within the noise of a given benchmark. Our currently existing tooling cannot handle those cases. Until we address this issue, I think it will have to remain a manual process. There is a ticket mentioned by Yuan [1] where I have written a comment and a proposal on how to improve the automatic performance regression detection.
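> > > > > Just to make the noise problem concrete, a rough, untested sketch of the kind of noise-aware check such tooling would need (purely illustrative; all names are made up and this is not the current tooling):
> > > > >
> > > > >     # Sketch: only flag a drop that is clearly larger than the
> > > > >     # benchmark's typical run-to-run deviation.
> > > > >     from statistics import mean, stdev
> > > > >
> > > > >     def looks_like_regression(recent_build_means, current_mean, noise_factor=3.0):
> > > > >         baseline = mean(recent_build_means)
> > > > >         noise = stdev(recent_build_means)   # typical run-to-run deviation
> > > > >         drop = baseline - current_mean      # positive if throughput went down
> > > > >         return drop > noise_factor * noise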
> > > > >
> > > > > Best,
> > > > > Piotrek
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/FLINK-29825?focusedCommentId=17679077&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17679077
> > > > >
> > > > > On Mon, Jan 30, 2023 at 15:31 Dong Lin <lindon...@gmail.com> wrote:
> > > > >
> > > > > > Hi Yanfei,
> > > > > >
> > > > > > Thanks for driving the benchmark monitoring effort! The Google doc and the community wiki look pretty good.
> > > > > >
> > > > > > According to Yuan's comment, it seems that we currently watch the benchmark results manually to detect regressions. Have we considered automating this process by e.g. exporting the nightly benchmark results to a database and using scripts to detect regressions based on pre-defined rules?
> > > > > >
> > > > > > This approach is probably more scalable and accurate in the long term. And I had a good experience working with such a regression detection tool in my past job.
> > > > > >
> > > > > > Thanks,
> > > > > > Dong
> > > > > >
> > > > > > On Thu, Jan 19, 2023 at 4:02 PM Yanfei Lei <fredia...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi devs,
> > > > > > >
> > > > > > > I'd like to start a discussion about incorporating performance regression monitoring into the routine process. Flink benchmarks are periodically executed on http://codespeed.dak8s.net:8080 to monitor Flink performance. In late Oct '22, a new Slack channel #flink-dev-benchmarks was created for notifications of performance regressions. It helped us find 2 build failures[1,2] and 5 performance regressions[3,4,5,6,7] in the past 3 months, which is very valuable for ensuring code quality.
> > > > > > >
> > > > > > > Some release managers (cc @Matthias, @Martijn, @Qingsheng) have proposed incorporating performance regression monitoring into the release management. I think it makes sense for performance stability (like CI stability): since almost every release has some tickets about performance optimizations, performance monitoring can effectively avoid performance regressions and track the performance improvements of each release. So I am starting this discussion to pick everyone's brain for suggestions.
> > > > > > >
> > > > > > > In the past, I checked the Slack notifications once a week, and I have summarized a draft[8] (https://docs.google.com/document/d/1jTTJHoCTf8_LAjviyAY3Fi7p-tYtl_zw7rJKV4V6T_c/edit?usp=sharing) on how to deal with performance regressions, based on some contributors' and my own experience. If the above proposal is considered acceptable, I'd like to put it in the community wiki[9].
> > > > > > >
> > > > > > > Looking forward to your feedback!
> > > > > > >
> > > > > > > [1] https://issues.apache.org/jira/browse/FLINK-29883
> > > > > > > [2] https://issues.apache.org/jira/browse/FLINK-30015
> > > > > > > [3] https://issues.apache.org/jira/browse/FLINK-29886
> > > > > > > [4] https://issues.apache.org/jira/browse/FLINK-30181
> > > > > > > [5] https://issues.apache.org/jira/browse/FLINK-30623
> > > > > > > [6] https://issues.apache.org/jira/browse/FLINK-30624
> > > > > > > [7] https://issues.apache.org/jira/browse/FLINK-30625
> > > > > > > [8] https://docs.google.com/document/d/1jTTJHoCTf8_LAjviyAY3Fi7p-tYtl_zw7rJKV4V6T_c/edit?usp=sharing
> > > > > > > [9] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=115511847
> > > > > > >
> > > > > > > Best,
> > > > > > > Yanfei
> > >
> > > --
> > > Best,
> > > Yanfei