Hi all,

+1 for the overall proposal. My feedback matches what Matthias has already provided earlier.
Best regards,
Martijn

On Tue, Jan 31, 2023 at 19:01 Matthias Pohl <matthias.p...@aiven.io.invalid> wrote:
>
> Thanks for the effort you put into this discussion, Yanfei.
>
> For me, the regression tests are similar to test instabilities. In this sense, I agree with what Piotr and Jing said about it:
> - A regression needs to be identified and fixed as soon as possible, to avoid the change affecting other contributions (e.g. hiding other regressions) and becoming harder to revert.
> - Contributors should be made aware of regressions caused by their commits during daily development. They should be enabled to proactively react to instabilities/regressions caused by their contributions. The available Slack channel and ML lists are good tools for enabling contributors to be proactive.
>
> My experience is that contributors are quite proactive when it comes to resolving test instabilities on master. Still, a dedicated role for watching over CI is necessary to identify issues that slipped through. The same applies to performance tests. Even so, I could imagine that for performance tests the proactiveness of contributors is lower, because quite a few changes simply do not affect performance. One idea to raise awareness might be to mention the performance tests in the PR template (maybe next to some of the yes/no questions, where a "yes" might indicate that performance is affected).
>
> About making the monitoring of performance tests part of the release management: right now, watching CI for test instabilities and pinging contributors on issues is already a task for release managers. Extending this responsibility to also cover the regression tests seems natural.
>
> Yanfei's write-up is a good proposal for general performance test guidelines. It will help contributors and release managers alike. We could integrate it into the release testing documentation [1]. I'm wondering whether we would want to add a new Jira label to group these kinds of Jira issues, analogously to test-instabilities [2].
>
> That said, I want to repeat the idea of organizing the release management documentation in a way that we have not only general release managers with a bunch of different tasks but also dedicated roles within the release management. Roles I could imagine, based on our experience from past releases, are:
> - CI monitoring: watching the #builds Slack channel for test instabilities
> - Regression test monitoring: watching #flink-dev-benchmarks for regressions
> - Coordination: announcements, documentation, organizational work (e.g. the release call)
>
> Having dedicated roles might help find volunteers for the release management. Volunteers could sign up for a dedicated role, and each role would provide a clear set of tasks/responsibilities. The roles don't restrict us in any way, because multiple roles can also be fulfilled by a single person. It's just about cutting the tasks into smaller chunks to make them less overwhelming.
>
> Matthias
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/Release+Management+and+Feature+Plan
> [2] https://cwiki.apache.org/confluence/display/FLINK/Flink+Jira+Process
>
> On Tue, Jan 31, 2023 at 5:16 PM Jing Ge <j...@ververica.com.invalid> wrote:
>
> > Hi Yanfei,
> >
> > Thanks for your proposal and for driving this effort.
> >
> > I really like the document on how to deal with performance regressions. It will coach more developers to be able to work on them.
> > I would suggest making more developers aware of performance regressions during daily development, not only release managers. A document like the one you drafted is the key to getting them involved. It would be great if the document could be extended to cover all types of regressions, case by case. It would be a one-time effort with a long-term benefit.
> >
> > Best regards,
> > Jing
> >
> > On Tue, Jan 31, 2023 at 12:10 PM Piotr Nowojski <pnowoj...@apache.org> wrote:
> >
> > > Hi,
> > >
> > > @Dong, I think the drawback of your proposal is that it wouldn't detect a visible performance regression that lies within the benchmark noise, though this should be doable with a large enough number of samples.
> > >
> > > @Yanfei, gathering medians would have basically the same problem with how to deal with two consecutive performance changes. Another issue is that I think it would be great to have an algorithm where you can enter your desired confidence interval for the detections as the input. This way we might be able to handle all benchmarks generically, regardless of how noisy they are. In your proposal, if you set the threshold on the median comparison to 5%, the noisiest benchmarks might still produce too many false positives, while for the less noisy benchmarks we would be unnecessarily ignoring small regressions.
> > >
> > > I was also informed about some tooling created exactly for detecting performance regressions from benchmark results:
> > >
> > > > fork of Hunter - a perf change detection tool, originally from DataStax:
> > > > Blog post - https://medium.com/building-the-open-data-stack/detecting-performance-regressions-with-datastax-hunter-c22dc444aea4
> > > > Paper - https://arxiv.org/pdf/2301.03034.pdf
> > > > Our fork - https://github.com/gerrrr/hunter
> > >
> > > The algorithm used underneath, "E-divisive Means", sounds promising.
> > >
> > > Best,
> > > Piotrek
> > >
> > > On Tue, Jan 31, 2023 at 09:56 Yanfei Lei <fredia...@gmail.com> wrote:
> > >
> > > > Thanks for all the feedback and suggestions.
> > > >
> > > > @Piotr:
> > > > >> I was setting the priority to a blocker and I would propose to add this to the instructions and general convention.
> > > >
> > > > Thanks for sharing your experience, I will add this to the document. And your suggestion in FLINK-29825 [1] of leveraging a two-sample Kolmogorov-Smirnov test to detect regressions is a great idea. I'm curious about two things:
> > > >
> > > > (1) It's not easy to construct the first distribution ("all latest samples, from the latest one (1st one) until the N'th latest sample"). If some regressions occurred earlier, they will "distort" the first distribution, possibly leading to false positives; meanwhile, optimizations can also cause such "distortion", possibly leading to false negatives, which is relatively acceptable. Maybe we can filter out outliers to avoid this problem.
> > > >
> > > > (2) Also, the Kolmogorov-Smirnov statistic is an absolute value, so on its own it cannot tell whether a change is an improvement or a regression.
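> > > > For illustration, here is a minimal sketch of how such a check could look (the window sizes, significance level, and names below are placeholder assumptions, not a settled design), using scipy's two-sample KS test plus a median comparison to recover the direction of a change:
> > > >
> > > > import numpy as np
> > > > from scipy.stats import ks_2samp
> > > >
> > > > def classify_change(scores, recent_n=20, baseline_n=100, alpha=0.001):
> > > >     """scores: per-run benchmark scores, oldest first (hypothetical input)."""
> > > >     recent = np.asarray(scores[-recent_n:])
> > > >     baseline = np.asarray(scores[-(recent_n + baseline_n):-recent_n])
> > > >     statistic, p_value = ks_2samp(recent, baseline)
> > > >     if p_value >= alpha:
> > > >         return "no significant change"
> > > >     # The KS statistic itself is signless, so compare medians to tell
> > > >     # an improvement (higher score) from a regression (lower score).
> > > >     if np.median(recent) < np.median(baseline):
> > > >         return "regression"
> > > >     return "improvement"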
> > > >
> > > > My plan is/was to use the error to adjust each threshold: each benchmark run has a "Score Error (99.9%)" value in the result [2], so let threshold = 0.75 * (max error of the N latest samples), and keep the rest of the logic the same as in the existing detection script.
> > > >
> > > > @Dong Lin:
> > > > The current detection script [3] is similar to the average-throughput-deviation method you described. We use the median of the last 100 days as a baseline and compare it against the median of the last 20 days, so this script may delay the regression alerts. Detecting by deviation is a good approach; currently each benchmark run has a "Score Error (99.9%)" value in the result, so maybe we can also factor that error into the deviation.
> > > >
> > > > In addition, because regressions/improvements do occur, the median of the baseline is also unstable. I think the RobustScaler [4] of Flink ML is very suitable here; maybe it can be used to calculate the median of the baseline. The problem faced by this automatic detection tool is very similar to novelty and outlier detection problems, so maybe some methods in scikit-learn [5] could be considered.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/FLINK-29825?focusedCommentId=17679077&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17679077
> > > > [2] http://codespeed.dak8s.net:8080/job/flink-benchmark-request/lastSuccessfulBuild/artifact/jmh-result.csv/*view*/
> > > > [3] https://github.com/apache/flink-benchmarks/blob/master/regression_report.py
> > > > [4] https://nightlies.apache.org/flink/flink-ml-docs-master/docs/operators/feature/robustscaler/
> > > > [5] https://scikit-learn.org/stable/modules/outlier_detection.html
> > > >
> > > > On Tue, Jan 31, 2023 at 10:23 Dong Lin <lindon...@gmail.com> wrote:
> > > > >
> > > > > Hi Piotr,
> > > > >
> > > > > Yes, the challenge of developing such an automatic tool is indeed to handle noise and achieve a balance between false positives and false negatives. It is great to know that we already have scripts that can access historical benchmark data and generate alerts.
> > > > >
> > > > > There are some heuristics we can use to considerably reduce false-positive alerts. For example, we can measure a benchmark's average throughput deviation over 5 runs and suppress alerts for benchmarks whose deviation is too high (i.e. the benchmark is too noisy and requires tuning or removal). And we can compare the deviation-from-last-commit with the average throughput deviation of the benchmark and report an error if the two values differ too much (i.e. the performance regression is observably larger than the benchmark's typical noise).
> > > > >
> > > > > I implemented this benchmark framework <https://github.com/tensorflow/benchmarks/tree/master/perfzero> for TensorFlow three years ago and re-designed a regression detection algorithm internally at Google using the heuristics described above. I recall that the script worked reasonably well and most of the regression alerts were actionable.
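> > > > > As a rough sketch, the heuristic could look like this (the thresholds below are illustrative assumptions, not the values from that internal tool):
> > > > >
> > > > > import statistics
> > > > >
> > > > > def check_commit(history, latest, max_rel_noise=0.10, sigma=3.0):
> > > > >     """history: mean throughputs of recent commits; latest: the new commit's mean."""
> > > > >     mean = statistics.mean(history)
> > > > >     stdev = statistics.stdev(history)
> > > > >     if stdev / mean > max_rel_noise:
> > > > >         # Too noisy to alert on; flag the benchmark for tuning or removal.
> > > > >         return "too noisy"
> > > > >     # Alert only if the drop is observably larger than the typical noise.
> > > > >     if (mean - latest) > sigma * stdev:
> > > > >         return "regression"
> > > > >     return "ok"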
> > > > >
> > > > > Maybe I can take a shot at implementing a similar algorithm for the Flink benchmarks sometime later and see how it works.
> > > > >
> > > > > Anyway, the development of the algorithm is probably not the focus of this topic. +1 for incorporating performance regression monitoring into the routine process.
> > > > >
> > > > > Best,
> > > > > Dong
> > > > >
> > > > > On Mon, Jan 30, 2023 at 10:56 PM Piotr Nowojski <pnowoj...@apache.org> wrote:
> > > > >
> > > > > > Hi Dong,
> > > > > >
> > > > > > The main issue with an automatic tool at the moment is that some benchmarks are quite noisy and performance regressions are often within the noise of a given benchmark. Our existing tooling cannot handle those cases. Until we address this issue, I think it will have to remain a manual process. There is a ticket mentioned by Yuan [1] where I have written a comment and a proposal on how to improve the automatic performance regression detection.
> > > > > >
> > > > > > Best,
> > > > > > Piotrek
> > > > > >
> > > > > > [1] https://issues.apache.org/jira/browse/FLINK-29825?focusedCommentId=17679077&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17679077
> > > > > >
> > > > > > On Mon, Jan 30, 2023 at 15:31 Dong Lin <lindon...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Yanfei,
> > > > > > >
> > > > > > > Thanks for driving the benchmark monitoring effort! The Google doc and the community wiki look pretty good.
> > > > > > >
> > > > > > > According to Yuan's comment, it seems that we currently watch the benchmark results manually to detect regressions. Have we considered automating this process by, e.g., exporting the nightly benchmark results to a database and using scripts to detect regressions based on pre-defined rules?
> > > > > > >
> > > > > > > This approach is probably more scalable and accurate in the long term. And I had a good experience working with such a regression detection tool in my past job.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Dong
> > > > > > >
> > > > > > > On Thu, Jan 19, 2023 at 4:02 PM Yanfei Lei <fredia...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi devs,
> > > > > > > >
> > > > > > > > I'd like to start a discussion about incorporating performance regression monitoring into the routine process. Flink benchmarks are periodically executed on http://codespeed.dak8s.net:8080 to monitor Flink performance. In late Oct '22, a new Slack channel #flink-dev-benchmarks was created for notifications of performance regressions. It helped us find 2 build failures [1,2] and 5 performance regressions [3,4,5,6,7] in the past 3 months, which is very valuable for ensuring the quality of the code.
> > > > > > > > Some release managers (cc @Matthias, @Martijn, @Qingsheng) have proposed incorporating performance regression monitoring into the release management. I think this makes sense for performance stability (like CI stability): since almost every release has some tickets about performance optimizations, performance monitoring can effectively prevent performance regressions and track the performance improvements of each release. So I'm starting this discussion to pick everyone's brain for suggestions.
> > > > > > > >
> > > > > > > > In the past, I checked the Slack notifications once a week, and I have summarized a draft [8] (https://docs.google.com/document/d/1jTTJHoCTf8_LAjviyAY3Fi7p-tYtl_zw7rJKV4V6T_c/edit?usp=sharing) on how to deal with performance regressions, based on some contributors' experience and my own. If the above proposal is considered acceptable, I'd like to put it into the community wiki [9].
> > > > > > > >
> > > > > > > > Looking forward to your feedback!
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/FLINK-29883
> > > > > > > > [2] https://issues.apache.org/jira/browse/FLINK-30015
> > > > > > > > [3] https://issues.apache.org/jira/browse/FLINK-29886
> > > > > > > > [4] https://issues.apache.org/jira/browse/FLINK-30181
> > > > > > > > [5] https://issues.apache.org/jira/browse/FLINK-30623
> > > > > > > > [6] https://issues.apache.org/jira/browse/FLINK-30624
> > > > > > > > [7] https://issues.apache.org/jira/browse/FLINK-30625
> > > > > > > > [8] https://docs.google.com/document/d/1jTTJHoCTf8_LAjviyAY3Fi7p-tYtl_zw7rJKV4V6T_c/edit?usp=sharing
> > > > > > > > [9] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=115511847
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Yanfei
> > > >
> > > > --
> > > > Best,
> > > > Yanfei