+1
I don't have anything substantive to add, but I want to express how pleased I am to see this conversation happening.
David

On Thu, Feb 2, 2023 at 5:09 AM Martijn Visser <martijnvis...@apache.org> wrote:
> Hi all,
>
> +1 for the overall proposal. My feedback matches what Matthias has already provided earlier.
>
> Best regards,
>
> Martijn
>
> On Tue, Jan 31, 2023 at 19:01 Matthias Pohl <matthias.p...@aiven.io.invalid> wrote:
> >
> > Thanks for the effort you put into this discussion, Yanfei.
> >
> > For me, the regression tests are similar to test instabilities. In this sense, I agree with what Piotr and Jing said about it:
> > - A regression needs to be identified and fixed as soon as possible, to avoid the change affecting other contributions (e.g. hiding other regressions) and making it harder to revert.
> > - Contributors should be made aware of regressions that are caused by their commits during daily development. They should be enabled to react proactively to instabilities/regressions caused by their contributions. The available Slack channel and mailing lists are good tools for enabling contributors to be proactive.
> >
> > My experience is that contributors are quite proactive when it comes to resolving test instabilities on master. Still, a dedicated role for watching over CI is necessary to identify issues that slipped through. The same applies to performance tests. Even so, I could imagine that for performance tests the proactiveness of contributors is lower, because quite a few changes simply do not affect performance. One idea to raise awareness might be to mention the performance tests in the PR template (maybe next to some of the yes/no questions, where a "yes" might indicate that performance is affected).
> >
> > About making monitoring of performance tests part of the release management: right now, watching CI for test instabilities and pinging contributors on issues is already a task for release managers. Extending this responsibility to also cover the regression tests seems natural.
> >
> > Yanfei's write-up is a good proposal for general performance test guidelines. It will help contributors and release managers alike. We could integrate it into the release testing documentation [1]. I'm wondering whether we would want to add a new Jira label to group these kinds of Jira issues, analogously to test-instabilities [2].
> >
> > That said, I want to repeat the idea of organizing the release management documentation in a way that we have not only general release managers with a bunch of different tasks, but dedicated roles within the release management. Roles I could imagine based on our experience from past releases are:
> > - CI monitoring: watching the #builds Slack channel for test instabilities
> > - Regression test monitoring: watching #flink-dev-benchmarks for regressions
> > - Coordination: announcements, documentation, organizational work (e.g. the release call)
> >
> > Having dedicated roles might help find volunteers for the release management. Volunteers could sign up for a dedicated role, and each role would provide a clear set of tasks/responsibilities. This doesn't restrict us in any way because multiple roles can also be fulfilled by a single person; it's just about cutting the tasks into smaller chunks to make them less overwhelming.
> >
> > Matthias
> >
> > [1] https://cwiki.apache.org/confluence/display/FLINK/Release+Management+and+Feature+Plan
> > [2] https://cwiki.apache.org/confluence/display/FLINK/Flink+Jira+Process
> >
> > On Tue, Jan 31, 2023 at 5:16 PM Jing Ge <j...@ververica.com.invalid> wrote:
> > >
> > > Hi Yanfei,
> > >
> > > Thanks for your proposal and for driving it.
> > >
> > > I really like the document on how to deal with performance regressions; it will coach more developers to be able to work on them. I would suggest making more developers aware of performance regressions during daily development, not only release managers. A document like the one you drafted is the key to getting them involved. It would be great if the document could be extended to cover all types of regressions, case by case. It would be a one-time effort with a long-lasting benefit.
> > >
> > > Best regards,
> > > Jing
> > >
> > > On Tue, Jan 31, 2023 at 12:10 PM Piotr Nowojski <pnowoj...@apache.org> wrote:
> > > >
> > > > Hi,
> > > >
> > > > @Dong, I think the drawback of your proposal is that it wouldn't detect a visible performance regression that sits within the benchmark noise, while this should be doable with a large enough number of samples.
> > > >
> > > > @Yanfei, gathering medians would have basically the same problem of how to deal with two consecutive performance changes. Another issue is that it would be great to have an algorithm where you can enter your desired confidence interval for the detections as an input. This way we might be able to handle all benchmarks generically, regardless of how noisy they are. In your proposal, if you set a 5% threshold on the median comparison, the noisiest benchmarks might still give too many false positives, while for less noisy benchmarks we would be unnecessarily ignoring small regressions.
> > > >
> > > > I was also informed about some tooling created exactly for detecting performance regressions from benchmark results:
> > > >
> > > > > fork of Hunter - a perf change detection tool, originally from DataStax:
> > > > > Blog post - https://medium.com/building-the-open-data-stack/detecting-performance-regressions-with-datastax-hunter-c22dc444aea4
> > > > > Paper - https://arxiv.org/pdf/2301.03034.pdf
> > > > > Our fork - https://github.com/gerrrr/hunter
> > > >
> > > > The algorithm that's used underneath, "E-divisive Means", sounds promising.
> > > >
> > > > Best,
> > > > Piotrek
> > > >
> > > > On Tue, Jan 31, 2023 at 09:56 Yanfei Lei <fredia...@gmail.com> wrote:
> > > > >
> > > > > Thanks for all the feedback and suggestions.
> > > > >
> > > > > @Piotr:
> > > > > >> I was setting the priority to a blocker and I would propose to add this to the instructions and general convention.
> > > > >
> > > > > Thanks for sharing your experience, I will add this to the document. And your suggestion of leveraging the two-sample Kolmogorov-Smirnov test to detect regressions in FLINK-29825[1] is a great idea; I'm curious about two things:
> > > > >
> > > > > (1) It's not easy to construct the first distribution: "all latest samples, from the latest one (1st one), until N'th latest sample". If some regressions have occurred before, they will "distort" the first distribution, possibly leading to false positives; meanwhile, an optimization can also cause such "distortion", which possibly leads to false negatives (this is relatively acceptable). Maybe we can filter out outliers to avoid this problem.
> > > > >
> > > > > (2) Also, the Kolmogorov-Smirnov statistic is an absolute value, so it alone cannot tell whether a detected change is an improvement or a regression.
> > > > > My plan is/was to use the error to adjust each threshold: each benchmark run has a "Score Error (99.9%)" value in the result[2], so let threshold = 0.75 * (max error of the N latest samples) and keep the rest of the logic the same as in the existing detection script.
> > > > >
> > > > > @Dong Lin:
> > > > > The current detection script[3] is similar to the average-throughput-deviation method you described. We use the median of the last 100 days as a baseline and compare it with the median of the last 20 days, so this script may delay regression alerts. Detecting by deviation is a good approach; each benchmark run already has a "Score Error (99.9%)" value in the result, so maybe we can also factor the "Score Error (99.9%)" into the deviation.
> > > > >
> > > > > In addition, because regressions/improvements do occur, the median of the baseline is itself unstable. I think RobustScaler[4] from Flink ML is very suitable here; maybe RobustScaler could be used to calculate the median of the baseline.
> > > > >
> > > > > The problem faced by this automatic detection tool is very similar to novelty and outlier detection, so maybe some methods from scikit-learn[5] could be considered.
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/FLINK-29825?focusedCommentId=17679077&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17679077
> > > > > [2] http://codespeed.dak8s.net:8080/job/flink-benchmark-request/lastSuccessfulBuild/artifact/jmh-result.csv/*view*/
> > > > > [3] https://github.com/apache/flink-benchmarks/blob/master/regression_report.py
> > > > > [4] https://nightlies.apache.org/flink/flink-ml-docs-master/docs/operators/feature/robustscaler/
> > > > > [5] https://scikit-learn.org/stable/modules/outlier_detection.html
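[A rough sketch of the error-adjusted threshold idea above: compare the median of a recent window against the median of the older runs, and derive the alert threshold from the JMH "Score Error (99.9%)" values. The 20-run window, the 0.75 factor, the benchmark name, and all numbers are illustrative assumptions; this is not the logic of the existing regression_report.py script:]

    import statistics

    def median_regression_check(history, score_errors, window=20, factor=0.75):
        """history: {benchmark: scores, oldest to newest}.
        score_errors: {benchmark: "Score Error (99.9%)" values of the latest runs}.
        Flags a benchmark when the median of the last `window` runs falls below
        the median of the older runs by more than factor * max(score error)."""
        alerts = []
        for name, scores in history.items():
            baseline = statistics.median(scores[:-window])
            recent = statistics.median(scores[-window:])
            threshold = factor * max(score_errors[name])
            if baseline - recent > threshold:
                alerts.append((name, baseline, recent, threshold))
        return alerts

    # Hypothetical benchmark whose score drops ~6% over its last 20 runs.
    history = {"tumblingWindow": [500 + i % 5 for i in range(80)] + [470 + i % 5 for i in range(20)]}
    score_errors = {"tumblingWindow": [12.0, 15.0, 11.0]}
    print(median_regression_check(history, score_errors))   # one alert for "tumblingWindow"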
> > > > >
> > > > > On Tue, Jan 31, 2023 at 10:23 Dong Lin <lindon...@gmail.com> wrote:
> > > > > >
> > > > > > Hi Piotr,
> > > > > >
> > > > > > Yes, the challenge of developing such an automatic tool is indeed to handle noise and achieve a balance between false positives and false negatives. It is great to know that we already have scripts that can access historical benchmark data and generate alerts.
> > > > > >
> > > > > > There are some heuristics we can use to considerably reduce false-positive alerts. For example, we can measure the average-throughput-deviation of a benchmark over 5 runs and suppress alerts only for benchmarks whose deviation is too high (i.e. the benchmark is too noisy and requires tuning or removal). And we can compare the deviation-from-last-commit with the average-throughput-deviation of the benchmark and report an error if these two values differ too much (i.e. the performance regression is observably higher than the benchmark's typical noise).
> > > > > >
> > > > > > I implemented this benchmark framework <https://github.com/tensorflow/benchmarks/tree/master/perfzero> for TensorFlow three years ago and re-designed a regression detection algorithm internally at Google using the heuristics described above. I recall that the script worked reasonably well and most of the regression alerts were actionable. Maybe I can take a shot at implementing a similar algorithm for the Flink benchmarks sometime later and see how it works.
> > > > > >
> > > > > > Anyway, the development of the algorithm is probably not the focus of this topic. +1 for incorporating performance regression monitoring into the routine process.
> > > > > >
> > > > > > Best,
> > > > > > Dong
> > > > > >
> > > > > > On Mon, Jan 30, 2023 at 10:56 PM Piotr Nowojski <pnowoj...@apache.org> wrote:
> > > > > > >
> > > > > > > Hi Dong,
> > > > > > >
> > > > > > > The main issue with an automatic tool at the moment is that some benchmarks are quite noisy and performance regressions are often within the noise of a given benchmark. Our existing tooling cannot handle those cases. Until we address this issue, I think it will have to remain a manual process. There is a ticket mentioned by Yuan [1] where I have written a comment and a proposal on how to improve the automatic performance regression detection.
> > > > > > >
> > > > > > > Best,
> > > > > > > Piotrek
> > > > > > >
> > > > > > > [1] https://issues.apache.org/jira/browse/FLINK-29825?focusedCommentId=17679077&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17679077
> > > > > > >
> > > > > > > On Mon, Jan 30, 2023 at 15:31 Dong Lin <lindon...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Hi Yanfei,
> > > > > > > >
> > > > > > > > Thanks for driving the benchmark monitoring effort! The Google doc and the community wiki look pretty good.
> > > > > > > >
> > > > > > > > According to Yuan's comment, it seems that we currently watch the benchmark results manually to detect regressions. Have we considered automating this process by e.g. exporting the nightly benchmark results to a database and using scripts to detect regressions based on pre-defined rules?
> > > > > > > >
> > > > > > > > This approach is probably more scalable and accurate in the long term. And I had a good experience working with such a regression detection tool in my past job.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Dong
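[A rough sketch of one such pre-defined rule, following the noise-aware deviation heuristic Dong describes above: skip benchmarks that are historically too noisy, and alert only when the drop caused by the last commit clearly exceeds the benchmark's typical noise. The noise limit, the factor of 2, and all sample numbers are illustrative assumptions:]

    import statistics

    def noise_aware_check(history_means, last_commit_scores, noise_limit=0.10, factor=2.0):
        """history_means: mean scores of previous runs of one benchmark.
        last_commit_scores: scores measured for the current commit (e.g. 5 runs)."""
        hist_mean = statistics.mean(history_means)
        rel_noise = statistics.stdev(history_means) / hist_mean
        if rel_noise > noise_limit:
            return "too-noisy"                      # candidate for tuning or removal
        drop = (hist_mean - statistics.mean(last_commit_scores)) / hist_mean
        if drop > factor * rel_noise:
            return "regression"                     # drop clearly above typical noise
        return "ok"

    # Made-up numbers: a stable benchmark that becomes ~7% slower after a commit.
    history = [1000, 1010, 995, 1005, 990]          # ops/ms of previous runs
    current = [930, 940, 925, 935, 938]
    print(noise_aware_check(history, current))      # prints "regression"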
> > > > > > > >
> > > > > > > > On Thu, Jan 19, 2023 at 4:02 PM Yanfei Lei <fredia...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi devs,
> > > > > > > > >
> > > > > > > > > I'd like to start a discussion about incorporating performance regression monitoring into the routine process. Flink benchmarks are periodically executed on http://codespeed.dak8s.net:8080 to monitor Flink performance. In late Oct'22, a new Slack channel #flink-dev-benchmarks was created for notifications of performance regressions. It helped us find 2 build failures[1,2] and 5 performance regressions[3,4,5,6,7] in the past 3 months, which is very valuable for ensuring the quality of the code.
> > > > > > > > >
> > > > > > > > > Some release managers (cc @Matthias, @Martijn, @Qingsheng) are proposing to incorporate performance regression monitoring into the release management. I think this makes sense for performance stability (just like CI stability): almost every release has some tickets about performance optimizations, and performance monitoring can effectively prevent performance regressions and track the performance improvements of each release. So I am starting this discussion to pick everyone's brain for suggestions.
> > > > > > > > >
> > > > > > > > > In the past, I checked the Slack notifications once a week, and I have summarized a draft[8] (https://docs.google.com/document/d/1jTTJHoCTf8_LAjviyAY3Fi7p-tYtl_zw7rJKV4V6T_c/edit?usp=sharing) on how to deal with performance regressions, based on some contributors' experience and my own. If the above proposal is considered acceptable, I'd like to put it in the community wiki[9].
> > > > > > > > >
> > > > > > > > > Looking forward to your feedback!
> > > > > > > > >
> > > > > > > > > [1] https://issues.apache.org/jira/browse/FLINK-29883
> > > > > > > > > [2] https://issues.apache.org/jira/browse/FLINK-30015
> > > > > > > > > [3] https://issues.apache.org/jira/browse/FLINK-29886
> > > > > > > > > [4] https://issues.apache.org/jira/browse/FLINK-30181
> > > > > > > > > [5] https://issues.apache.org/jira/browse/FLINK-30623
> > > > > > > > > [6] https://issues.apache.org/jira/browse/FLINK-30624
> > > > > > > > > [7] https://issues.apache.org/jira/browse/FLINK-30625
> > > > > > > > > [8] https://docs.google.com/document/d/1jTTJHoCTf8_LAjviyAY3Fi7p-tYtl_zw7rJKV4V6T_c/edit?usp=sharing
> > > > > > > > > [9] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=115511847
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Yanfei
> > > > >
> > > > > --
> > > > > Best,
> > > > > Yanfei