Hi devs,

I'd like to start a discussion about incorporating performance regression monitoring into the routine release process. Flink benchmarks are executed periodically on http://codespeed.dak8s.net:8080 to monitor Flink performance. In late October 2022, a new Slack channel #flink-dev-benchmarks was created for notifications of performance regressions. It has helped us find 2 build failures [1][2] and 5 performance regressions [3][4][5][6][7] in the past 3 months, which has been very valuable for ensuring code quality.
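For context on what these benchmarks look like: they are JMH microbenchmarks, and codespeed tracks each benchmark's score across commits, so a sustained drop shows up as a regression in the channel. A minimal sketch of such a benchmark (the class name and workload here are hypothetical, purely for illustration, not an actual benchmark from the suite):

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

// Hypothetical example: JMH measures the throughput of copyRecord(), and
// codespeed plots that score over time so a sustained drop gets flagged.
@State(Scope.Thread)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class RecordCopyBenchmark {

    private byte[] record;

    @Setup
    public void setUp() {
        record = new byte[1024]; // fixed-size payload keeps runs comparable
    }

    @Benchmark
    public void copyRecord(Blackhole bh) {
        // Stand-in for the code path under test (e.g. a serializer);
        // consuming the result prevents dead-code elimination.
        bh.consume(record.clone());
    }
}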
Some release managers (cc @Matthias, @Martijn, @Qingsheng) have proposed incorporating performance regression monitoring into release management. I think this makes sense: performance stability deserves the same attention as CI stability, since almost every release includes tickets for performance optimizations, and monitoring can effectively prevent performance regressions and track the performance improvements of each release. So I'm starting this discussion to gather everyone's suggestions.

In the past, I checked the Slack notifications once a week, and I have summarized a draft [8] on how to deal with performance regressions, based on the experience of other contributors and my own. If the above proposal is considered acceptable, I'd like to put it into the community wiki [9].

Looking forward to your feedback!

[1] https://issues.apache.org/jira/browse/FLINK-29883
[2] https://issues.apache.org/jira/browse/FLINK-30015
[3] https://issues.apache.org/jira/browse/FLINK-29886
[4] https://issues.apache.org/jira/browse/FLINK-30181
[5] https://issues.apache.org/jira/browse/FLINK-30623
[6] https://issues.apache.org/jira/browse/FLINK-30624
[7] https://issues.apache.org/jira/browse/FLINK-30625
[8] https://docs.google.com/document/d/1jTTJHoCTf8_LAjviyAY3Fi7p-tYtl_zw7rJKV4V6T_c/edit?usp=sharing
[9] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=115511847

Best,
Yanfei