To use the pain analogy, people seem to have really good painkillers and hence they somehow don't feel the pain already. ;)
The reality is that important and high quality tests will get fixed. Poor quality tests (low signal to noise ratio) might not get fixed and that's ok. I'm not opposed to marking the tests as release blockers as a starting point, but I'm saying it's fine if people triage them and decide they are not blockers. In fact, that has already happened in the past. Ismael On Tue, Nov 14, 2023 at 10:02 PM Matthias J. Sax <mj...@apache.org> wrote: > I agree on the test gap argument. However, my worry is, if we don't > "force the pain", it won't get fixed at all. -- I also know, that we try > to find an working approach for many years... > > My take is that if we disable a test and file a non-blocking Jira, it's > basically the same as just deleting the test all together and never talk > about it again. -- I believe, this is not want we aim for, but we aim > for good test coverage and a way to get these test fixed? > > Thus IMHO we need some forcing function (either keep the tests and feel > the pain on every PR), or disable the test and file a blocker JIRA so > the pain surfaces on a release forcing us to do something about it. > > If there is no forcing function, it basically means we are willing to > accept test gaps forever. > > > -Matthias > > On 11/14/23 9:09 PM, Ismael Juma wrote: > > Matthias, > > > > Flaky tests are worse than useless. I know engineers find it hard to > > disable them because of the supposed test gap (I find it hard too), but > the > > truth is that the test gap is already there! No-one blocks merging PRs on > > flaky tests, but they do get used to ignoring build failures. > > > > The current approach has been attempted for nearly a decade and it has > > never worked. I think we should try something different. > > > > When it comes to marking flaky tests as release blockers, I don't think > > this should be done as a general rule. We should instead assess on a case > > by case basis, same way we do it for bugs. > > > > Ismael > > > > On Tue, Nov 14, 2023 at 5:02 PM Matthias J. Sax <mj...@apache.org> > wrote: > > > >> Thanks for starting this discussion David! I totally agree to "no"! > >> > >> I think there is no excuse whatsoever for merging PRs with compilation > >> errors (except an honest mistake for conflicting PRs that got merged > >> interleaved). -- Every committer must(!) check the Jenkins status before > >> merging to avoid such an issue. > >> > >> Similar for actual permanently broken tests. If there is no green build, > >> and the same test failed across multiple Jenkins runs, a committer > >> should detect this and cannot merge a PR. > >> > >> Given the current state of the CI pipeline, it seems possible to get > >> green runs, and thus I support the policy (that we actually always had) > >> to only merge if there is at least one green build. If committers got > >> sloppy about this, we need to call it out and put a hold on this > practice. > >> > >> (The only exception from the above policy would be a very unstable > >> status for which getting a green build is not possible at all, due to > >> too many flaky tests -- for this case, I would accept to merge even > >> there is no green build, but committer need to manually ensure that > >> every test did pass in at least one test run. -- We had this in the > >> past, but we I don't think we are in such a bad situation right now). > >> > >> About disabling tests: I was never a fan of this, because in my > >> experience those tests are not fixed any time soon. Especially, because > >> we do not consider such tickets as release blockers. To me, seeing tests > >> fails on PR build is actually a good forcing function for people to feel > >> the pain, and thus get motivated to make time to fix the tests. > >> > >> I have to admit that I was a little bit sloppy paying attention to flaky > >> tests recently, and I highly appreciate this effort. Also thanks to > >> everyone how actually filed a ticket! IMHO, we should file a ticket for > >> every flaky test, and also keep adding comments each time we see a test > >> fail to be able to track the frequency at which a tests fails, so we can > >> fix the most pressing ones first. > >> > >> Te me, the best forcing function to get test stabilized is to file > >> tickets and consider them release blockers. Disabling tests does not > >> really help much IMHO to tackle the problem (we can of course still > >> disable them to get noise out of the system, but it would only introduce > >> testing gaps for the time being and also does not help to figure out how > >> often a test fails, so it's not a solution to the problem IMHO) > >> > >> > >> -Matthias > >> > >> On 11/13/23 11:40 PM, Sagar wrote: > >>> Hi Divij, > >>> > >>> I think this proposal overall makes sense. My only nit sort of a > >> suggestion > >>> is that let's also consider a label called newbie++[1] for flaky tests > if > >>> we are considering adding newbie as a label. I think some of the flaky > >>> tests need familiarity with the codebase or the test setup so as a > first > >>> time contributor, it might be difficult. newbie++ IMO covers that > aspect. > >>> > >>> [1] > >>> > >> > https://issues.apache.org/jira/browse/KAFKA-15406?jql=project%20%3D%20KAFKA%20AND%20labels%20%3D%20%22newbie%2B%2B%22 > >>> > >>> Let me know what you think. > >>> > >>> Thanks! > >>> Sagar. > >>> > >>> On Mon, Nov 13, 2023 at 9:11 PM Divij Vaidya <divijvaidy...@gmail.com> > >>> wrote: > >>> > >>>>> Please, do it. > >>>> We can use specific labels to effectively filter those tickets. > >>>> > >>>> We already have a label and a way to discover flaky tests. They are > >> tagged > >>>> with the label "flaky-test" [1]. There is also a label "newbie" [2] > >> meant > >>>> for folks who are new to Apache Kafka code base. > >>>> My suggestion is to send a broader email to the community (since many > >> will > >>>> miss details in this thread) and call for action for committers to > >>>> volunteer as "shepherds" for these tickets. I can send one out once we > >> have > >>>> some consensus wrt next steps in this thread. > >>>> > >>>> > >>>> [1] > >>>> > >>>> > >> > https://issues.apache.org/jira/browse/KAFKA-13421?jql=project%20%3D%20KAFKA%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20%22Patch%20Available%22)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20flaky-test%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC > >>>> > >>>> > >>>> [2] https://kafka.apache.org/contributing -> Finding a project to > work > >> on > >>>> > >>>> > >>>> Divij Vaidya > >>>> > >>>> > >>>> > >>>> On Mon, Nov 13, 2023 at 4:24 PM Николай Ижиков <nizhi...@apache.org> > >>>> wrote: > >>>> > >>>>> > >>>>>> To kickstart this effort, we can publish a list of such tickets in > the > >>>>> community and assign one or more committers the role of a «shepherd" > >> for > >>>>> each ticket. > >>>>> > >>>>> Please, do it. > >>>>> We can use specific label to effectively filter those tickets. > >>>>> > >>>>>> 13 нояб. 2023 г., в 15:16, Divij Vaidya <divijvaidy...@gmail.com> > >>>>> написал(а): > >>>>>> > >>>>>> Thanks for bringing this up David. > >>>>>> > >>>>>> My primary concern revolves around the possibility that the > currently > >>>>>> disabled tests may remain inactive indefinitely. We currently have > >>>>>> unresolved JIRA tickets for flaky tests that have been pending for > an > >>>>>> extended period. I am inclined to support the idea of disabling > these > >>>>> tests > >>>>>> temporarily and merging changes only when the build is successful, > >>>>> provided > >>>>>> there is a clear plan for re-enabling them in the future. > >>>>>> > >>>>>> To address this issue, I propose the following measures: > >>>>>> > >>>>>> 1\ Foster a supportive environment for new contributors within the > >>>>>> community, encouraging them to take on tickets associated with flaky > >>>>> tests. > >>>>>> This initiative would require individuals familiar with the relevant > >>>> code > >>>>>> to offer guidance to those undertaking these tasks. Committers > should > >>>>>> prioritize reviewing and addressing these tickets within their > >>>> available > >>>>>> bandwidth. To kickstart this effort, we can publish a list of such > >>>>> tickets > >>>>>> in the community and assign one or more committers the role of a > >>>>> "shepherd" > >>>>>> for each ticket. > >>>>>> > >>>>>> 2\ Implement a policy to block minor version releases until the > >> Release > >>>>>> Manager (RM) is satisfied that the disabled tests do not result in > >> gaps > >>>>> in > >>>>>> our testing coverage. The RM may rely on Subject Matter Experts > (SMEs) > >>>> in > >>>>>> the specific code areas to provide assurance before giving the green > >>>>> light > >>>>>> for a release. > >>>>>> > >>>>>> 3\ Set a community-wide goal for 2024 to achieve a stable Continuous > >>>>>> Integration (CI) system. This goal should encompass projects such as > >>>>>> refining our test suite to eliminate flakiness and addressing > >>>>>> infrastructure issues if necessary. By publishing this goal, we > create > >>>> a > >>>>>> shared vision for the community in 2024, fostering alignment on our > >>>>>> objectives. This alignment will aid in prioritizing tasks for > >> community > >>>>>> members and guide reviewers in allocating their bandwidth > effectively. > >>>>>> > >>>>>> -- > >>>>>> Divij Vaidya > >>>>>> > >>>>>> > >>>>>> > >>>>>> On Sun, Nov 12, 2023 at 2:58 AM Justine Olshan > >>>>> <jols...@confluent.io.invalid> > >>>>>> wrote: > >>>>>> > >>>>>>> I will say that I have also seen tests that seem to be more flaky > >>>>>>> intermittently. It may be ok for some time and suddenly the CI is > >>>>>>> overloaded and we see issues. > >>>>>>> I have also seen the CI struggling with running out of space > >> recently, > >>>>> so I > >>>>>>> wonder if we can also try to improve things on that front. > >>>>>>> > >>>>>>> FWIW, I noticed, filed, or commented on several flaky test JIRAs > last > >>>>> week. > >>>>>>> I'm happy to try to get to green builds, but everyone needs to be > on > >>>>> board. > >>>>>>> > >>>>>>> https://issues.apache.org/jira/browse/KAFKA-15529 > >>>>>>> https://issues.apache.org/jira/browse/KAFKA-14806 > >>>>>>> https://issues.apache.org/jira/browse/KAFKA-14249 > >>>>>>> https://issues.apache.org/jira/browse/KAFKA-15798 > >>>>>>> https://issues.apache.org/jira/browse/KAFKA-15797 > >>>>>>> https://issues.apache.org/jira/browse/KAFKA-15690 > >>>>>>> https://issues.apache.org/jira/browse/KAFKA-15699 > >>>>>>> https://issues.apache.org/jira/browse/KAFKA-15772 > >>>>>>> https://issues.apache.org/jira/browse/KAFKA-15759 > >>>>>>> https://issues.apache.org/jira/browse/KAFKA-15760 > >>>>>>> https://issues.apache.org/jira/browse/KAFKA-15700 > >>>>>>> > >>>>>>> I've also seen that kraft transactions tests often flakily see that > >>>> the > >>>>>>> producer id is not allocated and times out. > >>>>>>> I can file a JIRA for that too. > >>>>>>> > >>>>>>> Hopefully this is a place we can start from. > >>>>>>> > >>>>>>> Justine > >>>>>>> > >>>>>>> On Sat, Nov 11, 2023 at 11:35 AM Ismael Juma <m...@ismaeljuma.com> > >>>> wrote: > >>>>>>> > >>>>>>>> On Sat, Nov 11, 2023 at 10:32 AM John Roesler < > vvcep...@apache.org> > >>>>>>> wrote: > >>>>>>>> > >>>>>>>>> In other words, I’m biased to think that new flakiness indicates > >>>>>>>>> non-deterministic bugs more often than it indicates a bad test. > >>>>>>>>> > >>>>>>>> > >>>>>>>> My experience is exactly the opposite. As someone who has tracked > >>>> many > >>>>> of > >>>>>>>> the flaky fixes, the vast majority of the time they are an issue > >> with > >>>>> the > >>>>>>>> test. > >>>>>>>> > >>>>>>>> Ismael > >>>>>>>> > >>>>>>> > >>>>> > >>>>> > >>>> > >>> > >> > > >