One quick note for clarification: I don't have anything against builds running on your personal Azure account, and that is not what I understood by "local environment". For me, "local environment" means that someone runs the tests locally on their machine and then says that the tests have passed locally.
I do agree that there might be a conflict of interest if a PR author disables tests. Here I would argue that we don't have malicious committers, which means that every committer will probably first check the respective ticket for how often the test failed. The next step would then be to discuss on the ticket whether to disable it or not, and finally, after reaching a consensus, it would be disabled. If we see someone abusing this policy, we can still think about how to guard against it, but honestly, I have very rarely seen such a case. I am also OK with pulling in the release manager to make the final call if that resolves the concerns.

Cheers,
Till

On Fri, Jun 25, 2021 at 9:07 AM Piotr Nowojski <pnowoj...@apache.org> wrote:

+1 for the general idea; however, I have concerns about a couple of details.

> I would first try to not introduce the exception for local builds. It makes it quite hard for others to verify the build and to make sure that the right things were executed.

I would counter Till's proposal to ignore local green builds. If a committer is merging and closing a PR with an official Azure failure, but there was a green build before, or in local Azure, it's IMO enough to leave a message such as:

> Latest build failure is a known issue: FLINK-12345
> Green local build: URL

This should address Till's concern about verification.

On the other hand, I have concerns about disabling tests. *It shouldn't be the PR author/committer that disables a test on their own, as that's a conflict of interest.* I have no problem, however, with disabling test instabilities that were marked as "blockers"; that should work pretty well. But the important thing here is to correctly judge when to bump the priority of a test instability, based on its failure frequency and the current general health of the system.
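The mechanism under discussion — disabling a flaky test while pointing at its tracking ticket, rather than deleting it silently — can be sketched as follows. This is an illustration using Python's stdlib `unittest` as a stand-in for JUnit (which Flink actually uses, via its ignore/disable annotations); the class name is made up, and FLINK-12345 is the placeholder ticket number from the message above.

```python
import io
import unittest

class SomeConnectorITCase(unittest.TestCase):
    """Hypothetical test class illustrating the disable-with-ticket pattern."""

    # The skip reason records WHY the test is off and WHICH ticket tracks it,
    # so the skip shows up in reports instead of the test vanishing quietly.
    @unittest.skip("FLINK-12345: unstable on CI; re-enable once the blocker is fixed")
    def test_unstable_scenario(self):
        raise AssertionError("flaky on CI")  # never reached while skipped

    def test_stable_scenario(self):
        self.assertEqual(1 + 1, 2)

# Run the suite programmatically to show the skip is recorded, not lost.
suite = unittest.TestLoader().loadTestsFromTestCase(SomeConnectorITCase)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
print(result.testsRun, len(result.skipped), result.wasSuccessful())
# prints: 2 1 True
```

The point of the pattern is that the skip reason survives in every test report, which makes it easy for a release manager to audit what is currently disabled and why.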
I believe that release managers should play a big role here in deciding on the guidelines for what the priority of a given test instability should be.

What I mean by that is two example scenarios:
1. If we have a handful of very frequently failing tests and a handful of very rarely failing tests (say, one reported failure with no other occurrence in many months, where the failure even looks like an infrastructure/network timeout), we should focus on the frequently failing ones, and we are probably safe to ignore the rare issues for the time being — at least until we deal with the most pressing ones.
2. If we have tons of rarely failing test instabilities, we should probably start addressing them as blockers.

I'm using my own conscience and my best judgement when I'm bumping/decreasing priorities of test instabilities (and bugs), but as an individual committer I don't have the full picture. As I wrote above, I think release managers are in a much better position to keep adjusting these kinds of guidelines.

Best,
Piotrek

On Fri, 25 Jun 2021 at 08:10, Yu Li <car...@gmail.com> wrote:

+1 for Xintong's proposal.

For me, resolving problems directly (fixing the infrastructure issue, disabling unstable tests and creating blocker JIRAs to track the fix and re-enable them ASAP, etc.) is in most cases better than working around them (verifying locally, manually checking and judging the failure as "unrelated", etc.), and I believe the proposal could help us push these more "real" solutions forward.

Best Regards,
Yu

On Fri, 25 Jun 2021 at 10:58, Yangze Guo <karma...@gmail.com> wrote:

Creating a blocker issue for the manually disabled tests sounds good to me.

Minor: I'm still a bit worried that commits merged before we fix the unstable tests can also break those tests.
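The frequency-based triage described in the two scenarios above could be expressed as a simple heuristic. The sketch below is an illustration only, not an actual Flink policy: the window length, the threshold, and the priority names are hypothetical choices.

```python
from datetime import date, timedelta

def suggest_priority(failure_dates, today, window_days=30, blocker_threshold=3):
    """Hypothetical triage heuristic: an instability that failed often
    within the recent window is suggested as a Blocker; an occasional
    recent failure stays Critical; a single stale failure (e.g. a
    months-old network timeout) is left as Minor."""
    cutoff = today - timedelta(days=window_days)
    recent = [d for d in failure_dates if d >= cutoff]
    if len(recent) >= blocker_threshold:
        return "Blocker"
    return "Critical" if recent else "Minor"
```

For example, three failures within the last week would suggest "Blocker", while one failure from five months ago would suggest "Minor" — matching the intuition that frequent failures deserve the attention first. In practice the thresholds would be something for the release managers to calibrate, as the message above argues.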
Instead of having the assignees keep an eye on all potentially related commits, they could maintain a branch that is periodically synced with the master branch while keeping the unstable test enabled, so that they can catch breaking changes ASAP.

Best,
Yangze Guo

On Thu, Jun 24, 2021 at 9:52 PM Till Rohrmann <trohrm...@apache.org> wrote:

I like the idea of creating a blocker issue for a disabled test. This will force us to resolve it in a timely manner, and it won't fall through the cracks.

Cheers,
Till

On Thu, Jun 24, 2021 at 8:06 AM Jingsong Li <jingsongl...@gmail.com> wrote:

+1 to Xintong's proposal.

I also have some concerns about unstable cases.

I think unstable cases can be divided into these types:

- Force majeure: for example, a network timeout or a sudden environment failure. These are accidental and can usually be resolved by triggering Azure again; committers should wait for the next green Azure run.

- Obvious mistakes: for example, errors with an obvious cause that can be repaired quickly. In this case, do we need to wait, or can we just ignore the failure?

- Difficult problems: these are very hard to diagnose. There will be no solution for quite a while, and we may not even know the cause. In this case we should ignore the test. (Maybe that is judged by the author of the test case — but what about old cases whose author can't be found?)

So, should the ignored cases block the next release until the cause is found or the case is fixed?
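The periodically-synced branch suggested above could be driven by a small script run from a CI cron job. The sketch below is a minimal illustration under stated assumptions: the branch name `flaky-watch` is a hypothetical placeholder, the repository is assumed to already contain that branch with the unstable tests re-enabled, and the merge direction is master into the watch branch.

```python
import subprocess

def sync_watch_branch(repo_dir, watch_branch="flaky-watch", upstream="master"):
    """Sketch: keep a 'watch' branch (where the unstable tests stay
    enabled) merged up to date with the upstream branch, so a scheduled
    CI run on it catches breaking changes soon after they land."""
    def git(*args):
        subprocess.run(["git", "-C", repo_dir, *args],
                       check=True, capture_output=True, text=True)

    git("checkout", watch_branch)
    git("merge", "--no-edit", upstream)  # fast-forwards when possible
    # At this point a scheduled CI pipeline would run the full test suite,
    # including the re-enabled unstable tests, and report failures on JIRA.
```

The design choice here is that the watch branch never diverges in content from master except for the re-enabled tests, so any new failure on it points at a recently merged commit.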
We need to ensure that someone will take care of these cases, because if no one digs into the failed tests, no one may continue to pay attention to them.

I think this guideline should consider these situations and show how to handle them.

Best,
Jingsong

On Thu, Jun 24, 2021 at 10:57 AM Jark Wu <imj...@gmail.com> wrote:

Thanks to Xintong for bringing up this topic. I'm +1 in general.

However, I think it's still not very clear how we address the unstable tests. I think this is a very important part of this new guideline.

According to the discussion above, if some tests are unstable, we can manually disable them. But I have some questions in mind:
1) Is the instability judged by the committers themselves or by some metrics?
2) Should we log the disabling commit in the corresponding issue and increase its priority?
3) What if nobody looks into the issue and it becomes a potential bug released with the new version?
4) If no one is actively working on the issue, who should re-enable the test? Would it block PRs again?

Best,
Jark

On Thu, 24 Jun 2021 at 10:04, Xintong Song <tonysong...@gmail.com> wrote:

Thanks all for the feedback.

@Till @Yangze

I'm also not convinced by the idea of having an exception for local builds.
We would need to execute the entire build (or at least the failing stage) locally, to make sure the subsequent test cases prevented by the failing one are all executed. In that case, it's probably easier to rerun the build on Azure than locally.

Concerning disabling unstable test cases that regularly block PRs from merging, maybe we can say that such cases can only be disabled while someone is actively looking into them — likely the person who disabled the case. If this person is no longer actively working on it, he/she should enable the case again, whether it is fixed or not.

@Jing

Thanks for the suggestions.

+1 to providing guidelines on handling test failures.

> 1. Report the test failures in the JIRA.

+1 on this. Currently, the release managers are monitoring the CI and cron build instabilities and reporting them on JIRA. We should also encourage other contributors to do that for PRs.

> 2. Set a deadline to find out the root cause and solve the failure for the newly created JIRA, because we could not block other commit merges for a long time.
> 3. What to do if the JIRA has not made significant progress when the deadline is reached?

I'm not sure about these two. It feels a bit against the voluntary nature of open source projects.
IMHO, frequent instabilities are more likely to be upgraded to critical/blocker priority, receive more attention, and eventually get fixed. Release managers are also responsible for looking for assignees for such issues. If a case is still not fixed soon, even with all these efforts, I'm not sure how setting a deadline would help.

> 4. If we disable the respective tests temporarily, we also need a mechanism to ensure the issue will continue to be investigated in the future.

+1. As mentioned above, we may consider disabling such tests only while someone is actively working on them.

Thank you~

Xintong Song

On Wed, Jun 23, 2021 at 9:56 PM JING ZHANG <beyond1...@gmail.com> wrote:

Hi Xintong,

+1 to the proposal.

In order to better comply with the rule, it is necessary to describe the best practice when encountering a test failure which seems unrelated to the current commits. How can we avoid merging PRs with test failures while not blocking code merging for a long time?

I tried to think through the possible steps, and found some detailed problems that need further discussion:
1. Report the test failures in the JIRA.
2. Set a deadline to find out the root cause and solve the failure for the newly created JIRA, because we could not block other commit merges for a long time. What is a reasonable deadline here?
3. What to do if the JIRA has not made significant progress when the deadline is reached? There are several situations, and different cases may need different approaches:
   1. the JIRA is not assigned yet
   2. the root cause has not been found yet
   3. the root cause is found, but no good solution yet
   4. a solution is found, but it needs more time to be done
4. If we disable the respective tests temporarily, we also need a mechanism to ensure the issue will continue to be investigated in the future.

Best regards,
JING ZHANG

On Wed, Jun 23, 2021 at 8:16 PM Stephan Ewen <se...@apache.org> wrote:

+1 to Xintong's proposal.

On Wed, Jun 23, 2021 at 1:53 PM Till Rohrmann <trohrm...@apache.org> wrote:

I would first try to not introduce the exception for local builds. It makes it quite hard for others to verify the build and to make sure that the right things were executed. If we see that this becomes an issue, then we can revisit this idea.
Cheers,
Till

On Wed, Jun 23, 2021 at 4:19 AM Yangze Guo <karma...@gmail.com> wrote:

+1 for appending this to the community guidelines for merging PRs.

@Till Rohrmann
I agree that with this approach unstable tests will not block other commit merges. However, it might be hard to prevent merging commits that are related to those tests and should have passed them. It's true that this judgment can be made by the committers, but no one can ensure the judgment is always precise — which is why we have this discussion thread.

Regarding the unstable tests, how about adding another exception: committers verify the change in their local environment and comment in such cases?

Best,
Yangze Guo

On Tue, Jun 22, 2021 at 8:23 PM 刘建刚 <liujiangangp...@gmail.com> wrote:

It is a good principle to run all tests successfully with any change. This means a lot for the project's stability and development. I am a big +1 for this proposal.
Best,
liujiangang

On Tue, Jun 22, 2021 at 6:36 PM Till Rohrmann <trohrm...@apache.org> wrote:

One way to address the problem of regularly failing tests that block merging of PRs is to disable the respective tests for the time being. Of course, the failing test then needs to be fixed, but at least that way we would not block everyone from making progress.

Cheers,
Till

On Tue, Jun 22, 2021 at 12:00 PM Arvid Heise <ar...@apache.org> wrote:

I think this is overall a good idea, so +1 from my side. However, I'd then like to put a higher priority on infrastructure, in particular Docker image/artifact caches.

On Tue, Jun 22, 2021 at 11:50 AM Till Rohrmann <trohrm...@apache.org> wrote:

Thanks for bringing this topic to our attention, Xintong. I think your proposal makes a lot of sense and we should follow it.
It will give us confidence that our changes are working, and it might be a good incentive to quickly fix build instabilities. Hence, +1.

Cheers,
Till

On Tue, Jun 22, 2021 at 11:12 AM Xintong Song <tonysong...@gmail.com> wrote:

Hi everyone,

In the past couple of weeks, I've observed several times that PRs are merged without a green light from the CI tests, with the failure cases considered *unrelated*. This may not always cause problems, but it increases the chance of breaking our code base. In fact, it has happened to me twice in the past few weeks that I had to revert a commit which broke the master branch due to this.
I think it would be nicer to enforce a stricter rule: no PRs should be merged without passing CI.

The problems with merging PRs with "unrelated" test failures are:
- It's not always straightforward to tell whether a test failure is related or not.
- It prevents subsequent test cases from being executed, and those might fail in ways related to the PR changes.

To make things easier for the committers, the following exceptions might be considered acceptable:
- The PR has passed CI in the contributor's personal workspace. Please post the link in such cases.
- The CI tests have been triggered multiple times on the same commit, and each stage has passed at least once. Please also comment in such cases.
If we all agree on this, I'd update the community guidelines for merging PRs with this proposal. [1]

Please let me know what you think.

Thank you~

Xintong Song

[1] https://cwiki.apache.org/confluence/display/FLINK/Merging+Pull+Requests

--
Best, Jingsong Lee