Re: [DISCUSS] Do not merge PRs with "unrelated" test failures.

Yu Li Thu, 24 Jun 2021 23:10:58 -0700

+1 for Xintong's proposal.

For me, resolving problems directly (fixing the infrastructure issue,
disabling unstable tests and creating blocker JIRAs to track the fix and
re-enable them asap, etc.) is (in most cases) better than working around
them (verify locally, manually check and judge the failure as "unrelated",
etc.), and I believe the proposal could help us pushing those more "real"
solutions forward.


Best Regards,
Yu


On Fri, 25 Jun 2021 at 10:58, Yangze Guo <karma...@gmail.com> wrote:

> Creating a blocker issue for the manually disabled tests sounds good to me.
>
> Minor: I'm still a bit worried about the commits merged before we fix
> the unstable tests can also break those tests. Instead of letting the
> assigners keep a look at all potentially related commits, they can
> maintain a branch that is periodically synced with the master branch
> while enabling the unstable test. So that they can catch the breaking
> changes asap.
>
> Best,
> Yangze Guo
>
> On Thu, Jun 24, 2021 at 9:52 PM Till Rohrmann <trohrm...@apache.org>
> wrote:
> >
> > I like the idea of creating a blocker issue for a disabled test. This
> will
> > force us to resolve it in a timely manner and it won't fall through the
> > cracks.
> >
> > Cheers,
> > Till
> >
> > On Thu, Jun 24, 2021 at 8:06 AM Jingsong Li <jingsongl...@gmail.com>
> wrote:
> >
> > > +1 to Xintong's proposal
> > >
> > > I also have some concerns about unstable cases.
> > >
> > > I think unstable cases can be divided into these types:
> > >
> > > - Force majeure: For example, network timeout, sudden environmental
> > > collapse, they are accidental and can always be solved by triggering
> azure
> > > again. Committers should wait for the next green azure.
> > >
> > > - Obvious mistakes: For example, some errors caused by obvious reasons
> may
> > > be repaired quickly. At this time, do we need to wait, or not wait and
> just
> > > ignore?
> > >
> > > - Difficult questions: These problems are very difficult to find. There
> > > will be no solution for a while and a half. We don't even know the
> reason.
> > > At this time, we should ignore it. (Maybe it's judged by the author of
> the
> > > case. But what about the old case whose author can't be found?)
> > >
> > > So, the ignored cases should be the block of the next release until the
> > > reason is found or the case is fixed?  We need to ensure that someone
> will
> > > take care of these cases, because there is no deepening of failed
> tests, no
> > > one may continue to pay attention to these cases.
> > >
> > > I think this guideline should consider these situations, and show how
> to
> > > solve them.
> > >
> > > Best,
> > > Jingsong
> > >
> > > On Thu, Jun 24, 2021 at 10:57 AM Jark Wu <imj...@gmail.com> wrote:
> > >
> > > > Thanks to Xintong for bringing up this topic, I'm +1 in general.
> > > >
> > > > However, I think it's still not very clear how we address the
> unstable
> > > > tests.
> > > > I think this is a very important part of this new guideline.
> > > >
> > > > According to the discussion above, if some tests are unstable, we can
> > > > manually disable it.
> > > > But I have some questions in my mind:
> > > > 1) Is the instability judged by the committer themselves or by some
> > > > metrics?
> > > > 2) Should we log the disable commit in the corresponding issue and
> > > increase
> > > > the priority?
> > > > 3) What if nobody looks into this issue and this becomes some
> potential
> > > > bugs released with the new version?
> > > > 4) If no person is actively working on the issue, who should
> re-enable
> > > it?
> > > > Would it block PRs again?
> > > >
> > > >
> > > > Best,
> > > > Jark
> > > >
> > > >
> > > > On Thu, 24 Jun 2021 at 10:04, Xintong Song <tonysong...@gmail.com>
> > > wrote:
> > > >
> > > > > Thanks all for the feedback.
> > > > >
> > > > > @Till @Yangze
> > > > >
> > > > > I'm also not convinced by the idea of having an exception for local
> > > > builds.
> > > > > We need to execute the entire build (or at least the failing stage)
> > > > > locally, to make sure subsequent test cases prevented by the
> failure
> > > one
> > > > > are all executed. In that case, it's probably easier to rerun the
> build
> > > > on
> > > > > azure than locally.
> > > > >
> > > > > Concerning disabling unstable test cases that regularly block PRs
> from
> > > > > merging, maybe we can say that such cases can only be disabled when
> > > > someone
> > > > > is actively looking into it, likely the person who disabled the
> case.
> > > If
> > > > > this person is no longer actively working on it, he/she should
> enable
> > > the
> > > > > case again no matter if it is fixed or not.
> > > > >
> > > > > @Jing
> > > > >
> > > > > Thanks for the suggestions.
> > > > >
> > > > > +1 to provide guidelines on handling test failures.
> > > > >
> > > > > 1. Report the test failures in the JIRA.
> > > > > >
> > > > >
> > > > > +1 on this. Currently, the release managers are monitoring the ci
> and
> > > > cron
> > > > > build instabilities and reporting them on JIRA. We should also
> > > encourage
> > > > > other contributors to do that for PRs.
> > > > >
> > > > > 2. Set a deadline to find out the root cause and solve the failure
> for
> > > > the
> > > > > > new created JIRA  because we could not block other commit merges
> for
> > > a
> > > > > long
> > > > > > time
> > > > > >
> > > > > 3. What to do if the JIRA has not made significant progress when
> > > reached
> > > > to
> > > > > > the deadline time?
> > > > >
> > > > >
> > > > > I'm not sure about these two. It feels a bit against the voluntary
> > > nature
> > > > > of open source projects.
> > > > >
> > > > > IMHO, frequent instabilities are more likely to be upgraded to the
> > > > critical
> > > > > / blocker priority, receive more attention and eventually get
> fixed.
> > > > > Release managers are also responsible for looking for assignees for
> > > such
> > > > > issues. If a case is still not fixed soonish, even with all these
> > > > efforts,
> > > > > I'm not sure how setting a deadline can help this.
> > > > >
> > > > > 4. If we disable the respective tests temporarily, we also need a
> > > > mechanism
> > > > > > to ensure the issue would be continued to be investigated in the
> > > > future.
> > > > > >
> > > > >
> > > > > +1. As mentioned above, we may consider disabling such tests iff
> > > someone
> > > > is
> > > > > actively working on it.
> > > > >
> > > > > Thank you~
> > > > >
> > > > > Xintong Song
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Jun 23, 2021 at 9:56 PM JING ZHANG <beyond1...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hi Xintong,
> > > > > > +1 to the proposal.
> > > > > > In order to better comply with the rule, it is necessary to
> describe
> > > > > what's
> > > > > > best practice if encountering test failure which seems unrelated
> with
> > > > the
> > > > > > current commits.
> > > > > > How to avoid merging PR with test failures and not blocking code
> > > > merging
> > > > > > for a long time?
> > > > > > I tried to think about the possible steps, and found there are
> some
> > > > > > detailed problems that need to be discussed in a step further:
> > > > > > 1. Report the test failures in the JIRA.
> > > > > > 2. Set a deadline to find out the root cause and solve the
> failure
> > > for
> > > > > the
> > > > > > new created JIRA  because we could not block other commit merges
> for
> > > a
> > > > > long
> > > > > > time
> > > > > >     When is a reasonable deadline here?
> > > > > > 3. What to do if the JIRA has not made significant progress when
> > > > reached
> > > > > to
> > > > > > the deadline time?
> > > > > >     There are several situations as follows, maybe different
> cases
> > > need
> > > > > > different approaches.
> > > > > >     1. the JIRA is non-assigned yet
> > > > > >     2. not found the root cause yet
> > > > > >     3. not found a good solution, but already found the root
> cause
> > > > > >     4. found a solution, but it needs more time to be done.
> > > > > > 4. If we disable the respective tests temporarily, we also need a
> > > > > mechanism
> > > > > > to ensure the issue would be continued to be investigated in the
> > > > future.
> > > > > >
> > > > > > Best regards,
> > > > > > JING ZHANG
> > > > > >
> > > > > > Stephan Ewen <se...@apache.org> 于2021年6月23日周三 下午8:16写道：
> > > > > >
> > > > > > > +1 to Xintong's proposal
> > > > > > >
> > > > > > > On Wed, Jun 23, 2021 at 1:53 PM Till Rohrmann <
> > > trohrm...@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > I would first try to not introduce the exception for local
> > > builds.
> > > > It
> > > > > > > makes
> > > > > > > > it quite hard for others to verify the build and to make sure
> > > that
> > > > > the
> > > > > > > > right things were executed. If we see that this becomes an
> issue
> > > > then
> > > > > > we
> > > > > > > > can revisit this idea.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Till
> > > > > > > >
> > > > > > > > On Wed, Jun 23, 2021 at 4:19 AM Yangze Guo <
> karma...@gmail.com>
> > > > > wrote:
> > > > > > > >
> > > > > > > > > +1 for appending this to community guidelines for merging
> PRs.
> > > > > > > > >
> > > > > > > > > @Till Rohrmann
> > > > > > > > > I agree that with this approach unstable tests will not
> block
> > > > other
> > > > > > > > > commit merges. However, it might be hard to prevent merging
> > > > commits
> > > > > > > > > that are related to those tests and should have been passed
> > > them.
> > > > > > It's
> > > > > > > > > true that this judgment can be made by the committers, but
> no
> > > one
> > > > > can
> > > > > > > > > ensure the judgment is always precise and so that we have
> this
> > > > > > > > > discussion thread.
> > > > > > > > >
> > > > > > > > > Regarding the unstable tests, how about adding another
> > > exception:
> > > > > > > > > committers verify it in their local environment and
> comment in
> > > > such
> > > > > > > > > cases?
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Yangze Guo
> > > > > > > > >
> > > > > > > > > On Tue, Jun 22, 2021 at 8:23 PM 刘建刚 <
> liujiangangp...@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > It is a good principle to run all tests successfully
> with any
> > > > > > change.
> > > > > > > > > This
> > > > > > > > > > means a lot for project's stability and development. I
> am big
> > > > +1
> > > > > > for
> > > > > > > > this
> > > > > > > > > > proposal.
> > > > > > > > > >
> > > > > > > > > > Best
> > > > > > > > > > liujiangang
> > > > > > > > > >
> > > > > > > > > > Till Rohrmann <trohrm...@apache.org> 于2021年6月22日周二
> 下午6:36写道：
> > > > > > > > > >
> > > > > > > > > > > One way to address the problem of regularly failing
> tests
> > > > that
> > > > > > > block
> > > > > > > > > > > merging of PRs is to disable the respective tests for
> the
> > > > time
> > > > > > > being.
> > > > > > > > > Of
> > > > > > > > > > > course, the failing test then needs to be fixed. But at
> > > least
> > > > > > that
> > > > > > > > way
> > > > > > > > > we
> > > > > > > > > > > would not block everyone from making progress.
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > Till
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Jun 22, 2021 at 12:00 PM Arvid Heise <
> > > > ar...@apache.org
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > I think this is overall a good idea. So +1 from my
> side.
> > > > > > > > > > > > However, I'd like to put a higher priority on
> > > > infrastructure
> > > > > > > then,
> > > > > > > > in
> > > > > > > > > > > > particular docker image/artifact caches.
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Jun 22, 2021 at 11:50 AM Till Rohrmann <
> > > > > > > > trohrm...@apache.org
> > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for bringing this topic to our attention
> > > Xintong.
> > > > I
> > > > > > > think
> > > > > > > > > your
> > > > > > > > > > > > > proposal makes a lot of sense and we should follow
> it.
> > > It
> > > > > > will
> > > > > > > > > give us
> > > > > > > > > > > > > confidence that our changes are working and it
> might
> > > be a
> > > > > > good
> > > > > > > > > > > incentive
> > > > > > > > > > > > to
> > > > > > > > > > > > > quickly fix build instabilities. Hence, +1.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > Till
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Jun 22, 2021 at 11:12 AM Xintong Song <
> > > > > > > > > tonysong...@gmail.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi everyone,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > In the past a couple of weeks, I've observed
> several
> > > > > times
> > > > > > > that
> > > > > > > > > PRs
> > > > > > > > > > > are
> > > > > > > > > > > > > > merged without a green light from the CI tests,
> where
> > > > > > failure
> > > > > > > > > cases
> > > > > > > > > > > are
> > > > > > > > > > > > > > considered *unrelated*. This may not always cause
> > > > > problems,
> > > > > > > but
> > > > > > > > > would
> > > > > > > > > > > > > > increase the chance of breaking our code base. In
> > > fact,
> > > > > it
> > > > > > > has
> > > > > > > > > > > occurred
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > me twice in the past few weeks that I had to
> revert a
> > > > > > commit
> > > > > > > > > which
> > > > > > > > > > > > breaks
> > > > > > > > > > > > > > the master branch due to this.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I think it would be nicer to enforce a stricter
> rule,
> > > > > that
> > > > > > no
> > > > > > > > PRs
> > > > > > > > > > > > should
> > > > > > > > > > > > > be
> > > > > > > > > > > > > > merged without passing CI.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The problems of merging PRs with "unrelated" test
> > > > > failures
> > > > > > > are:
> > > > > > > > > > > > > > - It's not always straightforward to tell
> whether a
> > > > test
> > > > > > > > > failures are
> > > > > > > > > > > > > > related or not.
> > > > > > > > > > > > > > - It prevents subsequent test cases from being
> > > > executed,
> > > > > > > which
> > > > > > > > > may
> > > > > > > > > > > fail
> > > > > > > > > > > > > > relating to the PR changes.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > To make things easier for the committers, the
> > > following
> > > > > > > > > exceptions
> > > > > > > > > > > > might
> > > > > > > > > > > > > be
> > > > > > > > > > > > > > considered acceptable.
> > > > > > > > > > > > > > - The PR has passed CI in the contributor's
> personal
> > > > > > > workspace.
> > > > > > > > > > > Please
> > > > > > > > > > > > > post
> > > > > > > > > > > > > > the link in such cases.
> > > > > > > > > > > > > > - The CI tests have been triggered multiple
> times, on
> > > > the
> > > > > > > same
> > > > > > > > > > > commit,
> > > > > > > > > > > > > and
> > > > > > > > > > > > > > each stage has at least passed for once. Please
> also
> > > > > > comment
> > > > > > > in
> > > > > > > > > such
> > > > > > > > > > > > > cases.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > If we all agree on this, I'd update the community
> > > > > > guidelines
> > > > > > > > for
> > > > > > > > > > > > merging
> > > > > > > > > > > > > > PRs wrt. this proposal. [1]
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Please let me know what do you think.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/Merging+Pull+Requests
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best, Jingsong Lee
> > >
>

Re: [DISCUSS] Do not merge PRs with "unrelated" test failures.

Reply via email to