Good question. I agree that we should strive to fix flaky tests as a top priority.

For more background, please read the document that was compiled about a year 
ago:
https://docs.google.com/document/d/10lmn4pW1IsT_8D1ZE0vMjASX0HhjdGdjB794iyScwns/edit

and these mailing list discussions:
https://lists.apache.org/thread/6ntz6gfmksonp901hhs7plf4yzok2qk1
https://lists.apache.org/thread/47j2l9n9q2d5fdrjvnfjr471b35lh8dk

To coordinate the flaky test fixing, I have been creating reports of the most 
flaky tests and posting them to the mailing list every 1-2 months. The 
latest report:
https://lists.apache.org/thread/25wf94k2gqxlb0rn7rrtsh03h7z6zlj0
Thanks to the active contributors, many of the most flaky tests have already 
been fixed since the report was created. There has been a lot of good progress 
over the last year.

I'd be happy to hear how we could improve our ways of working to address the 
flaky tests asap. More contributors and improvement ideas are very welcome! 

When you encounter a flaky test, please report it and fix it:
1) Please search for an existing issue; you can find the flaky test issues by
searching for "flaky" or the test class name (without the package):
https://github.com/apache/pulsar/issues?q=is%3Aopen+flaky+sort%3Aupdated-desc

2) If there isn't an issue for a particular flaky test failure that you'd
like to fix, please create an issue using the "Flaky test" template at
https://github.com/apache/pulsar/issues/new/choose

3) Please comment on the issue to say that you are working on it.

Please join the #testing channel on Slack if you'd like to ask questions and 
share tips about reproducing flaky tests locally and how to fix them. There's 
more work to be done in reducing flakiness. I'm looking forward to more 
contributors joining the effort. Welcome!
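
One trick that often helps with local reproduction (a minimal sketch; the class
and method names below are made up) is to temporarily run the suspect test
method many times in a row with TestNG's invocationCount:

    import org.testng.annotations.Test;

    public class MyFlakyTest {   // hypothetical test class

        // Run the suspect method 50 times in a row; a single failure is
        // enough to confirm the flakiness locally.
        @Test(invocationCount = 50)
        public void testSuspectedFlakyBehavior() throws Exception {
            // ... body of the flaky test under investigation ...
        }
    }

You can then run just that class in its module, for example with something like
mvn test -Dtest=MyFlakyTest (adjust to the module's test setup).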

-Lari

On 2022/03/09 01:47:40 Baozi wrote:
> Thank you for your reply. I agree with the idea mentioned above.
> 
> 
> However, I have a question: why don't we fix all the flaky tests first? 
> After that, we can stop allowing any test to retry: 
> 1. Fix all existing flaky tests.
> 2. Then, in each PR, let the job fail as soon as any test fails.
> 3. If there are new or modified tests in the PR, they need to be run N times.
> 
> In this way, we can try our best to ensure that the tests in our master 
> branch are always stable.
> 
> 
> > On Mar 9, 2022, at 08:33:06, PengHui Li <peng...@apache.org> wrote:
> > 
> >> Alternatively, would it be possible to only retry tests that are
> > annotated as "flaky"? This would exclude all new tests, by default,
> > and it'd make it clearer which tests need to be fixed.
> > 
> > +1,
> > 
> > If a test in another group is confirmed to be flaky,
> > we can move it to the flaky group.
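> > 
> > For example (just a sketch, assuming the flaky group is the TestNG group
> > named "flaky"), moving a test would mean changing its group annotation:
> > 
> >     import org.testng.annotations.Test;
> > 
> >     public class SomeUnstableTest {   // hypothetical test class
> > 
> >         // before (hypothetical original group): @Test(groups = "broker")
> >         @Test(groups = "flaky")
> >         public void testSomethingUnstable() throws Exception {
> >             // unchanged test body
> >         }
> >     }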
> > 
> > Penghui
> > 
> > On Wed, Mar 9, 2022 at 4:27 AM Michael Marshall <mmarsh...@apache.org>
> > wrote:
> > 
> >> This is a great discussion, thanks for starting it.
> >> 
> >> I support turning off retries for new tests, as Lari suggests.
> >> However, I will point out that this change is not the same as the
> >> initially proposed change to increase the number of times a new test
> >> is run (point (1) in the original email for this thread). I am not
> >> sure if we need to run new tests extra times to test for flakiness,
> >> but I definitely agree that we should not default to retrying newly
> >> added tests.
> >> 
> >> Alternatively, would it be possible to only retry tests that are
> >> annotated as "flaky"? This would exclude all new tests, by default,
> >> and it'd make it clearer which tests need to be fixed.
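> >> 
> >> A rough sketch of what that could look like (the class name and retry
> >> limit below are only illustrative, not Pulsar's actual implementation;
> >> the analyzer would still need to be wired in via a listener or
> >> annotation transformer):
> >> 
> >>     import java.util.Arrays;
> >> 
> >>     import org.testng.IRetryAnalyzer;
> >>     import org.testng.ITestResult;
> >> 
> >>     public class FlakyOnlyRetryAnalyzer implements IRetryAnalyzer {
> >>         private static final int MAX_RETRIES = 3;
> >>         private int attempts = 0;
> >> 
> >>         @Override
> >>         public boolean retry(ITestResult result) {
> >>             // Only tests explicitly marked with the "flaky" group are
> >>             // retried; everything else, including new tests, fails on
> >>             // the first failure.
> >>             boolean markedFlaky = Arrays.asList(result.getMethod().getGroups())
> >>                     .contains("flaky");
> >>             return markedFlaky && attempts++ < MAX_RETRIES;
> >>         }
> >>     }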
> >> 
> >>> I see. Now we retry all tests by default. What's the reason?
> >> 
> >> Test retries are enabled precisely because some tests are flaky.
> >> 
> >> Thanks,
> >> Michael
> >> 
> >> 
> >> On Tue, Mar 8, 2022 at 7:57 AM Baozi <wudixiaolong...@icloud.com.invalid>
> >> wrote:
> >>> 
> >>> Good job!
> >>> 
> >>> I see. Now we retry all tests by default. What's the reason?
> >>> 
> >>> The process I understand is:
> >>> We don't need to retry the unit tests. Once a test fails, let the job fail.
> >>> The difference is that new or modified tests need to be run many times.
> >>> As long as there is one failure, the job will fail.
> >>> 
> >>>> On Mar 8, 2022, at 21:09:56, Lari Hotari <lhot...@apache.org> wrote:
> >>>> 
> >>>> Here's a PR to disable test retries for new and modified tests:
> >>>> https://github.com/apache/pulsar/pull/14607
> >>>> 
> >>>> Please review and provide feedback.
> >>>> 
> >>>> -Lari
> >>>> 
> >>>> On 2022/03/08 11:24:13 Lari Hotari wrote:
> >>>>> I think this is a good idea. In CI, the tests will be retried multiple
> >>>>> times at multiple levels: at the test class / test method level and at
> >>>>> the build level.
> >>>>> 
> >>>>> I'd like to get rid of test retries completely, since that is the reason
> >>>>> why the flakiness problem gets worse over time. This was also covered in
> >>>>> a proposal that was made about a year ago:
> >>>>> 
> >>>>> https://docs.google.com/document/d/10lmn4pW1IsT_8D1ZE0vMjASX0HhjdGdjB794iyScwns/edit
> >>>>> 
> >>>>> "since the current build tolerates flaky tests, the current test retry
> >>>>> solution leads to more flaky tests being added to the code base over
> >>>>> time."
> >>>>> 
> >>>>> It would be possible to make our CI behave in a way that test method
> >>>>> level retries are disabled for test classes that are modified or added
> >>>>> in a PR.
> >>>>> 
> >>>>> In GitHub Actions workflows we already have a solution for detecting
> >>>>> the changed files.
> >>>>> 
> >>>>>     - name: Detect changed files
> >>>>>       id:   changes
> >>>>>       uses: apache/pulsar-test-infra/paths-filter@master
> >>>>>       with:
> >>>>>         filters: .github/changes-filter.yaml
> >>>>> 
> >>>>> It would be possible to make the build work in a way that it passes the
> >>>>> list of changed tests in a system property or environment variable. This
> >>>>> could be passed to org.apache.pulsar.tests.RetryAnalyzer, which could be
> >>>>> modified to disable retries for the tests listed in the parameter that
> >>>>> was passed via a system property or environment value.
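> >>>>> 
> >>>>> A rough sketch of that change (the property name, class name and retry
> >>>>> limit here are just illustrative assumptions, not the current code):
> >>>>> 
> >>>>>     import java.util.Arrays;
> >>>>>     import java.util.HashSet;
> >>>>>     import java.util.Set;
> >>>>> 
> >>>>>     import org.testng.IRetryAnalyzer;
> >>>>>     import org.testng.ITestResult;
> >>>>> 
> >>>>>     public class ChangedTestsAwareRetryAnalyzer implements IRetryAnalyzer {
> >>>>>         // e.g. -DtestRetryExcludedClasses=org.example.FooTest,org.example.BarTest
> >>>>>         private static final Set<String> NO_RETRY_CLASSES = new HashSet<>(
> >>>>>                 Arrays.asList(System.getProperty("testRetryExcludedClasses", "")
> >>>>>                         .split(",")));
> >>>>>         private static final int MAX_RETRIES = 3;
> >>>>>         private int attempts = 0;
> >>>>> 
> >>>>>         @Override
> >>>>>         public boolean retry(ITestResult result) {
> >>>>>             String testClass = result.getTestClass().getRealClass().getName();
> >>>>>             if (NO_RETRY_CLASSES.contains(testClass)) {
> >>>>>                 // changed or added tests must pass without any retries
> >>>>>                 return false;
> >>>>>             }
> >>>>>             return attempts++ < MAX_RETRIES;
> >>>>>         }
> >>>>>     }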
> >>>>> 
> >>>>> Besides this, it would be necessary to disable the build-level test
> >>>>> retry loop (the build/retry.sh solution) when any test is modified in
> >>>>> the module. It would be possible to check the failures and skip retrying
> >>>>> if one of the failures is one of the changed tests.
> >>>>> 
> >>>>> -Lari
> >>>>> 
> >>>>> 
> >>>>> On Mon, Mar 7, 2022 at 5:12 AM 包子 <wudixiaolong...@icloud.com.invalid>
> >>>>> wrote:
> >>>>> 
> >>>>>> Hi, I want to discuss how to improve the stability of unit testing. I
> >>>>>> found that most flaky unit tests can be reproduced locally and only
> >>>>>> need to be executed a few more times.
> >>>>>> 
> >>>>>> Can we add the following mandatory constraints to ensure the stability
> >>>>>> of the unit testing of the code?
> >>>>>> 1. If new / modified unit tests are included in a PR, they need to be
> >>>>>> run continuously on CI more than N times.
> >>>>>> 2. We can write scripts and parse git change records to find new or
> >>>>>> modified unit tests.
> >>>>>> 
> >>>>>> 
> >>>>> 
> >>> 
> >> 
> 
> 
