Hello.

Our CI has been failing for quite a long time (maybe a few months). On top
of that, new flaky tests keep failing our PRs from time to time. We did
fix some of them, but I do not think it is possible to avoid them entirely.

I propose a simple approach/convention to quickly recover CI from such
intermittent problems as they appear. When we detect a flaky test, let's do
two simple PRs:

1. [Disable Flaky Test] PR will skip the problem tests and flag them so that
they are easy to find in the image.

testWithIntermittentIssue
    self flag: #flakyTest.
    self skip.
    "the rest of test code"

This PR is supposed to be merged quickly to avoid failures in new PRs.

2. [Enable Flaky Test] PR will re-enable the tests.
It will record the issue and track the current "flaky state".

For example, I created two PRs for Zinc tests:
- [Disable Flaky Tests] Disable two Zinc flaky tests
<https://github.com/pharo-project/pharo/pull/6497>
- [Enable Flaky Tests] Enable two Zinc flaky tests
<https://github.com/pharo-project/pharo/pull/6498>
The enable PR here will always be red until we integrate a fix.

Of course, some issues are trivial to fix, such as increasing the allowed
time for a test, and we should just push the fix. But when it is not clear
what to do, it is better to remove the case from the overall CI and isolate
the issue in a dedicated PR, so that expert devs can look at the problem
without interrupting other people's contributions.

I think this approach is very easy for anyone to follow, and it could even
be automated.
That's my idea.
Best regards,
Denis
