Hi all,

I'd like to start a discussion about the stability of Pulsar CI.

Test suites in our CI often time out. This happens because, when a single
test fails, the entire suite is retried from the beginning (up to 3 times).
Example:
https://github.com/apache/pulsar/runs/7281063499?check_suite_focus=true

The command-line retries may sound helpful for making the CI green for a
given pull request, but they actually hide test failures (which may be
flaky tests or real issues!).
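
To put a rough number on how much the retries can hide, here is a minimal
sketch. It assumes each attempt of a suite fails independently with the same
probability, which is a simplification and not something measured on our CI;
the function name and the 20% figure are hypothetical.

```python
def masked_failure_rate(p_fail: float, max_attempts: int = 3) -> float:
    """Share of jobs that end green even though at least one attempt failed.

    Assumption (hypothetical model): every attempt of the suite fails
    independently with probability p_fail.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    # The job is green unless every attempt fails.
    p_green_with_retries = 1.0 - p_fail ** max_attempts
    # Without retries the job is green only if the first attempt passes.
    p_green_first_try = 1.0 - p_fail

    # The difference is the share of jobs where a failure was hidden.
    return p_green_with_retries - p_green_first_try


if __name__ == "__main__":
    # Hypothetical suite that fails 20% of the time, retried up to 3 times.
    print(f"{masked_failure_rate(0.2, 3):.1%} of jobs hide a failure")
```

Under that assumption, a suite that fails 20% of the time would look green
in about 99% of the jobs, while roughly 19% of the jobs hit a failure that
nobody sees.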

Another issue is that the failed test is hard to spot, so most of the time
the quickest solution is to blindly restart the failed jobs. This is not
the right approach, and it will make the CI less stable over time.

The plan would be:
- Remove the retries (see https://github.com/apache/pulsar/pull/16524)
- Create issues for the flaky tests
- Fix them or move them to quarantine

WDYT?

Thanks,
Nicolò Boschi
