On Mon, 16 Jan 2017 at 10:38:42 +0200, Lars Wirzenius wrote:
> A failing test means there's a bug. It might be in the test itself, or
> in the code being tested. It might be a bug in the test environment.

Nobody is disputing this, but we have bug severities for a reason: not
every bug is release-critical. If we gated packages on "has no known
bugs", we would never release anything.

> Personally, I'd really rather have unreliable tests fixed.

Of course, but it isn't always feasible to drop everything and fix an
unreliable test, or the bug that the test illustrates - the cause of an
intermittent bug is often hard to determine. Until that can happen, I'd
rather have the test sometimes or always fail, ideally reported as
XFAIL or TODO or something similar that distinguishes it from
"significant" failures, so that I can still use the information it
produces.
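As a concrete illustration (not something any particular suite in this
thread necessarily does today), a TAP-producing test suite can mark a
known-intermittent test with a TODO directive, so the failure is still
recorded but not treated as fatal; the test names here are invented:

    1..2
    ok 1 - basic checkout works
    not ok 2 - concurrent fetch # TODO intermittent, suspected thread-safety bug

A TAP consumer counts test 2 as an expected failure rather than an
error, which is the kind of "failed, but non-critical" signal I mean.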
For example, several of the ostree tests failed intermittently for a
long time, which turned out to be caused (we think) by a libsoup
thread-safety bug. If I had disabled those tests on ci.debian.net
altogether, then I wouldn't have been able to tell upstream "those
tests have stopped failing since the libsoup fix, so that fix is
probably all we need".

> Apart from social exclusion, unreliable tests waste a lot of time,
> effort, and mental energy.

Yes, and in an ideal world they wouldn't exist. This world is
demonstrably not ideal, and the code we release is not perfect (if it
were, we wouldn't need tests). Would you prefer it if packages whose
tests are not fully reliable just stopped running them altogether, or
even deleted them? I would much rather we run the tests, even the
imperfect ones, because CPU time is cheap and more information is
better than less information.

I've opened:

    autopkgtest: define Restrictions for tests that aren't suitable for gating CI
    http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=851556

and sent a patch to:

    autopkgtest: should be possible to ignore test restrictions by request
    https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=850494

in the hope that we can use those as a way to mark certain tests as
"failure is non-critical".
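To make the intent concrete: if something along those lines is
accepted, a known-unreliable test could be annotated in its package's
debian/tests/control roughly like this (the restriction name "flaky"
and the test name are illustrative; the exact spelling of the
restriction is what #851556 is meant to settle):

    Tests: concurrent-fetch
    Depends: @
    Restrictions: flaky

A gating CI such as ci.debian.net could then keep running the test for
its information value, while not counting its failures as regressions.

S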