The biggest issue with the reliability of the CI is a deep design decision in the way the tests are set up. Many doctests have an inherent random element, and this is mostly on purpose: it increases the number of code paths that get tested and thereby uncovers new bugs. The disadvantage is that some test runs will produce failures that are not connected to the changes in the PR. I don't really see anything that can be done on the level of the CI infrastructure to improve the situation, but I would be happy to hear new ideas.
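To illustrate the trade-off described above, here is a minimal sketch in plain Python (not Sage's doctest framework). The function `buggy_abs`, the helper names, the input range and the number of trials are all hypothetical and chosen only for illustration. It shows how a randomized test can pass for most seeds, fail for a few, and be reproduced exactly once the seed is known.

```python
import random


def buggy_abs(n):
    """Hypothetical function under test: wrong for exactly one input."""
    if n == 7:
        return -7  # the rare bug that only some random runs will hit
    return n if n >= 0 else -n


def run_randomized_test(seed, trials=3):
    """Check buggy_abs on random inputs drawn from a seeded generator.

    Returns the first failing input, or None if all trials pass.
    """
    rng = random.Random(seed)  # private generator, fully determined by the seed
    for _ in range(trials):
        n = rng.randint(-10, 10)
        if buggy_abs(n) != abs(n):
            return n
    return None


def failing_seeds(seeds):
    """Return the seeds for which the randomized test fails."""
    return [s for s in seeds if run_randomized_test(s) is not None]


if __name__ == "__main__":
    bad = failing_seeds(range(200))
    print(f"{len(bad)} of 200 seeds expose the bug")
    if bad:
        # Re-running with a recorded seed reproduces the same failure.
        print("seed", bad[0], "fails on input", run_randomized_test(bad[0]))
```

Most seeds never draw the bad input, so a PR that is unrelated to `buggy_abs` can still turn red on an unlucky run. Recording the seed of a failing run is what turns such a failure into a reproducible bug report.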
What would help is to

a) open a new issue whenever you see an unrelated test failure (so that we can keep track of when and how it happens), and
b) work on such issues (searching for 'random', 'flaky' or 'CI' in the GitHub issues should bring up most of them, e.g. https://github.com/sagemath/sage/issues?q=is%3Aissue%20state%3Aopen%20%22random%22).

There were some recent pushes to resolve some of those random failures, notably by user202729. I also have a half-working notebook that extracts the failing tests from the CI runs at https://github.com/sagemath/sage/pull/39100, which would help with statistics and with point a) above.

> Is the number of CI minutes we use a month a problem for us?

No, not really. I don't quite remember what plan the SageMath org is on, but we are not limited in how many minutes per month we can use; instead we have a certain quota of 'runners' that can work in parallel. We do hit this limit sometimes, especially after a new release, when certain longer-running workflows are triggered and a lot of people update their branches. Then it takes a bit longer until the CI results for a PR roll in. We used to have far more serious issues in this regard, but by now it should work relatively smoothly.

There are two other sources of 'systematic' failures:

- Sometimes PRs introduce reproducible build errors on a small subset of systems. This then leads to failures of the CI runs that check those systems after a new release. Matthias used to invest a lot of time and energy into fixing those; I don't have the time to do this, but I will open an issue if I see such a failure and then, after some time, disable the failing system (recent example: https://github.com/sagemath/sage/pull/40675).
- The buildbots tested by Volker on a new release differ in many respects from the GitHub CI runs. But Volker only looks at the buildbots (to my knowledge) when deciding whether a PR is okay to be merged.
In particular, almost all recent failures of the linter workflow are a result of this discrepancy. My goal and hope is that we can retire the buildbots sooner rather than later.

On Tuesday, August 26, 2025 at 8:56:43 AM UTC+8 Kwankyu Lee wrote:

> 1. As far as I know, Matthias (currently off duty) did the most work in setting up the original CI infrastructure. This is based on traditional tools: make and docker.

Small clarification: Matthias introduced the "portability" workflows that check sage-the-distro on various systems and are run after a new release. All the remaining workflows (essentially everything that runs now for PRs) were initially contributed by me 4 or 5 years ago (with the idea to fully migrate to GitHub at some point).