The biggest issue with the reliability of the CI is a deep design decision 
in the way the tests are set up. Many doctests have an inherent random 
element, mostly on purpose, to increase the number of code paths that are 
tested and thereby discover new bugs. The disadvantage is that some test 
runs will produce failures that are not connected to the changes in the PR. 
I don't really see anything that can be done at the level of the CI 
infrastructure to improve the situation, but I would be happy to hear new 
ideas.
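
To make the trade-off concrete, here is a toy Python sketch (not Sage code; 
`toy_isqrt`, the planted bug, and the seed range are all made up for 
illustration). A randomized check passes for most seeds and fails for a 
few, and re-running with the same seed reproduces the failure exactly:

```python
import math
import random


def toy_isqrt(n):
    """Hypothetical function under test, with a planted bug for one rare input."""
    if n == 777:  # made-up bug, only triggered by this single value
        return 0
    return math.isqrt(n)


def run_test(seed):
    """Run one randomized check; the seed fully determines the inputs drawn."""
    rng = random.Random(seed)
    for _ in range(20):
        n = rng.randrange(1000)
        r = toy_isqrt(n)
        # Defining property of the integer square root.
        if not (r * r <= n < (r + 1) ** 2):
            return False
    return True


def first_failing_seed(max_seed=10000):
    """Return the smallest seed below max_seed for which the check fails, or None."""
    for seed in range(max_seed):
        if not run_test(seed):
            return seed
    return None


if __name__ == "__main__":
    seed = first_failing_seed()
    print("first failing seed:", seed)
    # Same seed, same inputs, same outcome: the failure is reproducible.
    print("fails again with that seed:", not run_test(seed))
```

Most seeds pass here, which is exactly why such a failure looks unrelated 
to whatever PR happens to hit it. Recording the seed in the issue is what 
makes it possible to reproduce later.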

What would help is to a) open a new issue whenever you see an unrelated 
test failure (so that we can keep track of when and how it happens) and 
b) work on such issues. Searching for 'random', 'flaky' or 'CI' in the 
GitHub issues should bring up most of them, e.g. 
https://github.com/sagemath/sage/issues?q=is%3Aissue%20state%3Aopen%20%22random%22
 
There have been some recent pushes to resolve some of those random 
failures, notably by user202729.

I also have a half-working notebook that extracts the failing tests from 
the CI runs at https://github.com/sagemath/sage/pull/39100, which would 
help with statistics and point a) above.
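
This is not the notebook from that PR, just a minimal sketch of the idea 
in plain Python. It assumes that failing files show up in the log as 
summary lines of the form `sage -t <options> <path>  # <reason>`, possibly 
with a `--random-seed=<n>` option and a timestamp prefix; the regular 
expressions and the function names `failing_tests` and `count_by_file` are 
my own assumptions and may need adjusting to the actual log format.

```python
import re
from collections import Counter

# Assumed shape of a doctest summary line for a failing file, e.g.
#   sage -t --long --random-seed=123 src/sage/foo.py  # 1 doctest failed
SUMMARY_LINE = re.compile(
    r"sage -t\s+(?P<options>.*?)(?P<path>\S+\.(?:pyx?|rst))\s+#\s+(?P<reason>.+)$"
)
SEED = re.compile(r"--random-seed=(\d+)")


def failing_tests(log_text):
    """Return a list of (path, reason, seed) tuples found in one CI log."""
    results = []
    for line in log_text.splitlines():
        match = SUMMARY_LINE.search(line)
        if match is None:
            continue  # not a failure summary line
        seed = SEED.search(match.group("options"))
        results.append(
            (
                match.group("path"),
                match.group("reason").strip(),
                seed.group(1) if seed else None,
            )
        )
    return results


def count_by_file(logs):
    """Count, over several logs, in how many failure lines each file appears."""
    counts = Counter()
    for log_text in logs:
        for path, _reason, _seed in failing_tests(log_text):
            counts[path] += 1
    return counts
```

Feeding the logs of many runs into `count_by_file` would give a rough 
ranking of the files that fail most often, and the extracted seed can be 
copied into the corresponding issue.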

>  Is the number of CI minutes we use a month a problem for us?

No, not really. I don't quite remember what plan the SageMath org is on, 
but it is not limited in how many minutes per month we can use. Instead, 
we have a certain quota of 'runners' that can work in parallel. We do hit 
this limit sometimes, especially after a new release, when certain 
longer-running workflows are triggered and a lot of people update their 
branches. Then it takes a bit longer until the CI results for a PR roll 
in. We used to have far more serious issues in this regard, but by now it 
should work relatively smoothly.

There are two other sources of 'systematic' failures:
- Sometimes PRs introduce reproducible build errors on a small subset of 
systems. This then leads to failures of the CI runs that check those 
systems after a new release. Matthias used to invest a lot of time and 
energy into fixing those; I don't have the time to do this but will open an 
issue if I see such a failure and then after some time disable the failing 
system (recent example: https://github.com/sagemath/sage/pull/40675). 
- The buildbots that Volker runs on a new release differ in many respects 
from the GitHub CI runs. But Volker only looks at the buildbots (to my 
knowledge) when deciding whether a PR is okay to be merged. In particular, 
almost all recent failures of the linter workflow are a result of this 
discrepancy. My goal and hope is that we can retire the buildbots sooner 
rather than later.


On Tuesday, August 26, 2025 at 8:56:43 AM UTC+8 Kwankyu Lee wrote:

> 1. As far as I know, Matthias (currently off duty) did the most work in 
> setting up the original CI infrastructure. This is based on traditional 
> tools: make and docker.


Small clarification: Matthias introduced the "portability" workflows that 
check sage-the-distro on various systems and are run after a new release. 
All the remaining workflows (essentially everything that now runs for PRs) 
were initially contributed by me 4 or 5 years ago (with the idea of fully 
migrating to GitHub at some point).

  
