[sage-devel] Re: Documentation and state of Sage CI

'tobia...@gmx.de' via sage-devel Tue, 26 Aug 2025 06:56:04 -0700

The biggest issue with the reliability of the CI is a deep design decision
in the way the tests are set up. Many doctests have an inherent random
element, and this is mostly on purpose: it increases the number of code
paths that are tested and thereby helps discover new bugs. The disadvantage
is that some test runs will produce failures that are not connected to the
changes of the PR. I don't really see anything that can be done on the
level of the CI infrastructure to improve the situation, but would be happy
to get new ideas.
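
To illustrate the trade-off, here is a minimal toy sketch in plain Python. It
is not Sage's doctest framework; `flaky_check`, the "bad" input 13 and the
seed range are all made up for illustration. It shows why a randomized test
fails only on some runs, and why recording the seed makes such a failure
reproducible.

```python
import random


def flaky_check(seed):
    """Toy stand-in for a randomized test (hypothetical, not Sage code)."""
    rng = random.Random(seed)   # private generator, fully determined by the seed
    n = rng.randrange(100)      # random input, as a randomized doctest would draw
    # Hypothetical bug: the code under test mishandles exactly one input.
    return n != 13


def failing_seeds(seeds):
    """Return the seeds for which the toy check fails."""
    return [s for s in seeds if not flaky_check(s)]


if __name__ == "__main__":
    bad = failing_seeds(range(2000))
    print(f"{len(bad)} of 2000 seeds hit the rare failing input")
    # Re-running with a recorded seed reproduces the same outcome every time.
    for s in bad[:3]:
        print(s, flaky_check(s))
```

Most seeds pass, so a failure in a PR's CI run says little about the PR
itself; with the seed attached to the issue report, though, the failure can
be replayed.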

What would help is to a) open a new issue whenever you see an unrelated
test failure (so that we can keep track of when/how it happens) and b) work
on such issues (searching for 'random' or 'flaky' or 'CI' in the GitHub
issues should bring up most of them, e.g.
https://github.com/sagemath/sage/issues?q=is%3Aissue%20state%3Aopen%20%22random%22).
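
If it helps, the search links for all three keywords can be generated with a
few lines of Python. This is only a convenience sketch; the helper name
`issue_search_url` is made up, and the query shape is copied from the link
above.

```python
from urllib.parse import quote

BASE = "https://github.com/sagemath/sage/issues"


def issue_search_url(keyword):
    """Build a search URL for open issues mentioning the quoted keyword."""
    query = f'is:issue state:open "{keyword}"'
    # safe='' percent-encodes every reserved character, matching the link above.
    return f"{BASE}?q={quote(query, safe='')}"


if __name__ == "__main__":
    for kw in ("random", "flaky", "CI"):
        print(issue_search_url(kw))
```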

There were some recent pushes to resolve some of those random failures,
notably by user202729.

I also have a half-working notebook that extracts the failing tests from
the CI runs at https://github.com/sagemath/sage/pull/39100, which would
help with statistics and point a) above.
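
For anyone who wants to experiment with the statistics side, the core idea
can be sketched as follows. This is not the notebook from the PR; the log
line shape matched by `FAILURE_RE` is an assumption about what the doctest
summary looks like, and the file paths in `sample` are invented.

```python
import re
from collections import Counter

# Assumed summary line shape (real CI logs may differ):
#   sage -t <options> <path>  # <N> doctest(s) failed
FAILURE_RE = re.compile(
    r"^sage -t .*?(\S+\.(?:py|pyx|rst))\s+# (\d+) doctests? failed"
)


def count_failures(log_lines):
    """Sum the reported doctest failures per file over all given log lines."""
    counts = Counter()
    for line in log_lines:
        m = FAILURE_RE.match(line.strip())
        if m:
            counts[m.group(1)] += int(m.group(2))
    return counts


# Invented sample lines, only to show the expected input shape.
sample = [
    "sage -t --long src/sage/foo/bar.py  # 1 doctest failed",
    "sage -t --long src/sage/foo/baz.pyx  # 2 doctests failed",
    "sage -t --long src/sage/foo/bar.py  # 1 doctest failed",
    "some unrelated log line",
]

if __name__ == "__main__":
    for path, n in count_failures(sample).most_common():
        print(path, n)
```

Aggregating such counts over many CI runs would show which files fail
repeatedly without being touched by the PR, which is the information needed
for point a).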

> Is the number of CI minutes we use a month a problem for us?

No, not really. I don't quite remember what plan the SageMath org is on,
but it's not limited in how many minutes per month we can use; instead we
have a certain quota of 'runners' that can work in parallel. And we do hit
this limit sometimes, especially after a new release when certain
longer-running jobs are triggered and a lot of people update their branches.
Then it takes a bit longer until the CI results for a PR roll in. We used to
have far more serious issues in this regard, but by now it should work
relatively smoothly.

There are two other sources of 'systematic' failures:
- Sometimes PRs introduce reproducible build errors on a small subset of
systems. This then leads to failures of the CI runs that check those
systems after a new release. Matthias used to invest a lot of time and
energy into fixing those; I don't have the time to do this but will open an
issue if I see such a failure and then after some time disable the failing
system (recent example: https://github.com/sagemath/sage/pull/40675).
- The buildbots tested by Volker on a new release differ in many aspects
from the GitHub CI runs. But Volker only looks at the buildbots (to my
knowledge) when deciding if a PR is okay to be merged. In particular,
almost all recent failures of the linter workflow are a result of this
discrepancy. My goal and hope is that we can retire the buildbots sooner
rather than later.

On Tuesday, August 26, 2025 at 8:56:43 AM UTC+8 Kwankyu Lee wrote:

> 1. As far as I know, Matthias (currently off duty) did the most work in
> setting up the original CI infrastructure. This is based on traditional
> tools: make and docker.

Small clarification: Matthias introduced the "portability" workflows that
check sage-the-distro on various systems and are run after a new release.
All the remaining workflows (essentially everything that runs now for PRs)
were initially contributed by me 4 or 5 years ago (with the idea of fully
migrating to GitHub at some point).
