To slightly expand on the above: we do have tests with a fixed seed (build.yml) 
and with a variable random seed (ci-meson.yml). But even with a fixed seed you 
sometimes get failures, often non-reproducible ones. Some of these issues have 
been known for a long time (e.g. 
https://github.com/sagemath/sage/issues/29528 has been open for 5 years now).
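For readers less familiar with the two setups, here is a minimal sketch of the 
difference (not the actual build.yml / ci-meson.yml contents; the job names and 
the elided build steps are placeholders, and it assumes the doctester is 
invoked via `sage -t`, which accepts `--random-seed`):

    # Sketch only, not the actual workflow files.
    name: doctest-seed-sketch
    on: [push, pull_request]

    jobs:
      doctest-fixed-seed:
        runs-on: ubuntu-latest
        steps:
          # ... checkout and build (or cache restore) of Sage elided ...
          - name: Run doctests with a fixed seed
            # Failures are reproducible, but only this one seed is exercised.
            run: ./sage -t --all --random-seed=0

      doctest-random-seed:
        runs-on: ubuntu-latest
        steps:
          # ... checkout and build (or cache restore) of Sage elided ...
          - name: Run doctests with a per-run seed
            # Different random inputs on every run, at the cost of occasional
            # non-reproducible failures.
            run: ./sage -t --all --random-seed=${{ github.run_id }}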
A related question would be: shall we temporarily disable tests that are 
known to randomly fail?
Advantage: less noise due to random failures
Disadvantage: less coverage

> I wonder if the stranger/unreproducible failures might be caused by some 
> faulty caching on the CI server, but I don't know enough about how the CI 
> server is configured and what is cached between builds to say if that might 
> be the case.

From my experience, these issues are almost never specific to CI (i.e. the 
same error could in principle be reproduced by running the same commands 
locally on a developer's machine). The only exceptions are the "docker 
pull/push" issues that you sometimes see. Those come from the design decision 
to run the CI in a new docker container. Fixing those issues by redesigning 
the corresponding workflows would be desirable (see below). 

> I think we should not drop support of a system (platform) failing 
> because of bugs introduced by PRs.

In theory, I agree. In practice, however, not a single one of the portability 
issues that I've opened has been fixed. And there is no point in burning CPU 
cycles if we already know the build will fail on that system; it's also 
confusing to people looking at the CI results, since it's not clear that this 
is a known issue.
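As an illustration of what that could look like in practice (a sketch only, 
not the current workflow configuration; the platform names are placeholders), 
a known-broken platform can be skipped in a build matrix while keeping a 
visible pointer to the tracking issue:

    jobs:
      build:
        strategy:
          fail-fast: false
          matrix:
            distro: [ubuntu-24.04, fedora-41, opensuse-tumbleweed]
            exclude:
              # Known portability failure, see the corresponding GitHub issue;
              # re-enable once a fix lands.
              - distro: opensuse-tumbleweed
        runs-on: ubuntu-latest
        steps:
          - run: echo "build for ${{ matrix.distro }} (build steps elided)"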

> It would be nice to have a GitHub label for these kinds of issues so 
> they can be found more easily. I'm not sure who has permissions to add new 
> labels.

Good idea! This needs to be done by one of the GitHub org admins.

I will see if I can find some time to document the CI infrastructure a bit. In 
my opinion, its design is pretty stable by now, and at least I don't have any 
major plans for further restructuring in the near future. A few items I would 
like to work on:
- Migrate the "long" tests to meson 
(https://github.com/sagemath/sage/pull/40158)
- Redesign the ci-distro workflow to work directly in the system's container 
instead of going through docker + tox, similar to how ci.yml works (a rough 
sketch follows this list). For macOS this was done in 
https://github.com/sagemath/sage/pull/40516.
- Rework `dist.yml` to be based on meson, build wheels for sagelib, and make a 
few general stability improvements (e.g. use the gh CLI/action to create the 
release)
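To make the ci-distro point concrete, here is a rough sketch of running a job 
directly in a distribution's container via the `container:` key instead of 
building a custom image through docker + tox (package names and build steps 
are placeholders, not the actual workflow contents):

    jobs:
      fedora-latest:
        runs-on: ubuntu-latest
        # The job runs inside the distribution's own image, so no custom
        # Docker image needs to be built, pulled or pushed.
        container: fedora:latest
        steps:
          - name: Install system prerequisites
            run: dnf install -y git gcc gcc-c++ make   # placeholder package list
          - uses: actions/checkout@v4
          # ... meson-based configure/build/test steps as in ci.yml, elided ...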

On Wednesday, August 27, 2025 at 11:28:15 AM UTC+8 Vincent Macri wrote:

>
>    - My understanding is that all CI runs now use the same random seed
>    
> We do use random seeds, and we should.
>
> https://github.com/sagemath/sage/issues/40632 was found because we have a 
> test that generates a random input to a function and 0 is not valid input 
> for that function. For some seeds, like the one used by the CI for one run 
> which found this bug, the test generated 0 as the input value. This is a 
> bug and so I think this demonstrates that we should use random seeds. It 
> was also unrelated to the PR. Perhaps one could run tests for both a fixed 
> and random seed to avoid unrelated failures while still testing random 
> inputs, but doubling the amount of CI tests we run seems somewhat wasteful.
>
>
>    1. Sometimes PRs introduce reproducible *build errors* on a small 
>    subset of systems ... then after some time disable the failing system
>    2. I think we should not drop support of a system (platform) failing 
>    because of *bugs* introduced by PRs.
>    
>
> I think you two are saying two different things (I added emphasis for the 
> difference) and I generally agree with both.
>
> If Sage fails to build on a system, especially an old system that has 
> reached end of life (which I think is the case for what Tobias was talking 
> about) I don't think we need to spend time trying to support it. For 
> example, we regularly drop support for old Python versions, but there are 
> surely some systems where newer Python versions aren't available. I think 
> that's fine. Ideally, we would have removed the unsupported systems from CI 
> when we dropped support for the old Python. Better yet would be if the CI 
> knew which versions of those distros we test are supported without us 
> having to update the CI configuration whenever there is a new or EOL 
> Ubuntu/Fedora/whatever.
>
> As for bugs, if the system builds Sage and then a test fails after it 
> successfully builds, that's a problem. If Sage builds but has failing tests 
> on, say, the most recent Fedora, that's something we obviously should not 
> ignore. If it builds and tests fail on an older system, maybe some Ubuntu 
> LTS that's past EOL, I think it's still worth investigating why the failure 
> occurred. Maybe it's relevant to Sage and tells us something, maybe it 
> doesn't. Depending on what test failed and why, we evaluate whether it's a 
> regression or just a matter of trying to run Sage on a system that is past 
> EOL where a recent enough version of some compiler/library is unavailable. 
> In a perfect world we would track down what specific version of what 
> library is causing the bug and update the build dependencies to not allow 
> that version. In practice I think that might be an unrealistic amount of 
> work for us that provides little practical benefit.
>
