On Thu, 12 Sept 2024 at 16:09, Peter Xu <pet...@redhat.com> wrote:
>
> On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote:
> > On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <faro...@suse.de> wrote:
> > > I don't think we're discussing total CI time at this point, so the math
> > > doesn't really add up. We're not looking into making the CI finish
> > > faster. We're looking into making migration-test finish faster. That
> > > would reduce timeouts in CI, speed-up make check and reduce the chance
> > > of random race conditions* affecting other people/staging runs.
> >
> > Right. The reason migration-test appears on my radar is because
> > it is very frequently the thing that shows up as "this sometimes
> > just fails or just times out and if you hit retry it goes away
> > again". That might not be migration-test's fault specifically,
> > because those retries tend to be certain CI configs (s390,
> > the i686-tci one), and I have some theories about what might be
> > causing it (e.g. build system runs 4 migration-tests in parallel,
> > which means 8 QEMU processes which is too many for the number
> > of host CPUs). But right now I look at CI job failures and my reaction
> > is "oh, it's the migration-test failing yet again" :-(
> >
> > For some examples from this week:
> >
> > https://gitlab.com/qemu-project/qemu/-/jobs/7802183144
> > https://gitlab.com/qemu-project/qemu/-/jobs/7799842373 <--------[1]
> > https://gitlab.com/qemu-project/qemu/-/jobs/7786579152 <--------[2]
> > https://gitlab.com/qemu-project/qemu/-/jobs/7786579155
>
> Ah right, the TIMEOUT is unfortunate, especially if tests can be run in
> parallel. It indeed sounds like no good way to finally solve.. I don't
> also see how speeding up / reducing tests in migration test would help, as
> that's (from some degree..) is the same as tuning the timeout value bigger.
> When the tests are less it'll fit into 480s window, but maybe it's too
> quick now we wonder whether we should shrink it to e.g. 90s, but then it
> can timeout again when on a busy host with less capability of concurrency.

For the TIMEOUT part on cross-i686-tci I plan to try this patch:
https://patchew.org/QEMU/20240912151003.2045031-1-peter.mayd...@linaro.org/
which makes 'make check' single-threaded; that will help to see
if the parallelism is a problem. (If it is then we might want
to do a more generalised approach rather than just for that
one CI job.)

> But indeed there're two ERRORs ([1,2] above).. I collected some more info
> here before the log expires:
>
> So.. it's the same test (multifd/tcp/plain/cancel) that is failing on
> different host / arch being tested. What is more weird is the two failures
> are different, the 2nd failure throw out a TLS error even though the test
> doesn't yet have tls involved.
>
> Fabiano, is this the issue you're looking at?
>
> Peter, do you think it'll be helpful if we temporarily mark this test as
> "slow" too so it's not run in CI (so we still run it ourselves when prepare
> migration PR, with the hope that it can reproduce)?

If you think that specific test is flaky then I think that's
probably a good idea. As usual with this kind of thing, probably
best to have a comment next to the test noting why and with
a URL to a gitlab issue for it, so we don't forget why we
disabled the test.

-- PMM