On Thu, Sep 12, 2024 at 04:14:20PM +0100, Peter Maydell wrote:
> On Thu, 12 Sept 2024 at 16:09, Peter Xu <pet...@redhat.com> wrote:
> >
> > On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote:
> > > On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <faro...@suse.de> wrote:
> > > > I don't think we're discussing total CI time at this point, so the math
> > > > doesn't really add up. We're not looking into making the CI finish
> > > > faster. We're looking into making migration-test finish faster. That
> > > > would reduce timeouts in CI, speed-up make check and reduce the chance
> > > > of random race conditions* affecting other people/staging runs.
> > >
> > > Right. The reason migration-test appears on my radar is because
> > > it is very frequently the thing that shows up as "this sometimes
> > > just fails or just times out and if you hit retry it goes away
> > > again". That might not be migration-test's fault specifically,
> > > because those retries tend to be certain CI configs (s390,
> > > the i686-tci one), and I have some theories about what might be
> > > causing it (e.g. build system runs 4 migration-tests in parallel,
> > > which means 8 QEMU processes which is too many for the number
> > > of host CPUs). But right now I look at CI job failures and my reaction
> > > is "oh, it's the migration-test failing yet again" :-(
> > >
> > > For some examples from this week:
> > >
> > > https://gitlab.com/qemu-project/qemu/-/jobs/7802183144
> > > https://gitlab.com/qemu-project/qemu/-/jobs/7799842373 <--------[1]
> > > https://gitlab.com/qemu-project/qemu/-/jobs/7786579152 <--------[2]
> > > https://gitlab.com/qemu-project/qemu/-/jobs/7786579155
> >
> > Ah right, the TIMEOUT is unfortunate, especially if tests can be run in
> > parallel. It indeed sounds like no good way to finally solve.. I don't
> > also see how speeding up / reducing tests in migration test would help, as
> > that's (from some degree..) is the same as tuning the timeout value bigger.
> > When the tests are less it'll fit into 480s window, but maybe it's too
> > quick now we wonder whether we should shrink it to e.g. 90s, but then it
> > can timeout again when on a busy host with less capability of concurrency.
>
> For the TIMEOUT part on cross-i686-tci I plan to try this patch:
> https://patchew.org/QEMU/20240912151003.2045031-1-peter.mayd...@linaro.org/
> which makes 'make check' single-threaded; that will help to see
> if the parallelism is a problem. (If it is then we might want
> to do a more generalised approach rather than just for that one
> CI job.)

Sounds good.

>
> > But indeed there're two ERRORs ([1,2] above).. I collected some more info
> > here before the log expires:
>
> > So.. it's the same test (multifd/tcp/plain/cancel) that is failing on
> > different host / arch being tested. What is more weird is the two failures
> > are different, the 2nd failure throw out a TLS error even though the test
> > doesn't yet have tls involved.
> >
> > Fabiano, is this the issue you're looking at?
> >
> > Peter, do you think it'll be helpful if we temporarily mark this test as
> > "slow" too so it's not run in CI (so we still run it ourselves when prepare
> > migration PR, with the hope that it can reproduce)?
>
> If you think that specific test is flaky then I think that's
> probably a good idea. As usual with this kind of thing,
> probably best to have a comment next to the test noting
> why and with a URL to a gitlab issue for it, so we don't
> forget why we disabled the test.

Looks like Fabiano root-caused the issue. We'll see how that goes, or we
can prepare a patch to make it optional with the comments in place.

Thanks,

-- 
Peter Xu