On Thu, 12 Sept 2024 at 16:09, Peter Xu <pet...@redhat.com> wrote:
>
> On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote:
> > On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <faro...@suse.de> wrote:
> > > I don't think we're discussing total CI time at this point, so the math
> > > doesn't really add up. We're not looking into making the CI finish
> > > faster. We're looking into making migration-test finish faster. That
> > > would reduce timeouts in CI, speed-up make check and reduce the chance
> > > of random race conditions* affecting other people/staging runs.
> >
> > Right. The reason migration-test appears on my radar is because
> > it is very frequently the thing that shows up as "this sometimes
> > just fails or just times out and if you hit retry it goes away
> > again". That might not be migration-test's fault specifically,
> > because those retries tend to be certain CI configs (s390,
> > the i686-tci one), and I have some theories about what might be
> > causing it (e.g. build system runs 4 migration-tests in parallel,
> > which means 8 QEMU processes which is too many for the number
> > of host CPUs). But right now I look at CI job failures and my reaction
> > is "oh, it's the migration-test failing yet again" :-(
> >
> > For some examples from this week:
> >
> > https://gitlab.com/qemu-project/qemu/-/jobs/7802183144
> > https://gitlab.com/qemu-project/qemu/-/jobs/7799842373  <--------[1]
> > https://gitlab.com/qemu-project/qemu/-/jobs/7786579152  <--------[2]
> > https://gitlab.com/qemu-project/qemu/-/jobs/7786579155
>
> Ah right, the TIMEOUT is unfortunate, especially if tests can be run in
> parallel.  It indeed sounds like there is no good way to solve it for
> good. I also don't see how speeding up / reducing tests in migration-test
> would help, as that is (to some degree) the same as making the timeout
> value bigger. With fewer tests it will fit into the 480s window, but
> then, if it is so quick that we wonder whether to shrink the timeout to
> e.g. 90s, it can time out again on a busy host with less capacity for
> concurrency.

For the TIMEOUT part on cross-i686-tci I plan to try this patch:
https://patchew.org/QEMU/20240912151003.2045031-1-peter.mayd...@linaro.org/
which makes 'make check' single-threaded; that will help to see
if the parallelism is a problem. (If it is then we might want
to do a more generalised approach rather than just for that one
CI job.)
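
As a rough sketch of the arithmetic behind the parallelism theory quoted
above (4 migration-tests in parallel means 8 QEMU processes): this assumes
each migration-test runs two QEMU processes, and the host CPU counts are
hypothetical, not measured from the CI runners.

```python
def qemu_process_count(parallel_tests: int, procs_per_test: int = 2) -> int:
    """Total QEMU processes, assuming each migration-test runs a source
    and a destination QEMU (procs_per_test=2 is an assumption)."""
    return parallel_tests * procs_per_test


def is_oversubscribed(parallel_tests: int, host_cpus: int,
                      procs_per_test: int = 2) -> bool:
    """True when there are more QEMU processes than host CPUs."""
    return qemu_process_count(parallel_tests, procs_per_test) > host_cpus


# Hypothetical runner with 4 CPUs: 4 parallel tests oversubscribe it,
# while a single-threaded 'make check' does not.
print(is_oversubscribed(parallel_tests=4, host_cpus=4))
print(is_oversubscribed(parallel_tests=1, host_cpus=4))
```

If the single-threaded run stops timing out, that would be consistent with
this picture, though it would not prove it.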

> But indeed there're two ERRORs ([1,2] above)..  I collected some more info
> here before the log expires:

> So.. it's the same test (multifd/tcp/plain/cancel) that is failing on
> the different hosts / arches being tested.  What is weirder is that the
> two failures are different: the 2nd failure throws a TLS error even
> though the test doesn't have TLS involved yet.
>
> Fabiano, is this the issue you're looking at?
>
> Peter, do you think it would be helpful if we temporarily mark this test
> as "slow" too so it's not run in CI (we would still run it ourselves when
> preparing a migration PR, in the hope that it reproduces)?

If you think that specific test is flaky then I think that's
probably a good idea. As usual with this kind of thing,
probably best to have a comment next to the test noting
why and with a URL to a gitlab issue for it, so we don't
forget why we disabled the test.

-- PMM
