On Thu, Sep 12, 2024 at 04:14:20PM +0100, Peter Maydell wrote:
> On Thu, 12 Sept 2024 at 16:09, Peter Xu <pet...@redhat.com> wrote:
> >
> > On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote:
> > > On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <faro...@suse.de> wrote:
> > > > I don't think we're discussing total CI time at this point, so the math
> > > > doesn't really add up. We're not looking into making the CI finish
> > > > faster. We're looking into making migration-test finish faster. That
> > > > would reduce timeouts in CI, speed-up make check and reduce the chance
> > > > of random race conditions* affecting other people/staging runs.
> > >
> > > Right. The reason migration-test appears on my radar is because
> > > it is very frequently the thing that shows up as "this sometimes
> > > just fails or just times out and if you hit retry it goes away
> > > again". That might not be migration-test's fault specifically,
> > > because those retries tend to be certain CI configs (s390,
> > > the i686-tci one), and I have some theories about what might be
> > > causing it (e.g. build system runs 4 migration-tests in parallel,
> > > which means 8 QEMU processes which is too many for the number
> > > of host CPUs). But right now I look at CI job failures and my reaction
> > > is "oh, it's the migration-test failing yet again" :-(
> > >
> > > For some examples from this week:
> > >
> > > https://gitlab.com/qemu-project/qemu/-/jobs/7802183144
> > > https://gitlab.com/qemu-project/qemu/-/jobs/7799842373  <--------[1]
> > > https://gitlab.com/qemu-project/qemu/-/jobs/7786579152  <--------[2]
> > > https://gitlab.com/qemu-project/qemu/-/jobs/7786579155
> >
> > Ah right, the TIMEOUT is unfortunate, especially if tests can be run in
> > parallel.  It indeed sounds like there's no good way to solve it for good.
> > I also don't see how speeding up / reducing the tests in migration-test
> > would help, as that is (to some degree) the same as making the timeout
> > value bigger.  With fewer tests it'll fit into the 480s window, but then it
> > may be so quick that we wonder whether we should shrink the timeout to
> > e.g. 90s, and then it can time out again on a busy host with less capacity
> > for concurrency.
> 
> For the TIMEOUT part on cross-i686-tci I plan to try this patch:
> https://patchew.org/QEMU/20240912151003.2045031-1-peter.mayd...@linaro.org/
> which makes 'make check' single-threaded; that will help to see
> if the parallelism is a problem. (If it is then we might want
> to do a more generalised approach rather than just for that one
> CI job.)

Sounds good.

> 
> > But indeed there're two ERRORs ([1,2] above)..  I collected some more info
> > here before the log expires:
> 
> > So.. it's the same test (multifd/tcp/plain/cancel) that is failing on
> > the different hosts / archs being tested.  What is weirder is that the two
> > failures are different: the 2nd failure throws a TLS error even though the
> > test doesn't have TLS involved yet.
> >
> > Fabiano, is this the issue you're looking at?
> >
> > Peter, do you think it would be helpful if we temporarily marked this test
> > as "slow" too, so it's not run in CI (we would still run it ourselves when
> > preparing the migration PR, in the hope that it reproduces)?
> 
> If you think that specific test is flaky then I think that's
> probably a good idea. As usual with this kind of thing,
> probably best to have a comment next to the test noting
> why and with a URL to a gitlab issue for it, so we don't
> forget why we disabled the test.

Looks like Fabiano root-caused the issue.  We'll see how that goes; otherwise
we can prepare a patch to make the test optional, with the comment in place.

Thanks,

-- 
Peter Xu

