Peter Maydell <peter.mayd...@linaro.org> writes:

> On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <faro...@suse.de> wrote:
>> I don't think we're discussing total CI time at this point, so the math
>> doesn't really add up. We're not looking into making the CI finish
>> faster. We're looking into making migration-test finish faster. That
>> would reduce timeouts in CI, speed-up make check and reduce the chance
>> of random race conditions* affecting other people/staging runs.
>
> Right. The reason migration-test appears on my radar is because
> it is very frequently the thing that shows up as "this sometimes
> just fails or just times out and if you hit retry it goes away
> again". That might not be migration-test's fault specifically,
> because those retries tend to be certain CI configs (s390,
> the i686-tci one), and I have some theories about what might be
> causing it (e.g. build system runs 4 migration-tests in parallel,
> which means 8 QEMU processes which is too many for the number
> of host CPUs). But right now I look at CI job failures and my reaction
> is "oh, it's the migration-test failing yet again" :-(

And then I go: "oh, people complaining about migration-test again, I
thought we had fixed all the issues this time". It's frustrating for
everyone, as I said previously.

>
> For some examples from this week:
>
> https://gitlab.com/qemu-project/qemu/-/jobs/7802183144
> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373
> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152
> https://gitlab.com/qemu-project/qemu/-/jobs/7786579155

About these:

There are 2 instances of plain old SIGSEGV here. Both happen in
non-x86_64 runs and on the /multifd/tcp/plain/cancel test, which means
they're either races or memory ordering issues. Having i386 crash
points to the former, since x86 is strongly ordered. So the CI being
loaded and causing timeouts is probably what exposed the issue.

The thread is mig/dst/recv_7 and grepping the objdump output shows:
<set_bit_atomic> 55 48 89 e5 48 89 7d e8 48 89 75 e0 48 8b 45 e8 83 e0
3f ba 01 00 00 00 89 c1 48 d3 e2 48 89 d0 48 89 45 f0 48 8b 45 e8 48 c1
e8 06 48 8d 14 c5 00 00 00 00 48 8b 45 e0 48 01 d0 48 89 45 f8 48 8b 45
f8 48 8b 55 f0 <f0> 48 09 10 90 5d c3
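
Decoding those bytes by hand, for reference: after the prologue the
code computes mask = 1 << (nr & 63) and a word offset of (nr >> 6) * 8,
and the instruction marked <f0> is "lock or %rdx,(%rax)", i.e. the
atomic OR into the bitmap word. A rough C equivalent is sketched below.
The names are hypothetical, and it uses the GCC/Clang __atomic builtins
as an assumption rather than QEMU's own atomic helpers, so treat it as
an illustration and not the actual source:

```c
#include <limits.h>

/* Hypothetical names for illustration; not QEMU's actual macros. */
#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Atomically set bit 'nr' in the bitmap starting at 'addr'.
 * Assumes the GCC/Clang __atomic builtins are available.
 */
static inline void sketch_set_bit_atomic(unsigned long nr, unsigned long *addr)
{
    /* Bit position inside the word: "and $0x3f" on a 64-bit host. */
    unsigned long mask = 1UL << (nr % SKETCH_BITS_PER_LONG);
    /* Word holding the bit: "shr $6" plus the scaled add. */
    unsigned long *p = addr + (nr / SKETCH_BITS_PER_LONG);

    /* The "lock or": this store faults if p points at unmapped memory. */
    __atomic_fetch_or(p, mask, __ATOMIC_SEQ_CST);
}
```

If that reading is right, the SIGSEGV would be the store through the
computed word pointer, which would fit the bitmap being freed or never
set up while the recv thread is still running.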

I tried a bisect overnight, but it seems the issue has been there since
before 9.0. I'll try to repro with gdb attached or get a core dump.
