On Tue, Sep 10, 2024 at 07:23:43PM -0300, Fabiano Rosas wrote:
> Peter Xu <pet...@redhat.com> writes:
> 
> > On Mon, Sep 09, 2024 at 06:54:46PM -0300, Fabiano Rosas wrote:
> >> Peter Xu <pet...@redhat.com> writes:
> >> 
> >> > On Mon, Sep 09, 2024 at 03:02:57PM +0100, Peter Maydell wrote:
> >> >> On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.hu...@smartx.com> wrote:
> >> >> >
> >> >> > Despite the fact that the responsive CPU throttle is enabled,
> >> >> > the dirty sync count may not always increase because this is
> >> >> > an optimization that might not happen in any situation.
> >> >> >
> >> >> > This test case is just making sure it doesn't interfere with any
> >> >> > current functionality.
> >> >> >
> >> >> > Signed-off-by: Hyman Huang <yong.hu...@smartx.com>
> >> >> 
> >> >> tests/qtest/migration-test already runs 75 different
> >> >> subtests, takes up a massive chunk of our "make check"
> >> >> time, and is very commonly a "times out" test on some
> >> >> of our CI jobs. It runs on five different guest CPU
> >> >> architectures, each one of which takes between 2 and
> >> >> 5 minutes to complete the full migration-test.
> >> >> 
> >> >> Do we really need to make it even bigger?
> >> > 
> >> > I'll try to find some time in the next few weeks looking into this to see
> >> > whether we can further shrink migration test times after previous attempts
> >> > from Dan.  At least a low-hanging fruit is that we should indeed put some
> >> > more tests into g_test_slow(), and this new test could also be a candidate
> >> > (then we can run "-m slow" for migration PRs only).
> >> 
> >> I think we could (using -m slow or any other method) separate tests
> >> that are generic enough that every CI run should benefit from them
> >> vs. tests that are only useful once someone starts touching migration
> >> code. I'd say very few in the former category and most of them in the
> >> latter.
> >> 
> >> For an idea of where migration bugs lie, I took a look at what was
> >> fixed since 2022:
> >> 
> >>  # bugs | device/subsystem/arch
> >>  ----------------------------------
> >>      54 | migration
> >>      10 | vfio
> >>       6 | ppc
> >>       3 | virtio-gpu
> >>       2 | pcie_sriov, tpm_emulator,
> >>           vdpa, virtio-rng-pci
> >>       1 | arm, block, gpio, lasi,
> >>           pci, s390, scsi-disk,
> >>           virtio-mem, TCG
> > 
> > Just curious; how did you collect these?
> 
> git log --since=2022 and then squinted at it. I wrote a warning to take
> this with a grain of salt, but it missed the final version.
> 
> > 
> >> 
> >> From these, ignoring the migration bugs, the migration-tests cover some
> >> of: arm, ppc, s390, TCG. The device_opts[1] patch hasn't merged yet, but
> >> once it is, then virtio-gpu would be covered and we could investigate
> >> adding some of the others.
> >> 
> >> For actual migration code issues:
> >> 
> >>  # bugs | (sub)subsystem | kind
> >>  ----------------------------------------------
> >>      13 | multifd        | correctness/races
> >>       8 | ram            | correctness
> >>       8 | rdma:          | general programming
> > 
> > 8 rdma bugs??? ouch..
> 
> Mostly caught by a cleanup from Markus. Silly stuff like mixed signed
> integer comparisons and bugs in error handling. I don't even want to
> look too much at it.

...hopefully this release we'll manage to resolve that situation.
> 
> > 
> >>       7 | qmp            | new api bugs
> >>       5 | postcopy       | races
> >>       4 | file:          | leaks
> >>       3 | return path    | races
> >>       3 | fd_cleanup     | races
> >>       2 | savevm, aio/coroutines
> >>       1 | xbzrle, colo, dirtyrate, exec:,
> >>           windows, iochannel, qemufile,
> >>           arch (ppc64le)
> >> 
> >> Here, the migration-tests cover well: multifd, ram, qmp, postcopy,
> >> file, rp, fd_cleanup, iochannel, qemufile, xbzrle.
> >> 
> >> My suggestion is we run per arch:
> >> 
> >> "/precopy/tcp/plain"
> >> "/precopy/tcp/tls/psk/match"
> >> "/postcopy/plain"
> >> "/postcopy/preempt/plain"
> >> "/postcopy/preempt/recovery/plain"
> >> "/multifd/tcp/plain/cancel"
> >> "/multifd/tcp/uri/plain/none"
> > 
> > Don't you want to still keep a few multifd / file tests?
> 
> Not really, but I won't object if you want to add some more in there. To
> be honest, I want to get out of people's way as much as I can because
> having to revisit this every couple of months is stressful to me.

I hope migration tests are not too obstructive so far :) - this discussion
is still within a context where we might add one more slow test in CI, and
we probably simply should make it a -m slow test.
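For concreteness, a minimal sketch of what that gating could look like in
migration-test.c's registration code, assuming glib's g_test_slow() check
and the migration_test_add() helper; the test path and function name here
are just illustrative:

    /*
     * Register the expensive subtest only when the harness runs in
     * slow mode; quick runs (the "make check" default) skip it.
     */
    if (g_test_slow()) {
        migration_test_add("/migration/vcpu_dirty_limit",
                           test_vcpu_dirty_limit);
    }

The slow set could then be exercised on demand by running the binary
directly, along the lines of "QTEST_QEMU_BINARY=./qemu-system-x86_64
./tests/qtest/migration-test -m slow".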
> 
> My rationale for those is:
> 
> "/precopy/tcp/plain":
>   Smoke test, the most common migration

E.g. unix is missing, and I'm not sure which one we use for e.g. kubevirt.

And note that even if unix shares the socket iochannel with tcp, it may not
work the same.  For example, if you remember, I mentioned I was looking at
threadifying the dest qemu receive side; I have a draft there, but currently
it has a bug that hangs a unix migration, not tcp..

So IMHO it's not easy to justify which ones we can simply drop, if we still
want to provide some good gating in CI.  And I won't be 100% surprised if
some other non-migration PR (e.g. io/) changed some slight bit so that unix
is broken while tcp keeps working, for example; then we lose some CI
benefits.

> 
> "/precopy/tcp/tls/psk/match":
>   Something might change in the distro regarding tls. Such as:
>   https://gitlab.com/qemu-project/qemu/-/issues/1937
> 
> "/postcopy/plain":
>   Smoke test for postcopy
> 
> "/postcopy/preempt/plain":
>   Just to exercise the preempt thread
> 
> "/postcopy/preempt/recovery/plain":
>   Recovery has had some issues with races in the past
> 
> "/multifd/tcp/plain/cancel":
>   The MVP of catching concurrency issues
> 
> "/multifd/tcp/uri/plain/none":
>   Smoke test for multifd
> 
> All in all, these will test basic functionality and run very often. The
> more tests we add to this set, the less return we get in relation to the
> time they take.

This is true.  We can try to discuss more on which tests are more
important; I still think something might be good to add on top of the
above.

There's also the other way - at some point, I still want to improve
migration-test run speed (please have a look too at some point, if you
like): there's still a chance to make each test faster (the average is
~2sec as of now), and then, IMHO, we don't lose anything in CI but it
simply runs faster.

> 
> > 
> > IIUC some file ops can still be relevant to archs.  Multifd still has one
> > bug that can only reproduce on arm64.. but not x86_64.  I remember it's a
> > race condition when migration finishes, and the issue could be memory
> > ordering relevant, but maybe not.
> 
> I'm not aware of anything. I believe the last arm64 bug we had was the
> threadinfo stuff[1]. If you remember what it's about, let me know.
> 
> 1- 01ec0f3a92 ("migration/multifd: Protect accesses to migration_threads").

https://issues.redhat.com/browse/RHEL-45628

On the RH side Bandan is looking at it, but he's been on a long holiday
recently.

> 
> > 
> >> 
> >> and x86 gets extra:
> >> 
> >> "/precopy/unix/suspend/live"
> >> "/precopy/unix/suspend/notlive"
> >> "/dirty_ring"
> > 
> > dirty ring will be disabled anyway when !x86, so probably not a major
> > concern.
> > 
> >> 
> >> (the other dirty_* tests are too slow)
> > 
> > These are the 10 slowest tests when I run locally:
> > 
> > /x86_64/migration/multifd/tcp/tls/x509/allow-anon-client     2.41
> > /x86_64/migration/postcopy/recovery/plain                    2.43
> > /x86_64/migration/multifd/tcp/tls/x509/default-host          2.66
> > /x86_64/migration/multifd/tcp/tls/x509/override-host         2.86
> > /x86_64/migration/postcopy/tls/psk                           2.91
> > /x86_64/migration/postcopy/preempt/recovery/tls/psk          3.08
> > /x86_64/migration/postcopy/preempt/tls/psk                   3.30
> > /x86_64/migration/postcopy/recovery/tls/psk                  3.81
> > /x86_64/migration/vcpu_dirty_limit                          13.29
> > /x86_64/migration/precopy/unix/xbzrle                       27.55
> > 
> > Are you aware of people using xbzrle at all?
> 
> Nope.
> 
> > 
> >> 
> >> All the rest go behind a knob that people touching migration code will
> >> enable.
> >> 
> >> wdyt?
> > 
> > Agree with the general idea, but I worry the above exact list can be too
> > small.
> 
> We won't stop running the full set of tests. We can run them in our
> forks' CI as much as we want. There are no cases of people reporting a
> migration bug because their 'make check' caught something that ours
> didn't.

IIUC it's hard to say - when the test is in CI, maintainers can already
catch breakages before sending a formal PR.  If the test runs by default in
make check, a developer can trigger a migration issue (with his/her patch
applied), notice that it introduced a bug, fix it, then post the patches.
We won't know when that happened.

So one thing we can do (if you think it worthwhile to do now) is shrink the
default test set a lot, as you proposed, then wait and see what breaks, and
gradually add tests back when they prove able to find breakages.  But that
will also take time if it really can find such tests, because then we'll
need to debug them one by one (instead of the developer / maintainer looking
into them with their expert knowledge..).  I'm not sure whether it's
worthwhile to do now, but I don't feel strongly, as long as we can still
have a reliable set of default test cases.

> 
> Besides, the main strength of CI is to catch bugs when someone makes a
> code change. If people touch migration code, then we'll run it in our CI
> anyway. If they touch device code and that device is migrated by default,
> then any one of the simple tests will catch the issue when it runs via
> the migration-compat job. If the device is not enabled by default, then
> no tests will catch it.
> 
> The worst case scenario is they touch some code completely unrelated and
> their 'make check' or CI run breaks because of some race in the
> migration code. That's not what CI is for, that's just an annoyance for
> everyone. I'd rather it break in our hands and then we'll go fix it.
> 
> > 
> > IMHO we can definitely, at least, move the last two into the slow list
> > (vcpu_dirty_limit and xbzrle), then it'll already save us 40sec each
> > run..
> 
> Agreed. I'll send a patch once I get out from under downstream stuff.
> 
> > 
> >> 
> >> 1- allows adding devices to QEMU cmdline for migration-test
> >>    https://lore.kernel.org/r/20240523201922.28007-4-faro...@suse.de
> >> 

-- 
Peter Xu