On Mon, Sep 09, 2024 at 06:54:46PM -0300, Fabiano Rosas wrote:
> Peter Xu <pet...@redhat.com> writes:
> 
> > On Mon, Sep 09, 2024 at 03:02:57PM +0100, Peter Maydell wrote:
> >> On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.hu...@smartx.com> wrote:
> >> >
> >> > Even though the responsive CPU throttle is enabled, the
> >> > dirty sync count may not always increase, since the
> >> > optimization may not trigger in every situation.
> >> >
> >> > This test case just makes sure it doesn't interfere with any
> >> > current functionality.
> >> >
> >> > Signed-off-by: Hyman Huang <yong.hu...@smartx.com>
> >> 
> >> tests/qtest/migration-test already runs 75 different
> >> subtests, takes up a massive chunk of our "make check"
> >> time, and is very commonly a "times out" test on some
> >> of our CI jobs. It runs on five different guest CPU
> >> architectures, each one of which takes between 2 and
> >> 5 minutes to complete the full migration-test.
> >> 
> >> Do we really need to make it even bigger?
> >
> > I'll try to find some time in the next few weeks to look into this and
> > see whether we can further shrink migration test times after previous
> > attempts from Dan.  At least one low-hanging fruit is that we should
> > indeed put some more tests behind g_test_slow(), and this new test could
> > also be a candidate (then we can run "-m slow" for migration PRs only).
> 
> I think we could (using -m slow or any other method) separate tests
> that are generic enough that every CI run should benefit from them
> vs. tests that are only useful once someone starts touching migration
> code. I'd say very few in the former category and most of them in the
> latter.
> 
> For an idea of where migration bugs lie, I took a look at what was
> fixed since 2022:
> 
> # bugs | device/subsystem/arch
> ----------------------------------
>     54 | migration
>     10 | vfio
>      6 | ppc
>      3 | virtio-gpu
>      2 | pcie_sriov, tpm_emulator,
>           vdpa, virtio-rng-pci
>      1 | arm, block, gpio, lasi,
>           pci, s390, scsi-disk,
>           virtio-mem, TCG

Just curious; how did you collect these?

> 
> From these, ignoring the migration bugs, the migration-tests cover some
> of: arm, ppc, s390, TCG. The device_opts[1] patch hasn't been merged yet,
> but once it is, virtio-gpu will be covered and we could investigate
> adding some of the others.
> 
> For actual migration code issues:
> 
> # bugs | (sub)subsystem | kind
> ----------------------------------------------
>     13 | multifd        | correctness/races
>      8 | ram            | correctness
>      8 | rdma:          | general programming

8 rdma bugs??? ouch..

>      7 | qmp            | new api bugs
>      5 | postcopy       | races
>      4 | file:          | leaks
>      3 | return path    | races
>      3 | fd_cleanup     | races
>      2 | savevm, aio/coroutines
>      1 | xbzrle, colo, dirtyrate, exec:,
>           windows, iochannel, qemufile,
>           arch (ppc64le)
> 
> Here, the migration-tests cover well: multifd, ram, qmp, postcopy,
> file, rp, fd_cleanup, iochannel, qemufile, xbzrle.
> 
> My suggestion is we run per arch:
> 
> "/precopy/tcp/plain"
> "/precopy/tcp/tls/psk/match",
> "/postcopy/plain"
> "/postcopy/preempt/plain"
> "/postcopy/preempt/recovery/plain"
> "/multifd/tcp/plain/cancel"
> "/multifd/tcp/uri/plain/none"

Don't you want to still keep a few multifd / file tests?

IIUC some file ops can still be relevant to some archs.  Multifd still has
one bug that can only be reproduced on arm64, not x86_64.  I remember it's
a race condition when migration finishes, and the issue could be memory
ordering related, but maybe not.
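
(Not the actual multifd bug, just the generic shape of a weak-memory-order
race, to illustrate why such a thing would reproduce on arm64 but hide
behind x86's stronger store ordering; the producer/consumer names below
are made up:)

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>

  static int payload;
  static atomic_bool ready;

  static void *producer(void *arg)
  {
      payload = 42;
      /* With a plain store instead of release semantics, arm64 may make
       * "ready" visible before "payload"; x86 TSO mostly hides that. */
      atomic_store_explicit(&ready, true, memory_order_release);
      return NULL;
  }

  int main(void)
  {
      pthread_t t;
      pthread_create(&t, NULL, producer, NULL);
      while (!atomic_load_explicit(&ready, memory_order_acquire)) {
          /* spin until the release store is visible */
      }
      printf("payload = %d\n", payload); /* acq/rel guarantees 42 */
      pthread_join(t, NULL);
      return 0;
  }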

> 
> and x86 gets extra:
> 
> "/precopy/unix/suspend/live"
> "/precopy/unix/suspend/notlive"
> "/dirty_ring"

dirty ring will be disabled anyway when !x86, so probably not a major
concern.
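
(For the record, that gating can be a plain runtime KVM capability probe;
a minimal sketch of such a check, assuming direct use of
KVM_CHECK_EXTENSION, which may differ from what migration-test.c actually
does:)

  #include <fcntl.h>
  #include <stdbool.h>
  #include <sys/ioctl.h>
  #include <unistd.h>
  #include <linux/kvm.h>

  /* Returns true only when the host KVM exposes dirty-ring support;
   * on !x86 hosts (or old kernels) the capability probe returns 0. */
  static bool kvm_dirty_ring_supported(void)
  {
      int fd = open("/dev/kvm", O_RDONLY);
      int ret;

      if (fd < 0) {
          return false;
      }
      ret = ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_DIRTY_LOG_RING);
      close(fd);
      return ret > 0;
  }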

> 
> (the other dirty_* tests are too slow)

These are the 10 slowest tests when I run locally (times in seconds):

/x86_64/migration/multifd/tcp/tls/x509/allow-anon-client 2.41
/x86_64/migration/postcopy/recovery/plain 2.43
/x86_64/migration/multifd/tcp/tls/x509/default-host 2.66
/x86_64/migration/multifd/tcp/tls/x509/override-host 2.86
/x86_64/migration/postcopy/tls/psk 2.91
/x86_64/migration/postcopy/preempt/recovery/tls/psk 3.08
/x86_64/migration/postcopy/preempt/tls/psk 3.30
/x86_64/migration/postcopy/recovery/tls/psk 3.81
/x86_64/migration/vcpu_dirty_limit 13.29
/x86_64/migration/precopy/unix/xbzrle 27.55

Are you aware of people using xbzrle at all?

> 
> All the rest go behind a knob that people touching migration code will
> enable.
> 
> wdyt?

Agree with the general idea, but I worry the exact list above may be too small.

IMHO we can definitely, at least, move the last two (vcpu_dirty_limit and
xbzrle) into the slow list; that alone would already save us ~40sec each run.
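
Roughly like this (an untested sketch; assuming the plain qtest_add_func()
registration, while migration-test.c may use its own wrapper and slightly
different test/function names):

  if (g_test_slow()) {
      /* Only registered when running with "-m slow", e.g. for
       * migration pull requests, not on every "make check". */
      qtest_add_func("/migration/precopy/unix/xbzrle",
                     test_precopy_unix_xbzrle);
      qtest_add_func("/migration/vcpu_dirty_limit",
                     test_vcpu_dirty_limit);
  }

IIRC "make check-qtest SPEED=slow" (or running the test binary with
"-m slow") would then still pick them up.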

> 
> 1- allows adding devices to QEMU cmdline for migration-test
> https://lore.kernel.org/r/20240523201922.28007-4-faro...@suse.de
> 

-- 
Peter Xu

