On Mon, Sep 09, 2024 at 06:54:46PM -0300, Fabiano Rosas wrote:
> Peter Xu <pet...@redhat.com> writes:
>
> > On Mon, Sep 09, 2024 at 03:02:57PM +0100, Peter Maydell wrote:
> >> On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.hu...@smartx.com> wrote:
> >> >
> >> > Despite the fact that the responsive CPU throttle is enabled,
> >> > the dirty sync count may not always increase because this is
> >> > an optimization that might not happen in any situation.
> >> >
> >> > This test case is just making sure it doesn't interfere with any
> >> > current functionality.
> >> >
> >> > Signed-off-by: Hyman Huang <yong.hu...@smartx.com>
> >>
> >> tests/qtest/migration-test already runs 75 different
> >> subtests, takes up a massive chunk of our "make check"
> >> time, and is very commonly a "times out" test on some
> >> of our CI jobs. It runs on five different guest CPU
> >> architectures, each one of which takes between 2 and
> >> 5 minutes to complete the full migration-test.
> >>
> >> Do we really need to make it even bigger?
> >
> > I'll try to find some time in the next few weeks looking into this to see
> > whether we can further shrink migration test times after previous attempts
> > from Dan.  At least a low-hanging fruit is that we should indeed put some
> > more tests into g_test_slow(), and this new test could also be a candidate
> > (then we can run "-m slow" for migration PRs only).
>
> I think we could (using -m slow or any other method) separate tests
> that are generic enough that every CI run should benefit from them
> vs. tests that are only useful once someone starts touching migration
> code. I'd say very few are in the former category and most of them are
> in the latter.
>
> For an idea of where migration bugs lie, I took a look at what was
> fixed since 2022:
>
>  # bugs | device/subsystem/arch
>  ----------------------------------
>      54 | migration
>      10 | vfio
>       6 | ppc
>       3 | virtio-gpu
>       2 | pcie_sriov, tpm_emulator,
>         | vdpa, virtio-rng-pci
>       1 | arm, block, gpio, lasi,
>         | pci, s390, scsi-disk,
>         | virtio-mem, TCG
Just curious; how did you collect these?

> From these, ignoring the migration bugs, the migration-tests cover some
> of: arm, ppc, s390, TCG. The device_opts[1] patch hasn't merged yet, but
> once it is, then virtio-gpu would be covered and we could investigate
> adding some of the others.
>
> For actual migration code issues:
>
>  # bugs | (sub)subsystem | kind
>  ----------------------------------------------
>      13 | multifd        | correctness/races
>       8 | ram            | correctness
>       8 | rdma:          | general programming

8 rdma bugs???  ouch..

>       7 | qmp            | new api bugs
>       5 | postcopy       | races
>       4 | file:          | leaks
>       3 | return path    | races
>       3 | fd_cleanup     | races
>       2 | savevm, aio/coroutines
>       1 | xbzrle, colo, dirtyrate, exec:,
>         | windows, iochannel, qemufile,
>         | arch (ppc64le)
>
> Here, the migration-tests cover well: multifd, ram, qmp, postcopy,
> file, rp, fd_cleanup, iochannel, qemufile, xbzrle.
>
> My suggestion is we run per arch:
>
>   "/precopy/tcp/plain"
>   "/precopy/tcp/tls/psk/match"
>   "/postcopy/plain"
>   "/postcopy/preempt/plain"
>   "/postcopy/preempt/recovery/plain"
>   "/multifd/tcp/plain/cancel"
>   "/multifd/tcp/uri/plain/none"

Don't you want to still keep a few multifd / file tests?  IIUC some file
ops can still be relevant to some archs.

Multifd still has one bug that can only reproduce on arm64, but not on
x86_64.  I remember it's a race condition when migration finishes; the
issue could be memory-ordering relevant, but maybe not.

> and x86 gets extra:
>
>   "/precopy/unix/suspend/live"
>   "/precopy/unix/suspend/notlive"
>   "/dirty_ring"

The dirty ring will be disabled anyway when !x86, so probably not a
major concern.
>
> (the other dirty_* tests are too slow)

These are the 10 slowest tests when I run locally:

  /x86_64/migration/multifd/tcp/tls/x509/allow-anon-client    2.41
  /x86_64/migration/postcopy/recovery/plain                   2.43
  /x86_64/migration/multifd/tcp/tls/x509/default-host         2.66
  /x86_64/migration/multifd/tcp/tls/x509/override-host        2.86
  /x86_64/migration/postcopy/tls/psk                          2.91
  /x86_64/migration/postcopy/preempt/recovery/tls/psk         3.08
  /x86_64/migration/postcopy/preempt/tls/psk                  3.30
  /x86_64/migration/postcopy/recovery/tls/psk                 3.81
  /x86_64/migration/vcpu_dirty_limit                         13.29
  /x86_64/migration/precopy/unix/xbzrle                      27.55

Are you aware of people using xbzrle at all?

> All the rest go behind a knob that people touching migration code will
> enable.
>
> wdyt?

I agree with the general idea, but I worry the exact list above may be
too small.

IMHO we can definitely, at least, move the last two (vcpu_dirty_limit
and xbzrle) into the slow list; that alone would already save us ~40
seconds per run.

> 1- allows adding devices to the QEMU cmdline for migration-test
>    https://lore.kernel.org/r/20240523201922.28007-4-faro...@suse.de

-- 
Peter Xu