Peter Xu <pet...@redhat.com> writes: > On Mon, Sep 09, 2024 at 06:54:46PM -0300, Fabiano Rosas wrote: >> Peter Xu <pet...@redhat.com> writes: >> >> > On Mon, Sep 09, 2024 at 03:02:57PM +0100, Peter Maydell wrote: >> >> On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.hu...@smartx.com> wrote: >> >> > >> >> > Despite the fact that the responsive CPU throttle is enabled, >> >> > the dirty sync count may not always increase because this is >> >> > an optimization that might not happen in any situation. >> >> > >> >> > This test case just making sure it doesn't interfere with any >> >> > current functionality. >> >> > >> >> > Signed-off-by: Hyman Huang <yong.hu...@smartx.com> >> >> >> >> tests/qtest/migration-test already runs 75 different >> >> subtests, takes up a massive chunk of our "make check" >> >> time, and is very commonly a "times out" test on some >> >> of our CI jobs. It runs on five different guest CPU >> >> architectures, each one of which takes between 2 and >> >> 5 minutes to complete the full migration-test. >> >> >> >> Do we really need to make it even bigger? >> > >> > I'll try to find some time in the next few weeks looking into this to see >> > whether we can further shrink migration test times after previous attemps >> > from Dan. At least a low hanging fruit is we should indeed put some more >> > tests into g_test_slow(), and this new test could also be a candidate (then >> > we can run "-m slow" for migration PRs only). >> >> I think we could (using -m slow or any other method) separate tests >> that are generic enough that every CI run should benefit from them >> vs. tests that are only useful once someone starts touching migration >> code. I'd say very few in the former category and most of them in the >> latter. >> >> For an idea of where migration bugs lie, I took a look at what was >> fixed since 2022: >> >> # bugs | device/subsystem/arch >> ---------------------------------- >> 54 | migration >> 10 | vfio >> 6 | ppc >> 3 | virtio-gpu >> 2 | pcie_sriov, tpm_emulator, >> vdpa, virtio-rng-pci >> 1 | arm, block, gpio, lasi, >> pci, s390, scsi-disk, >> virtio-mem, TCG > > Just curious; how did you collect these?
git log --since=2022 and then squinted at it. I wrote a warning to take this with a grain of salt, but it missed the final version. > >> >> From these, ignoring the migration bugs, the migration-tests cover some >> of: arm, ppc, s390, TCG. The device_opts[1] patch hasn't merged yet, but >> once it is, then virtio-gpu would be covered and we could investigate >> adding some of the others. >> >> For actual migration code issues: >> >> # bugs | (sub)subsystem | kind >> ---------------------------------------------- >> 13 | multifd | correctness/races >> 8 | ram | correctness >> 8 | rdma: | general programming > > 8 rdma bugs??? ouch.. Mostly caught by a cleanup from Markus. Silly stuff like of mixed signed integer comparisons and bugs in error handling. I don't even want to look too much at it. ...hopefully this release we'll manage to resolve that situation. > >> 7 | qmp | new api bugs >> 5 | postcopy | races >> 4 | file: | leaks >> 3 | return path | races >> 3 | fd_cleanup | races >> 2 | savevm, aio/coroutines >> 1 | xbzrle, colo, dirtyrate, exec:, >> windows, iochannel, qemufile, >> arch (ppc64le) >> >> Here, the migration-tests cover well: multifd, ram, qmp, postcopy, >> file, rp, fd_cleanup, iochannel, qemufile, xbzrle. >> >> My suggestion is we run per arch: >> >> "/precopy/tcp/plain" >> "/precopy/tcp/tls/psk/match", >> "/postcopy/plain" >> "/postcopy/preempt/plain" >> "/postcopy/preempt/recovery/plain" >> "/multifd/tcp/plain/cancel" >> "/multifd/tcp/uri/plain/none" > > Don't you want to still keep a few multifd / file tests? Not really, but I won't object if you want to add some more in there. To be honest, I want to get out of people's way as much as I can because having to revisit this every couple of months is stressful to me. My rationale for those is: "/precopy/tcp/plain": Smoke test, the most common migration "/precopy/tcp/tls/psk/match": Something might change in the distro regarding tls. Such as: https://gitlab.com/qemu-project/qemu/-/issues/1937 "/postcopy/plain": Smoke test for postcopy "/postcopy/preempt/plain": Just to exercise the preempt thread "/postcopy/preempt/recovery/plain": Recovery has had some issues with races in the past "/multifd/tcp/plain/cancel": The MVP of catching concurrency issues "/multifd/tcp/uri/plain/none": Smoke test for multifd All in all, these will test basic funcionality and run very often. The more tests we add to this set, the less return we get in relation to the time they take. > > IIUC some file ops can still be relevant to archs. Multifd still has one > bug that can only reproduce on arm64.. but not x86_64. I remember it's a > race condition when migration finishes, and the issue could be memory > ordering relevant, but maybe not. I'm not aware of anything. I believe the last arm64 bug we had was the threadinfo stuff[1]. If you remember what it's about, let me know. 1- 01ec0f3a92 ("migration/multifd: Protect accesses to migration_threads"). > >> >> and x86 gets extra: >> >> "/precopy/unix/suspend/live" >> "/precopy/unix/suspend/notlive" >> "/dirty_ring" > > dirty ring will be disabled anyway when !x86, so probably not a major > concern. > >> >> (the other dirty_* tests are too slow) > > These are the 10 slowest tests when I run locally: > > /x86_64/migration/multifd/tcp/tls/x509/allow-anon-client 2.41 > /x86_64/migration/postcopy/recovery/plain 2.43 > /x86_64/migration/multifd/tcp/tls/x509/default-host 2.66 > /x86_64/migration/multifd/tcp/tls/x509/override-host 2.86 > /x86_64/migration/postcopy/tls/psk 2.91 > /x86_64/migration/postcopy/preempt/recovery/tls/psk 3.08 > /x86_64/migration/postcopy/preempt/tls/psk 3.30 > /x86_64/migration/postcopy/recovery/tls/psk 3.81 > /x86_64/migration/vcpu_dirty_limit 13.29 > /x86_64/migration/precopy/unix/xbzrle 27.55 > > Are you aware of people using xbzrle at all? Nope. > >> >> All the rest go behind a knob that people touching migration code will >> enable. >> >> wdyt? > > Agree with the general idea, but I worry above exact list can be too small. We won't stop running the full set of tests. We can run them in our forks' CI as much as we want. There are no cases of people reporting a migration bug because their 'make check' caught something that ours didn't. Besides, the main strength of CI is to catch bugs when someone makes a code change. If people touch migration code, then we'll run it in our CI anyway. If they touch device code and that device is migrated by default then any one of the simple tests will catch the issue when it runs via the migration-compat job. If the device is not enabled by default, then no tests will catch it. The worst case scenario is they touch some code completely unrelated and their 'make check' or CI run breaks because of some race in the migration code. That's not what CI is for, that's just an annoyance for everyone. I'd rather it breaks in our hands and then we'll go fix it. > > IMHO we can definitely, at least, move the last two into slow list > (vcpu_dirty_limit and xbzrle), then it'll already save us 40sec each run.. Agreed. I'll send a patch once I get out from under downstream stuff. > >> >> 1- allows adding devices to QEMU cmdline for migration-test >> https://lore.kernel.org/r/20240523201922.28007-4-faro...@suse.de >>