Peter Xu <pet...@redhat.com> writes:

> On Tue, Sep 10, 2024 at 07:23:43PM -0300, Fabiano Rosas wrote:
>> Peter Xu <pet...@redhat.com> writes:
>>
>> > On Mon, Sep 09, 2024 at 06:54:46PM -0300, Fabiano Rosas wrote:
>> >> Peter Xu <pet...@redhat.com> writes:
>> >>
>> >> > On Mon, Sep 09, 2024 at 03:02:57PM +0100, Peter Maydell wrote:
>> >> >> On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.hu...@smartx.com>
>> >> >> wrote:
>> >> >> >
>> >> >> > Despite the fact that the responsive CPU throttle is enabled,
>> >> >> > the dirty sync count may not always increase because this is
>> >> >> > an optimization that might not happen in any situation.
>> >> >> >
>> >> >> > This test case is just making sure it doesn't interfere with any
>> >> >> > current functionality.
>> >> >> >
>> >> >> > Signed-off-by: Hyman Huang <yong.hu...@smartx.com>
>> >> >>
>> >> >> tests/qtest/migration-test already runs 75 different
>> >> >> subtests, takes up a massive chunk of our "make check"
>> >> >> time, and is very commonly a "times out" test on some
>> >> >> of our CI jobs. It runs on five different guest CPU
>> >> >> architectures, each one of which takes between 2 and
>> >> >> 5 minutes to complete the full migration-test.
>> >> >>
>> >> >> Do we really need to make it even bigger?
>> >> >
>> >> > I'll try to find some time in the next few weeks looking into this
>> >> > to see whether we can further shrink migration test times after
>> >> > previous attempts from Dan. At least a low hanging fruit is we
>> >> > should indeed put some more tests into g_test_slow(), and this new
>> >> > test could also be a candidate (then we can run "-m slow" for
>> >> > migration PRs only).
>> >>
>> >> I think we could (using -m slow or any other method) separate tests
>> >> that are generic enough that every CI run should benefit from them
>> >> vs. tests that are only useful once someone starts touching migration
>> >> code.
>> >> I'd say very few in the former category and most of them in the
>> >> latter.
>> >>
>> >> For an idea of where migration bugs lie, I took a look at what was
>> >> fixed since 2022:
>> >>
>> >> # bugs | device/subsystem/arch
>> >> ----------------------------------
>> >>    54  | migration
>> >>    10  | vfio
>> >>     6  | ppc
>> >>     3  | virtio-gpu
>> >>     2  | pcie_sriov, tpm_emulator,
>> >>        | vdpa, virtio-rng-pci
>> >>     1  | arm, block, gpio, lasi,
>> >>        | pci, s390, scsi-disk,
>> >>        | virtio-mem, TCG
>> >
>> > Just curious; how did you collect these?
>>
>> git log --since=2022 and then squinted at it. I wrote a warning to take
>> this with a grain of salt, but it missed the final version.
>>
>> >
>> >>
>> >> From these, ignoring the migration bugs, the migration-tests cover some
>> >> of: arm, ppc, s390, TCG. The device_opts[1] patch hasn't merged yet, but
>> >> once it is, then virtio-gpu would be covered and we could investigate
>> >> adding some of the others.
>> >>
>> >> For actual migration code issues:
>> >>
>> >> # bugs | (sub)subsystem | kind
>> >> ----------------------------------------------
>> >>    13  | multifd        | correctness/races
>> >>     8  | ram            | correctness
>> >>     8  | rdma:          | general programming
>> >
>> > 8 rdma bugs??? ouch..
>>
>> Mostly caught by a cleanup from Markus. Silly stuff like mixed signed
>> integer comparisons and bugs in error handling. I don't even want to
>> look too much at it.
>>
>> ...hopefully this release we'll manage to resolve that situation.
>>
>> >
>> >>     7  | qmp            | new api bugs
>> >>     5  | postcopy       | races
>> >>     4  | file:          | leaks
>> >>     3  | return path    | races
>> >>     3  | fd_cleanup     | races
>> >>     2  | savevm, aio/coroutines
>> >>     1  | xbzrle, colo, dirtyrate, exec:,
>> >>        | windows, iochannel, qemufile,
>> >>        | arch (ppc64le)
>> >>
>> >> Here, the migration-tests cover well: multifd, ram, qmp, postcopy,
>> >> file, rp, fd_cleanup, iochannel, qemufile, xbzrle.
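[Editor's note: the kind of survey described above ("git log --since=2022 and then squinted at it") can be partially mechanized. Below is a hedged sketch of that workflow. Since the real qemu.git tree isn't assumed to be present here, it runs against a throwaway repository with made-up commit subjects, and the `fix` grep is a deliberately crude heuristic, not what produced the table in the thread.]

```shell
# Stand-in repository for qemu.git: fabricate two commits with
# illustrative subjects (one bug fix, one not).
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.name=ed -c user.email=ed@example.com \
    commit -q --allow-empty -m "migration/multifd: fix channel cleanup race"
git -C "$repo" -c user.name=ed -c user.email=ed@example.com \
    commit -q --allow-empty -m "docs/migration: clarify postcopy section"

# Count non-merge commits since 2022 whose subject suggests a bug fix.
count=$(git -C "$repo" log --since=2022-01-01 --oneline --no-merges \
        | grep -ci 'fix')
echo "$count"
```

In a real run one would point `-C` at a QEMU checkout, restrict with a pathspec such as `-- migration/`, and still read the subjects by hand, since "fix" in a subject line is only a rough signal.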
>> >>
>> >> My suggestion is we run per arch:
>> >>
>> >> "/precopy/tcp/plain"
>> >> "/precopy/tcp/tls/psk/match",
>> >> "/postcopy/plain"
>> >> "/postcopy/preempt/plain"
>> >> "/postcopy/preempt/recovery/plain"
>> >> "/multifd/tcp/plain/cancel"
>> >> "/multifd/tcp/uri/plain/none"
>> >
>> > Don't you want to still keep a few multifd / file tests?
>>
>> Not really, but I won't object if you want to add some more in there. To
>> be honest, I want to get out of people's way as much as I can because
>> having to revisit this every couple of months is stressful to me.
>
> I hope migration tests are not too obstructive yet so far :) - this
> discussion is still within a context where we might add one more slow test
> in CI, and we probably simply should make it a -m slow test.
>
>>
>> My rationale for those is:
>>
>> "/precopy/tcp/plain":
>>   Smoke test, the most common migration
>
> E.g. unix is missing, and I'm not sure which we use for e.g. kubevirt.
>
> And note that even if unix shares the socket iochannel with tcp, it may not
> work the same. For example, if you remember I mentioned I was looking at
> threadifying the dest qemu receive side, I have a draft there but currently
> it has a bug to hang a unix migration, not tcp..

Ok, let's add a unix one, no problem.

>
> So IMHO it's not easy to justify which we can simply drop, if we still want
> to provide some good gating in CI.

It's not exactly about dropping, it's about which ones need to be run at
*every* invocation of make check (x5 because of the multiple archs) and at
every git branch push in CI (unless -o ci.skip). For gating we don't need
all the tests. Many of them are testing the same core code with just a few
differences at the margins.

> And I won't be 100% surprised if some
> other non-migration PR (e.g. io/) changed some slight bit that unix is
> broken and tcp keeps working, for example, then we lose some CI benefits.

IMO, these non-migration PRs are exactly what we have to worry about.
Because migration PRs would run with -m slow and we'd catch the issue
there.

>
>>
>> "/precopy/tcp/tls/psk/match":
>>   Something might change in the distro regarding tls. Such as:
>>   https://gitlab.com/qemu-project/qemu/-/issues/1937
>>
>> "/postcopy/plain":
>>   Smoke test for postcopy
>>
>> "/postcopy/preempt/plain":
>>   Just to exercise the preempt thread
>>
>> "/postcopy/preempt/recovery/plain":
>>   Recovery has had some issues with races in the past
>>
>> "/multifd/tcp/plain/cancel":
>>   The MVP of catching concurrency issues
>>
>> "/multifd/tcp/uri/plain/none":
>>   Smoke test for multifd
>>
>> All in all, these will test basic functionality and run very often. The
>> more tests we add to this set, the less return we get in relation to the
>> time they take.
>
> This is true. We can try to discuss more on which is more important; I
> still think something might be good to be added on top of above.

Sure, just add what you think we need.

>
> There's also the other way - at some point, I still want to improve
> migration-test run speed, and please have a look if you like too at some
> point: so there's still chance (average is ~2sec as of now), IMHO, we don't
> lose anything in CI but runs simply faster.

My only idea, but it requires a bit of work, is to do unit testing on the
interfaces. Anything before migration_fd_connect(). Then we could stop
doing a full migration for those.

>
>>
>> >
>> > IIUC some file ops can still be relevant to archs. Multifd still has one
>> > bug that can only reproduce on arm64.. but not x86_64. I remember it's a
>> > race condition when migration finishes, and the issue could be memory
>> > ordering relevant, but maybe not.
>>
>> I'm not aware of anything. I believe the last arm64 bug we had was the
>> threadinfo stuff[1]. If you remember what it's about, let me know.
>>
>> 1- 01ec0f3a92 ("migration/multifd: Protect accesses to migration_threads").
>
> https://issues.redhat.com/browse/RHEL-45628

Interesting. But if multifd/tcp/plain/cancel doesn't catch this I don't
think other tests will. Also, we don't have tests for zerocopy AFAIK.

>
> On RH side Bandan is looking at it, but he's on a long holiday
> recently.

Good luck to him.

>
>>
>> >
>> >>
>> >> and x86 gets extra:
>> >> "/precopy/unix/suspend/live"
>> >> "/precopy/unix/suspend/notlive"
>> >> "/dirty_ring"
>> >
>> > dirty ring will be disabled anyway when !x86, so probably not a major
>> > concern.
>> >
>> >>
>> >> (the other dirty_* tests are too slow)
>> >
>> > These are the 10 slowest tests when I run locally:
>> >
>> > /x86_64/migration/multifd/tcp/tls/x509/allow-anon-client   2.41
>> > /x86_64/migration/postcopy/recovery/plain                  2.43
>> > /x86_64/migration/multifd/tcp/tls/x509/default-host        2.66
>> > /x86_64/migration/multifd/tcp/tls/x509/override-host       2.86
>> > /x86_64/migration/postcopy/tls/psk                         2.91
>> > /x86_64/migration/postcopy/preempt/recovery/tls/psk        3.08
>> > /x86_64/migration/postcopy/preempt/tls/psk                 3.30
>> > /x86_64/migration/postcopy/recovery/tls/psk                3.81
>> > /x86_64/migration/vcpu_dirty_limit                        13.29
>> > /x86_64/migration/precopy/unix/xbzrle                     27.55
>> >
>> > Are you aware of people using xbzrle at all?
>>
>> Nope.
>>
>> >
>> >>
>> >> All the rest go behind a knob that people touching migration code will
>> >> enable.
>> >>
>> >> wdyt?
>> >
>> > Agree with the general idea, but I worry above exact list can be too small.
>>
>> We won't stop running the full set of tests. We can run them in our
>> forks' CI as much as we want. There are no cases of people reporting a
>> migration bug because their 'make check' caught something that ours
>> didn't.
>
> IIUC it's hard to say - when the test is in CI maintainers can catch them
> already before sending a formal PR.
>
> If the test is run by default in make check, a developer can trigger a
> migration issue (with his/her patch applied) then one can notice it
> introduced a bug, fix it, then post the patches. We won't know whether
> that happened.

Good point.

> So one thing we can do (if you think worthwhile to do it now) is we shrink
> the default test case a lot as you proposed, then we wait and see what
> breaks, and then we gradually add tests back when it can be used to find
> breakages. But that'll also take time if it really can find such tests,
> because then we'll need to debug them one by one (instead of developer /
> maintainer looking into them with their expertise knowledge..). I'm not
> sure whether it's worthwhile to do it now, but I don't feel strongly if we
> can still have a reliable set of default test cases.

We first need a way to enable them for the migration CI. Do we have a
variable that CI understands that can be used to enable slow tests?

>
>> Besides, the main strength of CI is to catch bugs when someone makes a
>> code change. If people touch migration code, then we'll run it in our CI
>> anyway. If they touch device code and that device is migrated by default
>> then any one of the simple tests will catch the issue when it runs via
>> the migration-compat job. If the device is not enabled by default, then
>> no tests will catch it.
>>
>> The worst case scenario is they touch some code completely unrelated and
>> their 'make check' or CI run breaks because of some race in the
>> migration code. That's not what CI is for, that's just an annoyance for
>> everyone. I'd rather it breaks in our hands and then we'll go fix it.
>>
>> >
>> > IMHO we can definitely, at least, move the last two into slow list
>> > (vcpu_dirty_limit and xbzrle), then it'll already save us 40sec each run..
>>
>> Agreed. I'll send a patch once I get out from under downstream stuff.
>>
>> >
>> >>
>> >> 1- allows adding devices to QEMU cmdline for migration-test
>> >> https://lore.kernel.org/r/20240523201922.28007-4-faro...@suse.de
>>
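[Editor's note: the agreement above, moving the two slowest subtests (vcpu_dirty_limit and xbzrle) behind the slow list, starts from a ranking like the one Peter pasted. That ranking is easy to regenerate from any "test-name seconds" listing. A small sketch follows; the canned input reuses a few of the figures quoted in the thread, and in a real run the numbers would come from the test harness output rather than a here-document.]

```shell
# Rank tests by wall-clock seconds, slowest first; the top entries are the
# natural candidates for demotion to the -m slow set. Figures are the ones
# quoted earlier in the thread (a subset, for brevity).
slowest=$(sort -rn -k2 <<'EOF' | head -n 2
/x86_64/migration/multifd/tcp/tls/x509/override-host 2.86
/x86_64/migration/postcopy/recovery/tls/psk 3.81
/x86_64/migration/vcpu_dirty_limit 13.29
/x86_64/migration/precopy/unix/xbzrle 27.55
EOF
)
printf '%s\n' "$slowest"
```

With the full ten-entry listing, the same pipeline surfaces exactly the two tests the thread agreed to demote, which together account for roughly 40 seconds per run.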