Peter Xu <pet...@redhat.com> writes:

> On Wed, Sep 11, 2024 at 04:48:21PM -0300, Fabiano Rosas wrote:
>> Peter Xu <pet...@redhat.com> writes:
>>
>> > On Tue, Sep 10, 2024 at 07:23:43PM -0300, Fabiano Rosas wrote:
>> >> Peter Xu <pet...@redhat.com> writes:
>> >>
>> >> > On Mon, Sep 09, 2024 at 06:54:46PM -0300, Fabiano Rosas wrote:
>> >> >> Peter Xu <pet...@redhat.com> writes:
>> >> >>
>> >> >> > On Mon, Sep 09, 2024 at 03:02:57PM +0100, Peter Maydell wrote:
>> >> >> >> On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.hu...@smartx.com> wrote:
>> >> >> >> >
>> >> >> >> > Despite the fact that the responsive CPU throttle is enabled, the dirty sync count may not always increase because this is an optimization that might not happen in any situation.
>> >> >> >> >
>> >> >> >> > This test case just making sure it doesn't interfere with any current functionality.
>> >> >> >> >
>> >> >> >> > Signed-off-by: Hyman Huang <yong.hu...@smartx.com>
>> >> >> >>
>> >> >> >> tests/qtest/migration-test already runs 75 different subtests, takes up a massive chunk of our "make check" time, and is very commonly a "times out" test on some of our CI jobs. It runs on five different guest CPU architectures, each one of which takes between 2 and 5 minutes to complete the full migration-test.
>> >> >> >>
>> >> >> >> Do we really need to make it even bigger?
>> >> >> >
>> >> >> > I'll try to find some time in the next few weeks looking into this to see whether we can further shrink migration test times after previous attemps from Dan. At least a low hanging fruit is we should indeed put some more tests into g_test_slow(), and this new test could also be a candidate (then we can run "-m slow" for migration PRs only).
>> >> >>
>> >> >> I think we could (using -m slow or any other method) separate tests that are generic enough that every CI run should benefit from them vs. tests that are only useful once someone starts touching migration code. I'd say very few in the former category and most of them in the latter.
>> >> >>
>> >> >> For an idea of where migration bugs lie, I took a look at what was fixed since 2022:
>> >> >>
>> >> >> # bugs | device/subsystem/arch
>> >> >> ----------------------------------
>> >> >>     54 | migration
>> >> >>     10 | vfio
>> >> >>      6 | ppc
>> >> >>      3 | virtio-gpu
>> >> >>      2 | pcie_sriov, tpm_emulator, vdpa, virtio-rng-pci
>> >> >>      1 | arm, block, gpio, lasi, pci, s390, scsi-disk, virtio-mem, TCG
>> >> >
>> >> > Just curious; how did you collect these?
>> >>
>> >> git log --since=2022 and then squinted at it. I wrote a warning to take this with a grain of salt, but it missed the final version.
>> >>
>> >> >>
>> >> >> From these, ignoring the migration bugs, the migration-tests cover some of: arm, ppc, s390, TCG. The device_opts[1] patch hasn't merged yet, but once it is, then virtio-gpu would be covered and we could investigate adding some of the others.
>> >> >>
>> >> >> For actual migration code issues:
>> >> >>
>> >> >> # bugs | (sub)subsystem | kind
>> >> >> ----------------------------------------------
>> >> >>     13 | multifd        | correctness/races
>> >> >>      8 | ram            | correctness
>> >> >>      8 | rdma:          | general programming
>> >> >
>> >> > 8 rdma bugs??? ouch..
>> >> Mostly caught by a cleanup from Markus. Silly stuff like of mixed signed integer comparisons and bugs in error handling. I don't even want to look too much at it.
>> >>
>> >> ...hopefully this release we'll manage to resolve that situation.
>> >>
>> >> >
>> >> >>      7 | qmp            | new api bugs
>> >> >>      5 | postcopy       | races
>> >> >>      4 | file:          | leaks
>> >> >>      3 | return path    | races
>> >> >>      3 | fd_cleanup     | races
>> >> >>      2 | savevm, aio/coroutines
>> >> >>      1 | xbzrle, colo, dirtyrate, exec:, windows, iochannel, qemufile, arch (ppc64le)
>> >> >>
>> >> >> Here, the migration-tests cover well: multifd, ram, qmp, postcopy, file, rp, fd_cleanup, iochannel, qemufile, xbzrle.
>> >> >>
>> >> >> My suggestion is we run per arch:
>> >> >>
>> >> >> "/precopy/tcp/plain"
>> >> >> "/precopy/tcp/tls/psk/match",
>> >> >> "/postcopy/plain"
>> >> >> "/postcopy/preempt/plain"
>> >> >> "/postcopy/preempt/recovery/plain"
>> >> >> "/multifd/tcp/plain/cancel"
>> >> >> "/multifd/tcp/uri/plain/none"
>> >> >
>> >> > Don't you want to still keep a few multifd / file tests?
>> >>
>> >> Not really, but I won't object if you want to add some more in there. To be honest, I want to get out of people's way as much as I can because having to revisit this every couple of months is stressful to me.
>> >
>> > I hope migration tests are not too obstructive yet so far :) - this discussion is still within a context where we might add one more slow test in CI, and we probably simply should make it a -m slow test.
>> >
>> >> My rationale for those is:
>> >>
>> >> "/precopy/tcp/plain":
>> >>   Smoke test, the most common migration
>> >
>> > E.g. unix is missing, and I'm not sure which we use for e.g. kubevirt.
>> >
>> > And note that even if unix shares the socket iochannel with tcp, it may not work the same. For example, if you remember I mentioned I was looking at threadify the dest qemu receive side, i have a draft there but currently it has a bug to hang a unix migration, not tcp..
>>
>> Ok, let's add a unix one, no problem.
>>
>> > So IMHO it's not easy to justify which we can simply drop, if we still want to provide some good gating in CI.
>>
>> It's not exactly about dropping, it's about which ones need to be run at *every* invocation of make check (x5 because of the multiple archs) and
>
> Yeah, indeed we could consider reducing the number of runs, maybe. However I do remember we used to have migration-test bugs only reproduced with some specific distro, well..
>
> Now I'm looking at the pipelines..
>
> Why 5x, btw?
5x because migration-test supports 5 architectures (i386, x86_64, ppc64, aarch64, s390x), so if your build includes all of those, make check will run migration-test five times.

> I saw alphine, centos, debian, fedora, opensuse, ubuntu, cfi-x86_64, then compat-x86_64. So that's 8x? They're for the same arch (amd64) so far.

I agree we should fine-tune this in CI, but this is a different discussion. We can't currently skip migration-test without skipping make check-qtest. I proposed a while ago to have a separate make check-migration, but people didn't like that; maybe we should revisit the idea.

> Maybe we can skip some indeed.
>
>> at every git branch push in CI (unless -o ci.skip). For gating we don't need all the tests. Many of them are testing the same core code with just a few differences at the margins.
>>
>> > And I won't be 100% surprised if some other non-migration PR (e.g. io/) changed some slight bit that unix is broken and tcp keeps working, for example, then we loose some CI benefits.
>>
>> IMO, these non-migration PRs are exactly what we have to worry about. Because migration PRs would run with -m slow and we'd catch the issue there.
>>
>> >> "/precopy/tcp/tls/psk/match":
>> >>   Something might change in the distro regarding tls. Such as:
>> >>   https://gitlab.com/qemu-project/qemu/-/issues/1937
>> >>
>> >> "/postcopy/plain":
>> >>   Smoke test for postcopy
>> >>
>> >> "/postcopy/preempt/plain":
>> >>   Just to exercise the preempt thread
>> >>
>> >> "/postcopy/preempt/recovery/plain":
>> >>   Recovery has had some issues with races in the past
>> >>
>> >> "/multifd/tcp/plain/cancel":
>> >>   The MVP of catching concurrency issues
>> >>
>> >> "/multifd/tcp/uri/plain/none":
>> >>   Smoke test for multifd
>> >>
>> >> All in all, these will test basic funcionality and run very often. The more tests we add to this set, the less return we get in relation to the time they take.
>> >
>> > This is true. We can try to discuss more on which is more important; I still think something might be good to be added on top of above.
>>
>> Sure, just add what you think we need.
>
> Let's leave the detailed discussion on what to choose until the patch phase. IIUC this can be the last we do.

ok

> We've just queued your other patch to "slow down" the two time consuming tests, we save 40s for each CI kick (total ~90s now locally, so that's 30% cut already all acrosss!), not so bad as a start.
>
> Then if we can remove some out of 8 (e.g., we can choose 2-3), then if it's 3 system runs, it'll be another 62% cut, a total of:
>
> 1.0 - 9/13 * 3/8 = ~75%
>
> So if we keep removing 5 system running it we can reduce it to only a quarter time comparing to before for each CI run.
>
> Would that be good enough already for us to live for quite a few more releases? They're so far all low hanging fruits.

Reducing the runs in CI is not a low-hanging fruit. We'd need to first split migration-test from check-qtest somehow. But that would mean not running migration-test at all during make check, which is worse than just reducing migration-test while keeping it in make check.

> If we want to move further (and you think we should then start with reducing test case, rather than profiling the test cases), then feel free to go ahead and send a patch when you're ready. But maybe you don't want to bother after you notice that we can already shrink 75% of runtime even without dropping anything. Your call!
I don't think we're discussing total CI time at this point, so the math doesn't really add up. We're not looking into making the CI finish faster. We're looking into making migration-test finish faster. That would reduce timeouts in CI, speed up make check, and reduce the chance of random race conditions* affecting other people/staging runs.

*- like the one I'm debugging at this very moment in multifd+ppc. Are you aware of that, btw?

>
>>
>> > There's also the other way - at some point, I still want to improve migration-test run speed, and please have a look if you like too at some point: so there's still chance (average is ~2sec as of now), IMHO, we don't lose anything in CI but runs simply faster.
>>
>> My only idea, but it requires a bit of work, is to do unit testing on the interfaces. Anything before migration_fd_connect(). Then we could stop doing a full migration for those.
>
> What I'm thinking is some further profiling, like what Dan used to do. I feel like we still have chance, considering what we run as guest image is simply a loop.. so basically zero boot time logically.

I started looking again today into code coverage tools to see if we can trim the tests that are redundant. But no promises; I already have a queue of stuff that's pending.

>> >> > IIUC some file ops can still be relevant to archs. Multifd still has one bug that can only reproduce on arm64.. but not x86_64. I remember it's a race condition when migration finishes, and the issue could be memory ordering relevant, but maybe not.
>> >>
>> >> I'm not aware of anything. I believe the last arm64 bug we had was the threadinfo stuff[1]. If you remember what it's about, let me know.
>> >>
>> >> 1- 01ec0f3a92 ("migration/multifd: Protect accesses to migration_threads").
>> >
>> > https://issues.redhat.com/browse/RHEL-45628
>>
>> Interesting. But if multifd/tcp/plain/cancel doesn't catch this I don't think other tests will. Also, we don't have tests for zerocopy AFAIK.
>
> That'll be challenging as it requires locked_vm limit.
>
>> > On RH side Bandan is looking at it, but he's during a long holidays recently.
>>
>> Good luck to him.
>>
>> >> >> and x86 gets extra:
>> >> >>
>> >> >> "/precopy/unix/suspend/live"
>> >> >> "/precopy/unix/suspend/notlive"
>> >> >> "/dirty_ring"
>> >> >
>> >> > dirty ring will be disabled anyway when !x86, so probably not a major concern.
>> >> >
>> >> >> (the other dirty_* tests are too slow)
>> >> >
>> >> > These are the 10 slowest tests when I run locally:
>> >> >
>> >> > /x86_64/migration/multifd/tcp/tls/x509/allow-anon-client    2.41
>> >> > /x86_64/migration/postcopy/recovery/plain                   2.43
>> >> > /x86_64/migration/multifd/tcp/tls/x509/default-host         2.66
>> >> > /x86_64/migration/multifd/tcp/tls/x509/override-host        2.86
>> >> > /x86_64/migration/postcopy/tls/psk                          2.91
>> >> > /x86_64/migration/postcopy/preempt/recovery/tls/psk         3.08
>> >> > /x86_64/migration/postcopy/preempt/tls/psk                  3.30
>> >> > /x86_64/migration/postcopy/recovery/tls/psk                 3.81
>> >> > /x86_64/migration/vcpu_dirty_limit                          13.29
>> >> > /x86_64/migration/precopy/unix/xbzrle                       27.55
>> >> >
>> >> > Are you aware of people using xbzrle at all?
>> >>
>> >> Nope.
>> >>
>> >> >> All the rest go behind a knob that people touching migration code will enable.
>> >> >>
>> >> >> wdyt?
>> >> > Agree with the general idea, but I worry above exact list can be too small.
>> >>
>> >> We won't stop running the full set of tests. We can run them in our forks' CI as much as we want. There are no cases of people reporting a migration bug because their 'make check' caught something that ours didn't.
>> >
>> > IIUC it's hard to say - when the test is in CI maintainers can catch them already before sending a formal PR.
>> >
>> > If the test is run by default in make check, a developer can trigger a migration issue (with his/her patch applied) then one can notice it introduced a bug, fix it, then post the patches. We won't know whether that happened.
>>
>> Good point.
>>
>> > So one thing we can do (if you think worthwhile to do it now) is we shrink the default test case a lot as you proposed, then we wait and see what breaks, and then we gradually add tests back when it can be used to find breakages. But that'll also take time if it really can find such tests, because then we'll need to debug them one by one (instead of developer / maintainer looking into them with their expertise knowledge..). I'm not sure whether it's worthwhile to do it now, but I don't feel strongly if we can still have a reliable set of default test cases.
>>
>> We first need a way to enable them for the migration CI. Do we have a variable that CI understands that can be used to enable slow tests?
>
> I'm not aware of. Yes sounds like we should have one if it doesn't exist.
>
>>
>> >> Besides, the main strength of CI is to catch bugs when someone makes a code change. If people touch migration code, then we'll run it in our CI anyway. If they touch device code and that device is migrated by default then any one of the simple tests will catch the issue when it runs via the migration-compat job. If the device is not enabled by default, then no tests will catch it.
>> >>
>> >> The worst case scenario is they touch some code completely unrelated and their 'make check' or CI run breaks because of some race in the migration code. That's not what CI is for, that's just an annoyance for everyone. I'd rather it breaks in our hands and then we'll go fix it.
>> >>
>> >> > IMHO we can definitely, at least, move the last two into slow list (vcpu_dirty_limit and xbzrle), then it'll already save us 40sec each run..
>> >>
>> >> Agreed. I'll send a patch once I get out from under downstream stuff.
>> >>
>> >> >> 1- allows adding devices to QEMU cmdline for migration-test
>> >> >>    https://lore.kernel.org/r/20240523201922.28007-4-faro...@suse.de
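To make the g_test_slow() / -m slow idea above a bit more concrete, here is a minimal sketch using plain glib test registration. migration-test has its own add-test wrapper and real test bodies, so treat the function and test names below as illustrative only:

#include <glib.h>

/* Always-on smoke test, e.g. the plain tcp precopy case. */
static void test_precopy_tcp_plain(void)
{
    /* start source/destination QEMU, migrate, verify the result ... */
}

/* Expensive case that only migration developers need routinely. */
static void test_precopy_unix_xbzrle(void)
{
    /* ... */
}

int main(int argc, char **argv)
{
    g_test_init(&argc, &argv, NULL);

    g_test_add_func("/migration/precopy/tcp/plain", test_precopy_tcp_plain);

    /* Only registered when the harness is invoked with -m slow. */
    if (g_test_slow()) {
        g_test_add_func("/migration/precopy/unix/xbzrle",
                        test_precopy_unix_xbzrle);
    }

    return g_test_run();
}

A default run would then only see the short list, while a migration pull request (or anyone passing -m slow) gets the whole matrix.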
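On the "variable that CI understands" question: I don't know of an existing one either, but nothing stops the test binary from checking the environment itself. A sketch, with QEMU_TEST_MIGRATION_FULL as a made-up name (nothing in the tree or in CI defines it today):

#include <glib.h>

static void test_multifd_tcp_cancel(void)
{
    /* ... */
}

int main(int argc, char **argv)
{
    gboolean full;

    g_test_init(&argc, &argv, NULL);

    /* The short list from the previous sketch would be registered
     * unconditionally here. */

    /* Hypothetical knob: honour either -m slow or an environment
     * variable that a CI job dedicated to migration could export. */
    full = g_test_slow() ||
           g_getenv("QEMU_TEST_MIGRATION_FULL") != NULL;

    if (full) {
        g_test_add_func("/migration/multifd/tcp/plain/cancel",
                        test_multifd_tcp_cancel);
    }

    return g_test_run();
}

A migration-specific gitlab job (or a maintainer's fork) would then just set that variable, while normal pipelines and plain make check keep the reduced set.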
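And for the "unit testing on the interfaces" idea (everything up to migration_fd_connect()), something along these lines should already be possible with libqtest, without booting a guest or running a full migration. This is only a sketch under the assumption that the usual libqtest helpers are enough; the parameter values are arbitrary:

#include "qemu/osdep.h"
#include "libqtest.h"

/*
 * Exercise only the QMP surface of migration: set capabilities and
 * parameters and read them back, without ever starting a migration.
 */
static void test_migrate_params_roundtrip(void)
{
    QTestState *qts = qtest_init("-nodefaults");

    qtest_qmp_assert_success(qts,
        "{ 'execute': 'migrate-set-parameters',"
        "  'arguments': { 'multifd-channels': 4 } }");

    qtest_qmp_assert_success(qts,
        "{ 'execute': 'migrate-set-capabilities',"
        "  'arguments': { 'capabilities': ["
        "    { 'capability': 'xbzrle', 'state': false } ] } }");

    qtest_qmp_assert_success(qts,
        "{ 'execute': 'query-migrate-parameters' }");

    qtest_quit(qts);
}

int main(int argc, char **argv)
{
    g_test_init(&argc, &argv, NULL);
    qtest_add_func("/migration/unit/params-roundtrip",
                   test_migrate_params_roundtrip);
    return g_test_run();
}

That wouldn't replace the end-to-end cases, but it would let a chunk of the parameter/capability handling be covered in milliseconds instead of a full guest run.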