Peter Xu <pet...@redhat.com> writes:

> On Tue, Sep 10, 2024 at 07:23:43PM -0300, Fabiano Rosas wrote:
>> Peter Xu <pet...@redhat.com> writes:
>> 
>> > On Mon, Sep 09, 2024 at 06:54:46PM -0300, Fabiano Rosas wrote:
>> >> Peter Xu <pet...@redhat.com> writes:
>> >> 
>> >> > On Mon, Sep 09, 2024 at 03:02:57PM +0100, Peter Maydell wrote:
>> >> >> On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.hu...@smartx.com> 
>> >> >> wrote:
>> >> >> >
>> >> >> > Even though the responsive CPU throttle is enabled, the
>> >> >> > dirty sync count may not always increase, because the
>> >> >> > throttle is an optimization that may not take effect in
>> >> >> > every situation.
>> >> >> >
>> >> >> > This test case just makes sure it doesn't interfere with any
>> >> >> > current functionality.
>> >> >> >
>> >> >> > Signed-off-by: Hyman Huang <yong.hu...@smartx.com>
>> >> >> 
>> >> >> tests/qtest/migration-test already runs 75 different
>> >> >> subtests, takes up a massive chunk of our "make check"
>> >> >> time, and is very commonly a "times out" test on some
>> >> >> of our CI jobs. It runs on five different guest CPU
>> >> >> architectures, each one of which takes between 2 and
>> >> >> 5 minutes to complete the full migration-test.
>> >> >> 
>> >> >> Do we really need to make it even bigger?
>> >> >
>> >> > I'll try to find some time in the next few weeks to look into this and
>> >> > see whether we can further shrink migration test times after Dan's
>> >> > previous attempts.  One low-hanging fruit: we should indeed put some
>> >> > more tests into g_test_slow(), and this new test could also be a
>> >> > candidate (then we can run "-m slow" for migration PRs only).
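>> >> >
>> >> > For instance (just a sketch, assuming the usual glib registration
>> >> > pattern in tests/qtest/migration-test.c; names illustrative):
>> >> >
>> >> >     /* Register the expensive variant only when "-m slow" is
>> >> >      * given; g_test_slow() reports whether slow mode is active. */
>> >> >     if (g_test_slow()) {
>> >> >         migration_test_add("/migration/precopy/unix/xbzrle",
>> >> >                            test_precopy_unix_xbzrle);
>> >> >     }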
>> >> 
>> >> I think we could (using -m slow or any other method) separate tests
>> >> that are generic enough that every CI run should benefit from them
>> >> vs. tests that are only useful once someone starts touching migration
>> >> code. I'd say very few in the former category and most of them in the
>> >> latter.
>> >> 
>> >> For an idea of where migration bugs lie, I took a look at what was
>> >> fixed since 2022:
>> >> 
>> >> # bugs | device/subsystem/arch
>> >> ----------------------------------
>> >>     54 | migration
>> >>     10 | vfio
>> >>      6 | ppc
>> >>      3 | virtio-gpu
>> >>      2 | pcie_sriov, tpm_emulator,
>> >>           vdpa, virtio-rng-pci
>> >>      1 | arm, block, gpio, lasi,
>> >>           pci, s390, scsi-disk,
>> >>           virtio-mem, TCG
>> >
>> > Just curious; how did you collect these?
>> 
>> git log --since=2022 and then squinted at it. I wrote a warning to take
>> this with a grain of salt, but it didn't make it into the final version.
>> 
>> >
>> >> 
>> >> From these, ignoring the migration bugs, the migration-tests cover some
>> >> of: arm, ppc, s390, TCG. The device_opts[1] patch hasn't been merged
>> >> yet, but once it is, virtio-gpu would be covered and we could
>> >> investigate adding some of the others.
>> >> 
>> >> For actual migration code issues:
>> >> 
>> >> # bugs | (sub)subsystem | kind
>> >> ----------------------------------------------
>> >>     13 | multifd        | correctness/races
>> >>      8 | ram            | correctness
>> >>      8 | rdma:          | general programming
>> >
>> > 8 rdma bugs??? ouch..
>> 
>> Mostly caught by a cleanup from Markus. Silly stuff like mixed
>> signed/unsigned integer comparisons and bugs in error handling. I don't
>> even want to look too much at it.
>> 
>> ...hopefully this release we'll manage to resolve that situation.
>> 
>> >
>> >>      7 | qmp            | new api bugs
>> >>      5 | postcopy       | races
>> >>      4 | file:          | leaks
>> >>      3 | return path    | races
>> >>      3 | fd_cleanup     | races
>> >>      2 | savevm, aio/coroutines
>> >>      1 | xbzrle, colo, dirtyrate, exec:,
>> >>           windows, iochannel, qemufile,
>> >>           arch (ppc64le)
>> >> 
>> >> Here, the migration-tests cover well: multifd, ram, qmp, postcopy,
>> >> file, rp, fd_cleanup, iochannel, qemufile, xbzrle.
>> >> 
>> >> My suggestion is we run per arch:
>> >> 
>> >> "/precopy/tcp/plain"
>> >> "/precopy/tcp/tls/psk/match",
>> >> "/postcopy/plain"
>> >> "/postcopy/preempt/plain"
>> >> "/postcopy/preempt/recovery/plain"
>> >> "/multifd/tcp/plain/cancel"
>> >> "/multifd/tcp/uri/plain/none"
>> >
>> > Don't you want to still keep a few multifd / file tests?
>> 
>> Not really, but I won't object if you want to add some more in there. To
>> be honest, I want to get out of people's way as much as I can because
>> having to revisit this every couple of months is stressful to me.
>
> I hope migration tests have not been too obstructive so far :) - this
> discussion is still within a context where we might add one more slow test
> to CI, and we should probably simply make it a -m slow test.
>
>> 
>> My rationale for those is:
>> 
>> "/precopy/tcp/plain":
>>  Smoke test, the most common migration
>
> E.g. unix is missing, and I'm not sure which we use for e.g. kubevirt.
>
> And note that even if unix shares the socket iochannel with tcp, it may not
> behave the same.  For example, if you remember, I mentioned I was looking
> at threadifying the dest QEMU receive side; I have a draft there, but it
> currently has a bug that hangs a unix migration, not a tcp one..

Ok, let's add a unix one, no problem.

>
> So IMHO it's not easy to justify which ones we can simply drop, if we still
> want to provide some good gating in CI.

It's not exactly about dropping; it's about which ones need to run at
*every* invocation of make check (x5 because of the multiple archs) and
at every git branch push in CI (unless -o ci.skip). For gating we don't
need all the tests. Many of them are testing the same core code with
just a few differences at the margins.

>  And I won't be 100% surprised if some
> other non-migration PR (e.g. io/) changes some slight bit so that unix
> breaks while tcp keeps working, for example; then we lose some CI benefits.

IMO, these non-migration PRs are exactly what we have to worry about,
because migration PRs would run with -m slow and we'd catch the issue
there.

>
>> 
>> "/precopy/tcp/tls/psk/match":
>>  Something might change in the distro regarding tls. Such as:
>>  https://gitlab.com/qemu-project/qemu/-/issues/1937
>> 
>> "/postcopy/plain":
>>  Smoke test for postcopy
>> 
>> "/postcopy/preempt/plain":
>>  Just to exercise the preempt thread
>> 
>> "/postcopy/preempt/recovery/plain":
>>  Recovery has had some issues with races in the past
>> 
>> "/multifd/tcp/plain/cancel":
>>  The MVP of catching concurrency issues
>> 
>> "/multifd/tcp/uri/plain/none":
>>  Smoke test for multifd
>> 
>> All in all, these will test basic functionality and run very often. The
>> more tests we add to this set, the less return we get in relation to the
>> time they take.
>
> This is true.  We can try to discuss more on which is more important; I
> still think something might be good to be added on top of above.

Sure, just add what you think we need.

>
> There's also the other way - at some point, I still want to improve
> migration-test run speed; please have a look too at some point if you
> like.  There's still room (the average is ~2sec per test as of now):
> IMHO we wouldn't lose anything in CI, it would simply run faster.

My only idea, but it requires a bit of work, is to do unit testing on
the interfaces. Anything before migration_fd_connect(). Then we could
stop doing a full migration for those.
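
Something like this, as a rough sketch (assuming migrate_uri_parse()
and the generated QAPI types can be linked into a standalone test;
the test itself is illustrative, not an existing one):

    /* Hypothetical unit test: validate URI/channel parsing in
     * isolation, never reaching migration_fd_connect(). */
    static void test_channel_parse_tcp(void)
    {
        MigrationChannel *channel = NULL;
        Error *err = NULL;

        g_assert_true(migrate_uri_parse("tcp:127.0.0.1:0", &channel, &err));
        g_assert(err == NULL);
        g_assert_cmpint(channel->addr->transport, ==,
                        MIGRATION_ADDRESS_TYPE_SOCKET);
        qapi_free_MigrationChannel(channel);
    }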

>
>> 
>> >
>> > IIUC some file ops can still be relevant to archs.  Multifd still has one
>> > bug that can only be reproduced on arm64.. but not x86_64.  I remember
>> > it's a race condition when migration finishes, and the issue could be
>> > memory-ordering related, but maybe not.
>> 
>> I'm not aware of anything. I believe the last arm64 bug we had was the
>> threadinfo stuff[1]. If you remember what it's about, let me know.
>> 
>> 1- 01ec0f3a92 ("migration/multifd: Protect accesses to migration_threads").
>
> https://issues.redhat.com/browse/RHEL-45628

Interesting. But if multifd/tcp/plain/cancel doesn't catch this I don't
think other tests will. Also, we don't have tests for zerocopy AFAIK.

>
> On the RH side Bandan is looking at it, but he's been on a long holiday
> recently.

Good luck to him.

>
>> 
>> >
>> >> 
>> >> and x86 gets extra:
>> >> 
>> >> "/precopy/unix/suspend/live"
>> >> "/precopy/unix/suspend/notlive"
>> >> "/dirty_ring"
>> >
>> > dirty ring will be disabled anyway when !x86, so probably not a major
>> > concern.
>> >
>> >> 
>> >> (the other dirty_* tests are too slow)
>> >
>> > These are the 10 slowest tests when I run locally:
>> >
>> > /x86_64/migration/multifd/tcp/tls/x509/allow-anon-client 2.41
>> > /x86_64/migration/postcopy/recovery/plain 2.43
>> > /x86_64/migration/multifd/tcp/tls/x509/default-host 2.66
>> > /x86_64/migration/multifd/tcp/tls/x509/override-host 2.86
>> > /x86_64/migration/postcopy/tls/psk 2.91
>> > /x86_64/migration/postcopy/preempt/recovery/tls/psk 3.08
>> > /x86_64/migration/postcopy/preempt/tls/psk 3.30
>> > /x86_64/migration/postcopy/recovery/tls/psk 3.81
>> > /x86_64/migration/vcpu_dirty_limit 13.29
>> > /x86_64/migration/precopy/unix/xbzrle 27.55
>> >
>> > Are you aware of people using xbzrle at all?
>> 
>> Nope.
>> 
>> >
>> >> 
>> >> All the rest go behind a knob that people touching migration code will
>> >> enable.
>> >> 
>> >> wdyt?
>> >
>> > Agree with the general idea, but I worry the above exact list may be too
>> > small.
>> 
>> We won't stop running the full set of tests. We can run them in our
>> forks' CI as much as we want. There are no cases of people reporting a
>> migration bug because their 'make check' caught something that ours
>> didn't.
>
> IIUC it's hard to say - when the test is in CI maintainers can catch them
> already before sending a formal PR.
>
> If the test is run by default in make check, a developer can trigger a
> migration issue (with his/her patch applied), notice that the patch
> introduced a bug, fix it, and then post the patches.  We'd never know
> that happened.

Good point.

>
> So one thing we can do (if you think it's worthwhile to do now) is shrink
> the default test set a lot as you proposed, then wait and see what breaks,
> and gradually add tests back when they prove useful in finding breakages.
> But that'll also take time if it really does find such tests, because then
> we'll need to debug them one by one (instead of the developer / maintainer
> looking into them with their expert knowledge..).  I'm not sure whether
> it's worthwhile to do now, but I don't feel strongly as long as we can
> still have a reliable set of default test cases.

We first need a way to enable them for the migration CI. Do we have a
variable that CI understands that can be used to enable slow tests?

>
>> 
>> Besides, the main strength of CI is to catch bugs when someone makes a
>> code change. If people touch migration code, then we'll run it in our CI
>> anyway. If they touch device code and that device is migrated by default
>> then any one of the simple tests will catch the issue when it runs via
>> the migration-compat job. If the device is not enabled by default, then
>> no tests will catch it.
>> 
>> The worst case scenario is they touch some code completely unrelated and
>> their 'make check' or CI run breaks because of some race in the
>> migration code. That's not what CI is for, that's just an annoyance for
>> everyone. I'd rather it breaks in our hands and then we'll go fix it.
>> 
>> >
>> > IMHO we can definitely, at least, move the last two into the slow list
>> > (vcpu_dirty_limit and xbzrle); that'll already save us 40sec each run..
>> 
>> Agreed. I'll send a patch once I get out from under downstream stuff.
>> 
>> >
>> >> 
>> >> 1- allows adding devices to QEMU cmdline for migration-test
>> >> https://lore.kernel.org/r/20240523201922.28007-4-faro...@suse.de
>> >> 
>> 
