Daniel P. Berrangé <berra...@redhat.com> writes: > Total execution time with "-m slow" and x86_64 QEMU, drops from 3 > minutes 15 seconds, down to 54 seconds. > > Individual tests drop from 17-20 seconds, down to 3-4 seconds.
Nice! A few observations on this test (impatient readers may skip to "Conclusions"): * Your testing rig is a fair bit faster than mine. * Run time is almost completely spent in the /TARGET/device/introspect/concrete/*/* cases. Each such case iterates over all known device types. Each iteration introspects one device. Cases concrete/*/M use machine type M. The test ignores "old" versioned machine types; see qtest_is_old_versioned_machine(). Cases concrete/nodefaults/* use -nodefaults, cases concrete/defaults/* don't. * The number of known device types varies between targets from 33 (tricore) to several hundreds (x86_64+i386: 421, ppc 593, arm 667, aarch64 680, ppc64 689). Median is 215, sum is 7485. Aside: why on earth do we need 334 ppc64 CPUs? * The number of tested machine types varies between targets from two to over 60 (arm 62, aarch64 66). Median is 4, sum is 286. * Without -m slow, only /TARGET/device/introspect/concrete/defaults/none is run. This tests #devices introspections, i.e. between 33 and 689 depending on target, 7485 for all targets together. Run times for master (commit eb2c66b10e) vary for me from well under one second to almost 6 seconds, with x86_64 and i386 the slowest, followed by aarch64, arm, ppc64, pcc. s390x. This is what "make check" runs. Total run time for all 32 targets is ~72s. That's roughly 100 introspections per second. Note I measured manual runs, like so: $ QEMU_AUDIO_DRV=none QTEST_QEMU_BINARY="$TARGET-softmmu/qemu-system-$TARGET" QTEST_QEMU_IMG=qemu-img time tests/qtest/device-introspect-test * With -m slow, we run all cases. This is relatively recent: commit 410573aa2c 'tests/device-introspect: Test with all machines, not only with "none"'. The commit message explains: Certain device introspection crashes used to only happen if you were using a certain machine, e.g. if the machine was using serial_hd() or nd_table[], and a device was trying to use these in its instance_init function, too. Aside: these are things an instance_init() should not do. But catching things that should not be done is one reason for having tests. This one is an awfully expensive catcher, but until someone has better ideas... With -m slow, we test 2 * #machines * #devices introspections, i.e. from 132 (tricore) to over 10k (ppc 13046, ppc64 23426, arm 82708, aarch64 89760). Median is ~1600, sum is ~260k. Except we actually test just 89k now, because the test *fails* for arm and aarch64 after some 500 introspections: introspecting device msf2-soc with machine ast2600-evb makes QEMU terminate unsuccessfully with "Unsupported NIC model: ftgmac100". Cause: m2sxxx_soc_initfn() calls qemu_check_nic_model(). Goes back to commit 05b7374a58 "msf2: Add EMAC block to SmartFusion2 SoC", merged some ten weeks ago. This is exactly the kind of mistake the test is designed to catch. How come it wasn't? Possibly due to unlucky combination with the slowdown discussed in the next item (but that was less than four weeks ago). I'll post a patch to fix the bug. Run time increases even more than the number of introspections. Total run time for all targets is around one hour, which is less than 25 introspections per second. ppc64 is the worst by far: 26 *minutes* for 23k tests, less than 15 introspections per second. sparc and rx achieve a paltry 5 introspections per second. * Recent commit e8c9e65816 'qom: Make "info qom-tree" show children sorted' made things worse. Reverting it cuts run time without -m slow from 72s to 43s (100 and 170 introspections per second, respectively), and with -m slow from one hour to 6 minutes (24 and 250 introspections per second, respectively). The speedup with -m slow varies much between targets: from less than 2 (cris, moxie, nios2, or1k, s390x, tricore, unicore32) to more than 20 (ppc64, rx, sparc). The most sluggish introspections profit the most. Paolo pointed out that the sorting is n^2. True. I did not expect n to be large enough for that to matter. What's being sorted is the children of each node in the QOM composition tree. The number of children is almost always small. The performance difference proves enough of them are large enough for the sorting to matter. The variance between machine types suggests some machines have nodes with a particularly large number of children. * I found a memory leak in the same commit. Plugging it makes the test slightly slower for me. I'll post the fix. * Since some of the n appear to be large enough for n^2 to matter, I tried sorting more efficiently. With g_queue_insert_sorted() replaced by g_queue_push_head() & g_queue_sort(), the test without -m slow is *slower* : 80s instead of 72s. Same for g_slist_prepend() & g_slist_sort(). I believe that's because the n are indeed small for the none machine. -m slow gets faster: 34 minutes instead of one hour (44 introspections per second). Better, I wouldn't call it good. * These modest time savings are dwarved many times over by Daniel's patch. With his patch and my leak fix, the test takes 51s without -m slow (down from 72s), and 7:29 minutes with -m slow (down from 1h). The test is still a pig, but anything that tests a hundred thousand QMP commands is bound to be one. My "info qom-tree" improvements don't matter anymore then, at least for this test: 50s vs. 51s, and 449s vs. 456s. * Fixing the msf2-soc bug more than *doubles* the run time with -m slow: from 7.5 to 18 minutes. * Measurements in tabular form: without -m slow with -m slow master 72s (100 / s) ~1h ( 24 / s) sort reverted 43s (170 / s) 362s (250 / s) sort redone, leak plugged 80s ( 90 / s) ~34m ( 44 / s) Daniels' patch, leak plugged 51s (170 / s) 456s (200 / s) & sort redone 50s (150 / s) 449s (200 / s) & msf2-soc fixed no change ~18m (240 / s) Conclusions: * The test matrix is *expensive*. Testing even a simple QMP query is when you do it a quarter million times. ARM is the greediest pig by far (170k introspections, almost two thirds of the total!), followed by ppc (36k), x86 (12k) and mips (11k). Ideas on trimming excess are welcome. I'm not even sure anymore this should be a qtest. * Running "info qom-tree" and "info qtree" in the order of a hundred thousand times is a bad idea. Daniel's patch makes sense. * Making info qom-tree build unsorted lists, then sorting them efficiently only matters when you run "info qom-tree" way more than you should. It helps more with machines where longer lists exist than it hurts with machines where the lists are all short. > The cost of this change is that any QOM bugs resulting in the test > failure will not be directly associated with the device that caused > the failure. The test case is not frequently identifying such bugs > though, Most likely because it identifies them long before the patches hit the list :) > and the cause is likely easily visible in the patch series > that causes the failure. So overall the shorter running time is > considered the more important factor. The time savings are simply too large to be left on the table. I agree with the idea that most of the time the offending device should be fairly obvious. But when it isn't, having a way to let the test do the work of pinpointing the device would be sooo nice. Can we keep the slow & thorough checking as a run-time option? > Signed-off-by: Daniel P. Berrangé <berra...@redhat.com> Thanks!