Re: [PATCH] tests: improve performance of device-introspect-test

Markus Armbruster Fri, 10 Jul 2020 13:04:55 -0700

Daniel P. Berrangé <berra...@redhat.com> writes:

> Total execution time with "-m slow" and x86_64 QEMU, drops from 3
> minutes 15 seconds, down to 54 seconds.
>
> Individual tests drop from 17-20 seconds, down to 3-4 seconds.


Nice!

A few observations on this test (impatient readers may skip to
"Conclusions"):

* Your testing rig is a fair bit faster than mine.

* Run time is almost completely spent in the
  /TARGET/device/introspect/concrete/*/* cases.

  Each such case iterates over all known device types.  Each iteration
  introspects one device.

  Cases concrete/*/M use machine type M.  The test ignores "old"
  versioned machine types; see qtest_is_old_versioned_machine().

  Cases concrete/nodefaults/* use -nodefaults, cases concrete/defaults/*
  don't.

* The number of known device types varies between targets from 33
  (tricore) to several hundreds (x86_64+i386: 421, ppc 593, arm 667,
  aarch64 680, ppc64 689).  Median is 215, sum is 7485.

  Aside: why on earth do we need 334 ppc64 CPUs?

* The number of tested machine types varies between targets from two to
  over 60 (arm 62, aarch64 66).  Median is 4, sum is 286.

* Without -m slow, only /TARGET/device/introspect/concrete/defaults/none
  is run.

  This tests #devices introspections, i.e. between 33 and 689 depending
  on target, 7485 for all targets together.

  Run times for master (commit eb2c66b10e) vary for me from well under
  one second to almost 6 seconds, with x86_64 and i386 the slowest,
  followed by aarch64, arm, ppc64, pcc. s390x.  This is what "make
  check" runs.  Total run time for all 32 targets is ~72s.  That's
  roughly 100 introspections per second.

  Note I measured manual runs, like so:

  $ QEMU_AUDIO_DRV=none QTEST_QEMU_BINARY="$TARGET-softmmu/qemu-system-$TARGET" 
QTEST_QEMU_IMG=qemu-img time tests/qtest/device-introspect-test

* With -m slow, we run all cases.  This is relatively recent: commit
  410573aa2c 'tests/device-introspect: Test with all machines, not only
  with "none"'.  The commit message explains:

    Certain device introspection crashes used to only happen if you were
    using a certain machine, e.g. if the machine was using serial_hd() or
    nd_table[], and a device was trying to use these in its instance_init
    function, too.

  Aside: these are things an instance_init() should not do.  But
  catching things that should not be done is one reason for having
  tests.  This one is an awfully expensive catcher, but until someone
  has better ideas...

  With -m slow, we test 2 * #machines * #devices introspections,
  i.e. from 132 (tricore) to over 10k (ppc 13046, ppc64 23426, arm
  82708, aarch64 89760).  Median is ~1600, sum is ~260k.

  Except we actually test just 89k now, because the test *fails* for arm
  and aarch64 after some 500 introspections: introspecting device
  msf2-soc with machine ast2600-evb makes QEMU terminate unsuccessfully
  with "Unsupported NIC model: ftgmac100".  Cause: m2sxxx_soc_initfn()
  calls qemu_check_nic_model().  Goes back to commit 05b7374a58 "msf2:
  Add EMAC block to SmartFusion2 SoC", merged some ten weeks ago.  This
  is exactly the kind of mistake the test is designed to catch.  How
  come it wasn't?  Possibly due to unlucky combination with the slowdown
  discussed in the next item (but that was less than four weeks ago).
  I'll post a patch to fix the bug.

  Run time increases even more than the number of introspections.  Total
  run time for all targets is around one hour, which is less than 25
  introspections per second.  ppc64 is the worst by far: 26 *minutes*
  for 23k tests, less than 15 introspections per second.  sparc and rx
  achieve a paltry 5 introspections per second.

* Recent commit e8c9e65816 'qom: Make "info qom-tree" show children
  sorted' made things worse.  Reverting it cuts run time without -m slow
  from 72s to 43s (100 and 170 introspections per second, respectively),
  and with -m slow from one hour to 6 minutes (24 and 250 introspections
  per second, respectively).

  The speedup with -m slow varies much between targets: from less than 2
  (cris, moxie, nios2, or1k, s390x, tricore, unicore32) to more than 20
  (ppc64, rx, sparc).  The most sluggish introspections profit the most.

  Paolo pointed out that the sorting is n^2.  True.  I did not expect n
  to be large enough for that to matter.  What's being sorted is the
  children of each node in the QOM composition tree.  The number of
  children is almost always small.  The performance difference proves
  enough of them are large enough for the sorting to matter.  The
  variance between machine types suggests some machines have nodes with
  a particularly large number of children.

* I found a memory leak in the same commit.  Plugging it makes the test
  slightly slower for me.  I'll post the fix.

* Since some of the n appear to be large enough for n^2 to matter, I
  tried sorting more efficiently.

  With g_queue_insert_sorted() replaced by g_queue_push_head() &
  g_queue_sort(), the test without -m slow is *slower* : 80s instead of
  72s.  Same for g_slist_prepend() & g_slist_sort().  I believe that's
  because the n are indeed small for the none machine.

  -m slow gets faster: 34 minutes instead of one hour (44 introspections
  per second).  Better, I wouldn't call it good.

* These modest time savings are dwarved many times over by Daniel's
  patch.  With his patch and my leak fix, the test takes 51s without -m
  slow (down from 72s), and 7:29 minutes with -m slow (down from 1h).
  The test is still a pig, but anything that tests a hundred thousand
  QMP commands is bound to be one.

  My "info qom-tree" improvements don't matter anymore then, at least
  for this test: 50s vs. 51s, and 449s vs. 456s.

* Fixing the msf2-soc bug more than *doubles* the run time with -m slow:
  from 7.5 to 18 minutes.

* Measurements in tabular form:

                                without -m slow     with -m slow
  master                        72s  (100 / s)      ~1h  ( 24 / s)
  sort reverted                 43s  (170 / s)      362s (250 / s)
  sort redone, leak plugged     80s  ( 90 / s)      ~34m ( 44 / s)
  Daniels' patch, leak plugged  51s  (170 / s)      456s (200 / s)
  & sort redone                 50s  (150 / s)      449s (200 / s)
  & msf2-soc fixed              no change           ~18m (240 / s)

Conclusions:

* The test matrix is *expensive*.  Testing even a simple QMP query is
  when you do it a quarter million times.  ARM is the greediest pig by
  far (170k introspections, almost two thirds of the total!), followed
  by ppc (36k), x86 (12k) and mips (11k).  Ideas on trimming excess are
  welcome.  I'm not even sure anymore this should be a qtest.

* Running "info qom-tree" and "info qtree" in the order of a hundred
  thousand times is a bad idea.  Daniel's patch makes sense.

* Making info qom-tree build unsorted lists, then sorting them
  efficiently only matters when you run "info qom-tree" way more than
  you should.  It helps more with machines where longer lists exist than
  it hurts with machines where the lists are all short.

> The cost of this change is that any QOM bugs resulting in the test
> failure will not be directly associated with the device that caused
> the failure. The test case is not frequently identifying such bugs
> though,

Most likely because it identifies them long before the patches hit the
list :)

>         and the cause is likely easily visible in the patch series
> that causes the failure. So overall the shorter running time is
> considered the more important factor.

The time savings are simply too large to be left on the table.

I agree with the idea that most of the time the offending device should
be fairly obvious.  But when it isn't, having a way to let the test do
the work of pinpointing the device would be sooo nice.  Can we keep the
slow & thorough checking as a run-time option?

> Signed-off-by: Daniel P. Berrangé <berra...@redhat.com>

Thanks!

Re: [PATCH] tests: improve performance of device-introspect-test

Reply via email to