On Thu, Sep 12, 2024 at 07:52:48PM -0300, Fabiano Rosas wrote:
> Fabiano Rosas <faro...@suse.de> writes:
> 
> > Peter Xu <pet...@redhat.com> writes:
> >
> >> On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote:
> >>> On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <faro...@suse.de> wrote:
> >>> > I don't think we're discussing total CI time at this point, so the math
> >>> > doesn't really add up. We're not looking into making the CI finish
> >>> > faster. We're looking into making migration-test finish faster. That
> >>> > would reduce timeouts in CI, speed-up make check and reduce the chance
> >>> > of random race conditions* affecting other people/staging runs.
> >>> 
> >>> Right. The reason migration-test appears on my radar is because
> >>> it is very frequently the thing that shows up as "this sometimes
> >>> just fails or just times out and if you hit retry it goes away
> >>> again". That might not be migration-test's fault specifically,
> >>> because those retries tend to be certain CI configs (s390,
> >>> the i686-tci one), and I have some theories about what might be
> >>> causing it (e.g. build system runs 4 migration-tests in parallel,
> >>> which means 8 QEMU processes which is too many for the number
> >>> of host CPUs). But right now I look at CI job failures and my reaction
> >>> is "oh, it's the migration-test failing yet again" :-(
> >>> 
> >>> For some examples from this week:
> >>> 
> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7802183144
> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373  <--------[1]
> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152  <--------[2]
> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579155
> >>
> >> Ah right, the TIMEOUT is unfortunate, especially if tests can be run in
> >> parallel.  It indeed sounds like no good way to finally solve.. I don't
> >> also see how speeding up / reducing tests in migration test would help, as
> >> that's (from some degree..) is the same as tuning the timeout value bigger.
> >> When the tests are less it'll fit into 480s window, but maybe it's too
> >> quick now we wonder whether we should shrink it to e.g. 90s, but then it
> >> can timeout again when on a busy host with less capability of concurrency.
> >>
> >> But indeed there're two ERRORs ([1,2] above)..  I collected some more info
> >> here before the log expires:
> >>
> >> =================================8<================================
> >>
> >> *** /i386/migration/multifd/tcp/plain/cancel, qtest-i386 on s390 host
> >>
> >> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373
> >>
> >> 101/953 qemu:qtest+qtest-i386 / qtest-i386/migration-test                  
> >>        ERROR          144.32s   killed by signal 6 SIGABRT
> >>>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon 
> >>>>> G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh
> >>>>>  
> >>>>> PYTHON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/pyvenv/bin/python3
> >>>>>  QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=144 
> >>>>> QTEST_QEMU_BINARY=./qemu-system-i386 
> >>>>> /home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/tests/qtest/migration-test
> >>>>>  --tap -k
> >> ――――――――――――――――――――――――――――――――――――― ✀  
> >> ―――――――――――――――――――――――――――――――――――――
> >> stderr:
> >> warning: fd: migration to a file is deprecated. Use file: instead.
> >> warning: fd: migration to a file is deprecated. Use file: instead.
> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 
> >> 11 (Segmentation fault) (core dumped)
> >> (test program exited with status code -6)
> >> TAP parsing error: Too few tests run (expected 53, got 39)
> >> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
> >>
> >> # Start of plain tests
> >> # Running /i386/migration/multifd/tcp/plain/cancel
> >> # Using machine type: pc-i440fx-9.2
> >> # starting QEMU: exec ./qemu-system-i386 -qtest 
> >> unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev 
> >> socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon 
> >> chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg 
> >> -machine pc-i440fx-9.2, -name source,debug-threads=on -m 150M -serial 
> >> file:/tmp/migration-test-4112T2/src_serial -drive 
> >> if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device 
> >> ide-hd,drive=d0,secs=1,cyls=1,heads=1    2>/dev/null -accel qtest
> >> # starting QEMU: exec ./qemu-system-i386 -qtest 
> >> unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev 
> >> socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon 
> >> chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg 
> >> -machine pc-i440fx-9.2, -name target,debug-threads=on -m 150M -serial 
> >> file:/tmp/migration-test-4112T2/dest_serial -incoming defer -drive 
> >> if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device 
> >> ide-hd,drive=d0,secs=1,cyls=1,heads=1    2>/dev/null -accel qtest
> >> ----------------------------------- stderr 
> >> -----------------------------------
> >> warning: fd: migration to a file is deprecated. Use file: instead.
> >> warning: fd: migration to a file is deprecated. Use file: instead.
> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 
> >> 11 (Segmentation fault) (core dumped)
> >>
> >> *** /ppc64/migration/multifd/tcp/plain/cancel, qtest-ppc64 on i686 host
> >>
> >> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152
> >>
> >> 174/315 qemu:qtest+qtest-ppc64 / qtest-ppc64/migration-test                
> >>        ERROR          381.00s   killed by signal 6 SIGABRT
> >>>>> PYTHON=/builds/qemu-project/qemu/build/pyvenv/bin/python3 
> >>>>> QTEST_QEMU_IMG=./qemu-img 
> >>>>> G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh
> >>>>>  QTEST_QEMU_BINARY=./qemu-system-ppc64 MALLOC_PERTURB_=178 
> >>>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon 
> >>>>> /builds/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
> >> ――――――――――――――――――――――――――――――――――――― ✀  
> >> ―――――――――――――――――――――――――――――――――――――
> >> stderr:
> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was 
> >> non-properly terminated.
> >> warning: fd: migration to a file is deprecated. Use file: instead.
> >> warning: fd: migration to a file is deprecated. Use file: instead.
> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 
> >> 11 (Segmentation fault) (core dumped)
> >> (test program exited with status code -6)
> >> TAP parsing error: Too few tests run (expected 61, got 47)
> >> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
> >>
> >> # Start of plain tests
> >> # Running /ppc64/migration/multifd/tcp/plain/cancel
> >> # Using machine type: pseries-9.2
> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest 
> >> unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev 
> >> socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control 
> >> -display none -audio none -accel kvm -accel tcg -machine 
> >> pseries-9.2,vsmt=8 -name source,debug-threads=on -m 256M -serial 
> >> file:/tmp/migration-test-H0Z1T2/src_serial -nodefaults -machine 
> >> cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios 
> >> /tmp/migration-test-H0Z1T2/bootsect    2>/dev/null -accel qtest
> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest 
> >> unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev 
> >> socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control 
> >> -display none -audio none -accel kvm -accel tcg -machine 
> >> pseries-9.2,vsmt=8 -name target,debug-threads=on -m 256M -serial 
> >> file:/tmp/migration-test-H0Z1T2/dest_serial -incoming defer -nodefaults 
> >> -machine 
> >> cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios 
> >> /tmp/migration-test-H0Z1T2/bootsect    2>/dev/null -accel qtest
> >> ----------------------------------- stderr 
> >> -----------------------------------
> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was 
> >> non-properly terminated.
> >> warning: fd: migration to a file is deprecated. Use file: instead.
> >> warning: fd: migration to a file is deprecated. Use file: instead.
> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 
> >> 11 (Segmentation fault) (core dumped)
> >>
> >> (test program exited with status code -6)
> >> =================================8<================================
> >>
> >> So.. it's the same test (multifd/tcp/plain/cancel) that is failing on
> >> different host / arch being tested.  What is more weird is the two failures
> >> are different, the 2nd failure throw out a TLS error even though the test
> >> doesn't yet have tls involved.
> >
> > I think that's just a parallel test being cancelled prematurely, either
> > due to the crash or due to the timeout.
> >
> >>
> >> Fabiano, is this the issue you're looking at?
> >
> > Yes. I can reproduce locally by running 2 processes in parallel: 1 loop
> > with make -j$(nproc) check and another loop with tcp/plain/cancel. It
> > takes ~1h to hit. I've seen crashes with ppc64, s390 and
> > aarch64.
> >
> 
> Ok, the issue is that after commit 5ef7e26bdb ("migration/multifd: solve
> zero page causing multiple page faults"), the multifd code started using
> the rb->receivedmap bitmap, which belongs to the ram code and is
> initialized and *freed* from the ram SaveVMHandlers.
> 
> process_incoming_migration_co()        ...
>   qemu_loadvm_state()                  multifd_nocomp_recv()
>     qemu_loadvm_state_cleanup()          ramblock_recv_bitmap_set_offset()
>       rb->receivedmap = NULL               set_bit_atomic(..., 
> rb->receivedmap)
>   ...
>   migration_incoming_state_destroy()
>     multifd_recv_cleanup()
>       multifd_recv_terminate_threads(NULL)
> 
> Multifd threads are live until migration_incoming_state_destroy(), which
> is called some time later.

Thanks for the debugging.  Hmm I would expect loadvm should wait until all
ram is received somehow..

And when looking, I also found we're ambiguous on the cleanups
now.. probably a separate issue that not yet cause crashes.  I meant we
even don't have a consistent order to clean these in precopy / postcopy,
as:

We have this:

  qemu_loadvm_state
    qemu_loadvm_state_cleanup              <------------- [1]
  migration_bh_schedule(process_incoming_migration_bh, mis);
  (then in bh...)
  process_incoming_migration_bh
    migration_incoming_state_destroy       <------------- [2]

But then:
  
  postcopy_ram_listen_thread
    migration_incoming_state_destroy       <------------- [2]
    qemu_loadvm_state_cleanup              <------------- [1]

I think it makes more sense to do [2] after [1], so maybe postcopy needs
changing too..

> 
> >> Peter, do you think it'll be helpful if we temporarily mark this test as
> >> "slow" too so it's not run in CI (so we still run it ourselves when prepare
> >> migration PR, with the hope that it can reproduce)?
> >>
> >> Thanks,
> 

-- 
Peter Xu


Reply via email to