On 6/30/2016 8:44 AM, Daniel P. Berrange wrote:
A bunch of people in Nova and upstream QEMU teams are trying to investigate
a long-standing bug in live migration[1]. Unfortunately the bug is rather
non-deterministic - eg on the multinode-live-migration tempest job it has
hit 4 times in 7 days, while on multinode-full tempest job it has hit
~70 times in 7 days.

For those that don't know, the multinode-live-migration job runs only the live migration tests in Tempest, on Ubuntu 16.04 nodes, while the multinode-full job runs the live migration tests plus the normal full Tempest run, on Ubuntu 14.04 nodes. So the multinode-live-migration job *may* be a bit more stable because it's running with newer libvirt/qemu.

We've at least noticed that another live migration bug isn't showing up on the dedicated xenial live migration job:

https://bugs.launchpad.net/nova/+bug/1539271


I have a test patch which hacks nova to download & install a special QEMU
build with extra debugging output[2]. Because of the non-determinism I need
to then run the multinode-live-migration & multinode-full tempest jobs
many times to try and catch the bug.  Doing this by just entering 'recheck'
is rather tedious because you have to wait for the 1+ hour turnaround time
between each recheck.

To get around this limitation I created a chain of 10 commits [3] which just
toggled some whitespace and uploaded them all, so I can get 10 CI runs
going in parallel. This worked remarkably well - at least enough to
reproduce the more common failure of multinode-full, but not enough for
the much rarer multinode-live-migration job.
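The dummy-change trick can be sketched roughly as below. This is a hypothetical illustration, not the actual commands used: it stacks N commits that each touch a throwaway file (here `ci-trigger.txt`, an invented name), so that pushing the chain to Gerrit yields N parallel CI runs. It's demonstrated in a temporary repo; in real use you'd run it on a Nova checkout and push the chain with git-review under one topic.

```shell
# Sketch of generating a chain of dummy commits for parallel CI runs.
# File name and commit messages are illustrative only.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git -c user.email=dev@example.com -c user.name=dev \
    commit -q --allow-empty -m "init"
for i in $(seq 1 10); do
    echo "# CI trigger $i" >> ci-trigger.txt
    git add ci-trigger.txt
    git -c user.email=dev@example.com -c user.name=dev \
        commit -q -m "DNM: dummy change $i for CI debugging"
done
git rev-list --count HEAD   # 11: the init commit plus 10 dummy changes
```

On a real Gerrit repo you would then push the whole chain at once (e.g. `git review -t mig-debug`, assuming git-review is set up), and each change in the chain gets its own independent set of CI jobs.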

The ASCII art is a real treat.


I could expand this hack and upload 100 dummy changes to get more jobs
running to increase chances of hitting the multinode-live-migration
failure. Out of the 16 jobs run on every Nova change, I only care about
running 2 of them. So to get 100 runs of the 2 live migration jobs I want,
I'd be creating 1600 CI jobs in total which is not too nice for our CI
resource pool :-(

I'd really love it if there were

 1. the ability to request checking of just specific jobs eg

      "recheck gate-tempest-dsvm-multinode-full"

FWIW, people have asked for this before. I think it would be OK if there were a way to do it without changing the overall verification score, because rechecking a single job could otherwise invalidate earlier runs where multiple jobs failed.


 2. the ability to request this recheck to run multiple
    times in parallel. eg if i just repeat the 'recheck'
    command many times on the same patchset # without
    waiting for results

Anyone got any other tips for debugging highly non-deterministic
bugs like this, which only hit perhaps 1 time in 100, without wasting
huge amounts of CI resources as I'm doing right now?

No one has ever been able to reproduce these failures outside of
the gate CI infra, indeed certain CI hosting providers seem worse
affected by the bug than others, so running tempest locally is not
an option.

Good point on the node providers. I hadn't noticed that before, but it definitely looks to be hitting OVH and OSIC nodes more than any others:

http://goo.gl/f0coZb


Regards,
Daniel

[1] https://bugs.launchpad.net/nova/+bug/1524898
[2] https://review.openstack.org/#/c/335549/5
[3] https://review.openstack.org/#/q/topic:mig-debug



--

Thanks,

Matt Riedemann


__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
