Joe,

What about running benchmarks (with a small load) for all major operations
(like snapshotting, booting/deleting, ..) on every nova patch? It could catch
a lot of issues like this one.
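Something like the following rough sketch is what I have in mind (purely
hypothetical - the names, baselines and stand-in callables are made up, and
a real job would wrap nova API calls via python-novaclient against a
devstack node):

    import time

    def benchmark(name, operation, iterations=5, baseline=None,
                  tolerance=1.5):
        # Run the operation a few times; report mean wall-clock time.
        samples = []
        for _ in range(iterations):
            start = time.time()
            operation()
            samples.append(time.time() - start)
        mean = sum(samples) / len(samples)
        print("%s: mean=%.2fs max=%.2fs" % (name, mean, max(samples)))
        # Fail the job if the patch is much slower than the baseline
        # recorded from master.
        if baseline is not None and mean > baseline * tolerance:
            raise AssertionError("%s regressed: %.2fs vs baseline %.2fs"
                                 % (name, mean, baseline))

    # Stand-ins so the sketch runs; real callables would boot, snapshot
    # and delete instances.
    benchmark("boot_delete", lambda: time.sleep(0.01), baseline=0.05)
    benchmark("snapshot", lambda: time.sleep(0.01), baseline=0.05)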
Best regards,
Boris Pavlovic

On Wed, Jul 9, 2014 at 1:56 AM, Michael Still <mi...@stillhq.com> wrote:
> The associated bug says this is probably a qemu bug, so I think we
> should rephrase that to "we need to start thinking about how to make
> sure upstream changes don't break nova".
>
> Michael
>
> On Wed, Jul 9, 2014 at 7:50 AM, Joe Gordon <joe.gord...@gmail.com> wrote:
> > On Thu, Jun 26, 2014 at 4:12 AM, Daniel P. Berrange
> > <berra...@redhat.com> wrote:
> >> On Thu, Jun 26, 2014 at 07:00:32AM -0400, Sean Dague wrote:
> >> > While the Trusty transition was mostly uneventful, it has exposed a
> >> > particular issue in libvirt, which is now generating a ~25% failure
> >> > rate on most tempest jobs.
> >> >
> >> > As can be seen here -
> >> >
> >> > https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L294-L297
> >> >
> >> > ... the libvirt live_snapshot code is something our test pipeline
> >> > has never exercised before, because our libvirt was never new enough
> >> > to take that path.
> >> >
> >> > Right now it's exploding, a lot -
> >> > https://bugs.launchpad.net/nova/+bug/1334398
> >> >
> >> > Snapshotting is used in Tempest to create images for testing, so the
> >> > image setup tests do a decent number of snapshots. If I had to take
> >> > a completely *wild guess*, it's that libvirt can't do 2
> >> > live_snapshots at the same time. It's probably something most people
> >> > haven't hit; the wild guess is based on other libvirt issues we've
> >> > hit that other people haven't, and they are basically always
> >> > triggered by parallel operations.
> >> >
> >> > My 'stop the bleeding' suggested fix is this -
> >> > https://review.openstack.org/#/c/102643/ - which effectively
> >> > disables this code path for now. Then we can get some libvirt
> >> > experts engaged to help figure out the right long-term fix.
> >>
> >> Yes, this is a sensible pragmatic workaround for the short term until
> >> we diagnose the root cause & fix it.
> >>
> >> > I think there are a couple:
> >> >
> >> > 1) See if newer libvirt fixes this (1.2.5 just came out), and if so,
> >> > mandate some known working version. This would take a fair amount of
> >> > work to be able to test a non-packaged libvirt in our pipeline. We'd
> >> > need volunteers for that.
> >> >
> >> > 2) Lock snapshot operations in nova-compute, so that we can only do
> >> > 1 at a time. Hopefully it's just 2 concurrent snapshot operations
> >> > that are the issue, not any other libvirt op during a snapshot, so
> >> > serializing snapshot ops in n-compute could put the kid gloves on
> >> > libvirt and keep it from breaking here. This also needs volunteers,
> >> > as we're going to be playing a game of progressive serialization
> >> > until we reach a point where the failures go away.
> >> >
> >> > 3) Roll back to precise. I put this idea here for completeness, but
> >> > I think it's a terrible choice. This is one isolated, previously
> >> > untested (by us) code path. We can't stay on libvirt 0.9.6 forever,
> >> > so we actually need to fix this for real (be it in nova's use of
> >> > libvirt, or in libvirt itself).
> >>
> >> Yep, since we *never* tested this code path in the gate before,
> >> rolling back to precise would not even really be a fix for the
> >> problem. It would merely mean we're not testing the code path again,
> >> which is really akin to sticking our head in the sand.
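Regarding option 2 above - a minimal sketch of what that serialization
could look like, assuming a module-level lock in the libvirt driver (names
are hypothetical, and nova would more likely use its existing synchronized
decorators than a bare threading lock):

    import threading

    # One lock per nova-compute process: at most one snapshot at a time,
    # so libvirt never sees two concurrent live_snapshot calls.
    _SNAPSHOT_LOCK = threading.Lock()

    def snapshot(instance, image_id):
        # If failures persist, "progressive serialization" would mean
        # widening this lock to cover other libvirt calls as well.
        with _SNAPSHOT_LOCK:
            _do_live_snapshot(instance, image_id)

    def _do_live_snapshot(instance, image_id):
        # Placeholder for the real libvirt live-snapshot logic.
        print("snapshotting %s into %s" % (instance, image_id))

    snapshot("instance-0001", "image-abc")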
> >>
> >> > But for right now, we should stop the bleeding, so that nova/libvirt
> >> > isn't blocking everyone else from merging code.
> >>
> >> Agreed, we should merge the hack and treat the bug as a release
> >> blocker to be resolved prior to Juno GA.
> >
> > How can we prevent libvirt issues like this from landing in trunk in
> > the first place? If we don't figure out a way to prevent that, I fear
> > we will keep repeating this same pattern of failure.
> >
> >> Regards,
> >> Daniel
>
> --
> Rackspace Australia
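For context, the version gate that the "stop the bleeding" patch works
around has roughly this shape (a simplified paraphrase with illustrative
constants and names, not the exact driver.py code):

    # Live snapshot is only attempted when both libvirt and qemu are new
    # enough; otherwise nova falls back to the old cold-snapshot path.
    MIN_LIBVIRT_LIVESNAPSHOT_VERSION = (1, 0, 0)
    MIN_QEMU_LIVESNAPSHOT_VERSION = (1, 3, 0)

    def can_live_snapshot(libvirt_version, qemu_version):
        return (libvirt_version >= MIN_LIBVIRT_LIVESNAPSHOT_VERSION and
                qemu_version >= MIN_QEMU_LIVESNAPSHOT_VERSION)

    def snapshot_path(libvirt_version, qemu_version):
        if can_live_snapshot(libvirt_version, qemu_version):
            return "live"  # new path, first exercised on Trusty
        return "cold"      # old path, the only one tested on precise

    # On precise the cold path was always taken; on Trusty the live path
    # opened up, exposing bug 1334398.
    assert snapshot_path((0, 9, 6), (1, 0, 0)) == "cold"
    assert snapshot_path((1, 2, 2), (2, 0, 0)) == "live"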