On Wednesday, December 04, 2013 7:22:23 AM, Joe Gordon wrote:
TL;DR: The gate is failing 23% of the time due to bugs in nova, neutron, and tempest. We need help fixing these bugs.

Hi All,

Before going any further, we have a bug that is affecting the gate and stable, so it's getting top priority here. elastic-recheck currently doesn't track unit tests because we don't expect them to fail very often. It turns out that assessment was wrong: we now have a nova py27 unit test bug in both the trunk and stable gates.

https://bugs.launchpad.net/nova/+bug/1216851
Title: nova unit tests occasionally fail migration tests for mysql and postgres
Hits FAILURE: 74

The failures appear multiple times for a single job, and some of those are due to bad patches in the check queue. But this is being seen in both the stable and trunk gates, so something is definitely wrong.

=======

It's time for another edition of 'Top Gate Bugs.' I am sending this out now because, in addition to our usual gate bugs, a few new ones have cropped up recently, and as we saw a few weeks ago it doesn't take very many new bugs to wedge the gate. Currently the gate has a failure rate of at least 23%! [0]

Note: this email was generated with http://status.openstack.org/elastic-recheck/ and 'elastic-recheck-success' [1]

1) https://bugs.launchpad.net/bugs/1253896
Title: test_minimum_basic_scenario fails with SSHException: Error reading SSH protocol banner
Projects: neutron, nova, tempest
Hits FAILURE: 324
This one has been around for several weeks now, and although we have made some attempts at fixing it, we aren't any closer to resolving it than we were a few weeks ago.

2) https://bugs.launchpad.net/bugs/1251448
Title: BadRequest: Multiple possible networks found, use a Network ID to be more specific.
Project: neutron
Hits FAILURE: 141

3) https://bugs.launchpad.net/bugs/1249065
Title: Tempest failure: tempest/scenario/test_snapshot_pattern.py
Project: nova
Hits FAILURE: 112
This is a bug in nova's neutron code.

4) https://bugs.launchpad.net/bugs/1250168
Title: gate-tempest-devstack-vm-neutron-large-ops is failing
Projects: neutron, nova
Hits FAILURE: 94
This is an old bug that was fixed but came back on December 3rd, so it is a recent regression. This may be an infra issue.

5) https://bugs.launchpad.net/bugs/1210483
Title: ServerAddressesTestXML.test_list_server_addresses FAIL
Projects: neutron, nova
Hits FAILURE: 73
This has had some attempts made at fixing it, but it's still around.

In addition to the existing bugs, we have some new bugs on the rise:

1) https://bugs.launchpad.net/bugs/1257626
Title: Timeout while waiting on RPC response - topic: "network", RPC method: "allocate_for_instance" info: "<unknown>"
Project: nova
Hits FAILURE: 52
large-ops only bug. This has been around for at least two weeks, but we have seen it in higher numbers starting around December 3rd. This may be an infrastructure issue, as the neutron-large-ops job started failing more around the same time.

2) https://bugs.launchpad.net/bugs/1257641
Title: Quota exceeded for instances: Requested 1, but already used 10 of 10 instances
Projects: nova, tempest
Hits FAILURE: 41
Like the previous bug, this has been around for at least two weeks but appears to be on the rise.

Raw Data: http://paste.openstack.org/show/54419/

best,
Joe

[0] failure rate = 1 - (success rate of gate-tempest-dsvm-neutron) * (success rate ...) * ...
gate-tempest-dsvm-neutron = 0.00
gate-tempest-dsvm-neutron-large-ops = 11.11
gate-tempest-dsvm-full = 11.11
gate-tempest-dsvm-large-ops = 4.55
gate-tempest-dsvm-postgres-full = 10.00
gate-grenade-dsvm = 0.00
(I hope I got the math right here)

[1] http://git.openstack.org/cgit/openstack-infra/elastic-recheck/tree/elastic_recheck/cmd/check_success.py
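
As a side note on the [0] calculation quoted above, here is a minimal sketch of how the combined failure rate can be computed from those per-job numbers. Treating the listed percentages as per-job failure rates, and the particular set of jobs included, are assumptions on my part, so the output is only illustrative and may not reproduce the quoted 23% exactly.

# Sketch: combine per-job failure rates into an overall gate failure rate.
# The job names and percentages are taken from the list above; whether these
# are exactly the inputs elastic-recheck-success used is an assumption.
job_failure_pct = {
    'gate-tempest-dsvm-neutron': 0.00,
    'gate-tempest-dsvm-neutron-large-ops': 11.11,
    'gate-tempest-dsvm-full': 11.11,
    'gate-tempest-dsvm-large-ops': 4.55,
    'gate-tempest-dsvm-postgres-full': 10.00,
    'gate-grenade-dsvm': 0.00,
}

# A gate run fails if any one of its jobs fails, so the combined failure
# rate is one minus the product of the per-job success rates.
combined_success = 1.0
for pct in job_failure_pct.values():
    combined_success *= (1.0 - pct / 100.0)

print('combined gate failure rate: %.1f%%' % ((1.0 - combined_success) * 100))
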
Let's add bug 1257644 [1] to the list. I'm pretty sure this is due to some recent code [2][3] in the nova libvirt driver that is automatically disabling the host when the libvirt connection drops.
Joe said there was a known issue with libvirt connection failures, so this could be marked as a duplicate of that, but I'm not sure where/what that one is - maybe bug 1254872 [4]?
Unless I just don't understand the code, there is some funny logic going on in the libvirt driver when it automatically disables a host, which I've documented in bug 1257644. It would help to have some libvirt-minded people, or the authors/approvers of those patches, take a look at that.
Also, does anyone know if libvirt will pass a 'reason' string to the _close_callback function? I was digging through the libvirt code this morning but couldn't figure out where the callback is actually called and with what parameters. The code in nova seemed to just be based on the patch that danpb had in libvirt [5].
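
For what it's worth, my reading of the libvirt-python bindings is that the close callback receives an integer reason code (one of the VIR_CONNECT_CLOSE_REASON_* constants) rather than a free-form string, so any human-readable text would have to be built on the nova side. Here is a minimal standalone sketch of how I understand the callback to work; this is not the nova driver code, and the mapping of reason codes to text is my own.

import libvirt

# An event loop implementation has to be registered (and run) for the close
# callback to actually be dispatched.
libvirt.virEventRegisterDefaultImpl()

# My own mapping of the close-reason constants to readable text.
CLOSE_REASONS = {
    libvirt.VIR_CONNECT_CLOSE_REASON_ERROR: 'I/O error',
    libvirt.VIR_CONNECT_CLOSE_REASON_EOF: 'end of stream from the server',
    libvirt.VIR_CONNECT_CLOSE_REASON_KEEPALIVE: 'keepalive timer expired',
    libvirt.VIR_CONNECT_CLOSE_REASON_CLIENT: 'client requested the close',
}


def close_callback(conn, reason, opaque):
    # 'reason' arrives as an int enum value, not a string.  A hook like this
    # is also, as I understand it, where the patches above mark the compute
    # host disabled when the connection drops.
    print('libvirt connection closed: %s'
          % CLOSE_REASONS.get(reason, 'unknown (%s)' % reason))


conn = libvirt.openReadOnly('qemu:///system')
conn.registerCloseCallback(close_callback, None)

# Keep the event loop running so the callback can fire when the connection
# is lost.
while True:
    libvirt.virEventRunDefaultImpl()
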
This bug is going to raise a bigger long-term question about the need for a new column in the Service table to indicate whether or not the service was automatically disabled, as Phil Day points out in bug 1250049 [6]. That way the ComputeFilter in the scheduler could handle that case a bit differently, at least from a logging/serviceability standpoint, e.g. an info/warning-level message vs. debug.
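
To make that logging/serviceability point concrete, here is a rough, purely hypothetical sketch of what a scheduler-side check might do if such a column existed. The 'disabled_automatically' field, the dict-shaped service record, and the host_passes() function are all made up for illustration; this is not the existing ComputeFilter code.

import logging

LOG = logging.getLogger(__name__)


def host_passes(service):
    """Hypothetical check: should this compute service receive new builds?

    'service' is assumed to be a dict-like Service record that has the
    existing 'disabled' flag plus the proposed 'disabled_automatically'
    column discussed above.
    """
    if not service['disabled']:
        return True

    if service.get('disabled_automatically'):
        # The driver took the host out of rotation on its own (e.g. the
        # libvirt connection dropped) -- something an operator probably
        # wants surfaced at a higher log level.
        LOG.warning('Compute service on host %s was automatically disabled',
                    service['host'])
    else:
        # An operator disabled the service on purpose, which is routine.
        LOG.debug('Compute service on host %s is administratively disabled',
                  service['host'])
    return False
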
[1] https://bugs.launchpad.net/nova/+bug/1257644
[2] https://review.openstack.org/#/c/52189/
[3] https://review.openstack.org/#/c/56224/
[4] https://bugs.launchpad.net/nova/+bug/1254872
[5] http://www.redhat.com/archives/libvir-list/2012-July/msg01675.html
[6] https://bugs.launchpad.net/nova/+bug/1250049

-- 
Thanks,

Matt Riedemann

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev