On Wednesday, December 04, 2013 7:22:23 AM, Joe Gordon wrote:
TL;DR: The gate is failing 23% of the time due to bugs in nova, neutron, and tempest. We need help fixing these bugs.

Hi All,

Before going any further, we have a bug that is affecting the gate and stable, so it's getting top priority here. elastic-recheck currently doesn't track unit tests because we don't expect them to fail very often. It turns out that assessment was wrong: we now have a nova py27 unit test bug in both the trunk and stable gates.

https://bugs.launchpad.net/nova/+bug/1216851
Title: nova unit tests occasionally fail migration tests for mysql and postgres
Hits FAILURE: 74

The failures appear multiple times for a single job, and some of those are due to bad patches in the check queue. But this is being seen in both the stable and trunk gates, so something is definitely wrong.

=======

It's time for another edition of 'Top Gate Bugs.' I am sending this out now because, in addition to our usual gate bugs, a few new ones have cropped up recently, and as we saw a few weeks ago it doesn't take very many new bugs to wedge the gate. Currently the gate has a failure rate of at least 23%! [0]

Note: this email was generated with http://status.openstack.org/elastic-recheck/ and 'elastic-recheck-success' [1]

1) https://bugs.launchpad.net/bugs/1253896
Title: test_minimum_basic_scenario fails with SSHException: Error reading SSH protocol banner
Projects: neutron, nova, tempest
Hits FAILURE: 324
This one has been around for several weeks now, and although we have made some attempts at fixing it, we aren't any closer to resolving it than we were a few weeks ago.

2) https://bugs.launchpad.net/bugs/1251448
Title: BadRequest: Multiple possible networks found, use a Network ID to be more specific.
Project: neutron
Hits FAILURE: 141

3) https://bugs.launchpad.net/bugs/1249065
Title: Tempest failure: tempest/scenario/test_snapshot_pattern.py
Project: nova
Hits FAILURE: 112
This is a bug in nova's neutron code.

4) https://bugs.launchpad.net/bugs/1250168
Title: gate-tempest-devstack-vm-neutron-large-ops is failing
Projects: neutron, nova
Hits FAILURE: 94
This is an old bug that was fixed but came back on December 3rd, so it is a recent regression. This may be an infra issue.

5) https://bugs.launchpad.net/bugs/1210483
Title: ServerAddressesTestXML.test_list_server_addresses FAIL
Projects: neutron, nova
Hits FAILURE: 73
This has had some attempts made at fixing it, but it's still around.

In addition to the existing bugs, we have some new bugs on the rise:

1) https://bugs.launchpad.net/bugs/1257626
Title: Timeout while waiting on RPC response - topic: "network", RPC method: "allocate_for_instance" info: "<unknown>"
Project: nova
Hits FAILURE: 52
large-ops only bug. This has been around for at least two weeks, but we have seen it in higher numbers starting around December 3rd. This may be an infrastructure issue, as the neutron-large-ops job started failing more around the same time.

2) https://bugs.launchpad.net/bugs/1257641
Title: Quota exceeded for instances: Requested 1, but already used 10 of 10 instances
Projects: nova, tempest
Hits FAILURE: 41
Like the previous bug, this has been around for at least two weeks but appears to be on the rise.

Raw Data: http://paste.openstack.org/show/54419/

best,
Joe

[0] failure rate = 1 - (success rate of gate-tempest-dsvm-neutron) * (success rate ...) * ...
gate-tempest-dsvm-neutron = 0.00
gate-tempest-dsvm-neutron-large-ops = 11.11
gate-tempest-dsvm-full = 11.11
gate-tempest-dsvm-large-ops = 4.55
gate-tempest-dsvm-postgres-full = 10.00
gate-grenade-dsvm = 0.00
(I hope I got the math right here)

[1] http://git.openstack.org/cgit/openstack-infra/elastic-recheck/tree/elastic_recheck/cmd/check_success.py
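
As a side note on the [0] calculation quoted above, here is a minimal sketch of how the combined failure rate can be computed from those per-job numbers. Treating the listed percentages as per-job failure rates, and the particular set of jobs included, are assumptions on my part, so the output is only illustrative and may not reproduce the quoted 23% exactly.

# Sketch: combine per-job failure rates into an overall gate failure rate.
# The job names and percentages are taken from the list above; whether these
# are exactly the inputs elastic-recheck-success used is an assumption.
job_failure_pct = {
    'gate-tempest-dsvm-neutron': 0.00,
    'gate-tempest-dsvm-neutron-large-ops': 11.11,
    'gate-tempest-dsvm-full': 11.11,
    'gate-tempest-dsvm-large-ops': 4.55,
    'gate-tempest-dsvm-postgres-full': 10.00,
    'gate-grenade-dsvm': 0.00,
}

# A gate run fails if any one of its jobs fails, so the combined failure
# rate is one minus the product of the per-job success rates.
combined_success = 1.0
for pct in job_failure_pct.values():
    combined_success *= (1.0 - pct / 100.0)

print('combined gate failure rate: %.1f%%' % ((1.0 - combined_success) * 100))
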
Let's add bug 1257644 [1] to the list. I'm pretty sure this is due to some recent code [2][3] in the nova libvirt driver that is automatically disabling the host when the libvirt connection drops.
Joe said there was a known issue with libvirt connection failures, so this could be marked as a duplicate of that, but I'm not sure where/what that one is - maybe bug 1254872 [4]?
Unless I just don't understand the code, there is some funny logic going on in the libvirt driver when it automatically disables a host, which I've documented in bug 1257644. It would help to have some libvirt-minded people, or the authors/approvers of those patches, take a look at that.
Also, does anyone know if libvirt will pass a 'reason' string to the _close_callback function? I was digging through the libvirt code this morning but couldn't figure out where the callback is actually called and with what parameters. The code in nova seemed to just be based on the patch that danpb had in libvirt [5].
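
For what it's worth, my reading of the libvirt-python bindings is that the close callback receives an integer reason code (one of the VIR_CONNECT_CLOSE_REASON_* constants) rather than a free-form string, so any human-readable text would have to be built on the nova side. Here is a minimal standalone sketch of how I understand the callback to work; this is not the nova driver code, and the mapping of reason codes to text is my own.

import libvirt

# An event loop implementation has to be registered (and run) for the close
# callback to actually be dispatched.
libvirt.virEventRegisterDefaultImpl()

# My own mapping of the close-reason constants to readable text.
CLOSE_REASONS = {
    libvirt.VIR_CONNECT_CLOSE_REASON_ERROR: 'I/O error',
    libvirt.VIR_CONNECT_CLOSE_REASON_EOF: 'end of stream from the server',
    libvirt.VIR_CONNECT_CLOSE_REASON_KEEPALIVE: 'keepalive timer expired',
    libvirt.VIR_CONNECT_CLOSE_REASON_CLIENT: 'client requested the close',
}


def close_callback(conn, reason, opaque):
    # 'reason' arrives as an int enum value, not a string.  A hook like this
    # is also, as I understand it, where the patches above mark the compute
    # host disabled when the connection drops.
    print('libvirt connection closed: %s'
          % CLOSE_REASONS.get(reason, 'unknown (%s)' % reason))


conn = libvirt.openReadOnly('qemu:///system')
conn.registerCloseCallback(close_callback, None)

# Keep the event loop running so the callback can fire when the connection
# is lost.
while True:
    libvirt.virEventRunDefaultImpl()
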
This bug is going to raise a bigger long-term question about the need for a new column in the Service table to indicate whether or not the service was automatically disabled, as Phil Day points out in bug 1250049 [6]. That way the ComputeFilter in the scheduler could handle that case a bit differently, at least from a logging/serviceability standpoint, e.g. an info/warning-level message vs. debug.
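
To make that logging/serviceability point concrete, here is a rough, purely hypothetical sketch of what a scheduler-side check might do if such a column existed. The 'disabled_automatically' field, the dict-shaped service record, and the host_passes() function are all made up for illustration; this is not the existing ComputeFilter code.

import logging

LOG = logging.getLogger(__name__)


def host_passes(service):
    """Hypothetical check: should this compute service receive new builds?

    'service' is assumed to be a dict-like Service record that has the
    existing 'disabled' flag plus the proposed 'disabled_automatically'
    column discussed above.
    """
    if not service['disabled']:
        return True

    if service.get('disabled_automatically'):
        # The driver took the host out of rotation on its own (e.g. the
        # libvirt connection dropped) -- something an operator probably
        # wants surfaced at a higher log level.
        LOG.warning('Compute service on host %s was automatically disabled',
                    service['host'])
    else:
        # An operator disabled the service on purpose, which is routine.
        LOG.debug('Compute service on host %s is administratively disabled',
                  service['host'])
    return False
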
[1] https://bugs.launchpad.net/nova/+bug/1257644
[2] https://review.openstack.org/#/c/52189/
[3] https://review.openstack.org/#/c/56224/
[4] https://bugs.launchpad.net/nova/+bug/1254872
[5] http://www.redhat.com/archives/libvir-list/2012-July/msg01675.html
[6] https://bugs.launchpad.net/nova/+bug/1250049

-- 
Thanks,

Matt Riedemann

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev