On Wednesday, December 04, 2013 7:22:23 AM, Joe Gordon wrote:
TL;DR: Gate is failing 23% of the time due to bugs in nova, neutron
and tempest. We need help fixing these bugs.


Hi All,

Before going any further: we have a bug that is affecting both the gate
and stable, so it's getting top priority here. elastic-recheck currently
doesn't track unit tests because we don't expect them to fail very
often. It turns out that assessment was wrong; we now have a nova py27
unit test bug in both the trunk and stable gates.

https://bugs.launchpad.net/nova/+bug/1216851
Title: nova unit tests occasionally fail migration tests for mysql and
postgres
Hits
  FAILURE: 74
The failures appear multiple times for a single job, and some of those
are due to bad patches in the check queue. But this is being seen in
both the stable and trunk gates, so something is definitely wrong.

=======


It's time for another edition of 'Top Gate Bugs.' I am sending this
out now because, in addition to our usual gate bugs, a few new ones have
cropped up recently, and as we saw a few weeks ago it doesn't take
very many new bugs to wedge the gate.

Currently the gate has a failure rate of at least 23%! [0]

Note: this email was generated with
http://status.openstack.org/elastic-recheck/ and
'elastic-recheck-success' [1]

1) https://bugs.launchpad.net/bugs/1253896
Title: test_minimum_basic_scenario fails with SSHException: Error
reading SSH protocol banner
Projects:  neutron, nova, tempest
Hits
  FAILURE: 324
This one has been around for several weeks now, and although we have
made some attempts at fixing it, we aren't any closer to resolving it
than we were a few weeks ago.

2) https://bugs.launchpad.net/bugs/1251448
Title: BadRequest: Multiple possible networks found, use a Network ID
to be more specific.
Project: neutron
Hits
  FAILURE: 141

3) https://bugs.launchpad.net/bugs/1249065
Title: Tempest failure: tempest/scenario/test_snapshot_pattern.py
Project: nova
Hits
  FAILURE: 112
This is a bug in nova's neutron code.

4) https://bugs.launchpad.net/bugs/1250168
Title: gate-tempest-devstack-vm-neutron-large-ops is failing
Projects: neutron, nova
Hits
  FAILURE: 94
This is an old bug that was fixed but came back on December 3rd, so
this is a recent regression. It may be an infra issue.

5) https://bugs.launchpad.net/bugs/1210483
Title: ServerAddressesTestXML.test_list_server_addresses FAIL
Projects: neutron, nova
Hits
  FAILURE: 73
There have been some attempts at fixing this, but it's still around.


In addition to the existing bugs, we have some new bugs on the rise:

1) https://bugs.launchpad.net/bugs/1257626
Title: Timeout while waiting on RPC response - topic: "network", RPC
method: "allocate_for_instance" info: "<unknown>"
Project: nova
Hits
  FAILURE: 52
This is a large-ops-only bug. It has been around for at least two
weeks, but we have seen it in higher numbers starting around December
3rd. This may be an infrastructure issue, as the neutron-large-ops job
started failing more around the same time.

2) https://bugs.launchpad.net/bugs/1257641
Title: Quota exceeded for instances: Requested 1, but already used 10
of 10 instances
Projects: nova, tempest
Hits
  FAILURE: 41
Like the previous bug, this has been around for at least two weeks but
appears to be on the rise.



Raw Data: http://paste.openstack.org/show/54419/


best,
Joe


[0] failure rate = 1 - (success rate gate-tempest-dsvm-neutron) *
(success rate ...) * ..., where each job's success rate is 1 minus its
failure rate. Per-job failure rates (in percent):

gate-tempest-dsvm-neutron = 0.00
gate-tempest-dsvm-neutron-large-ops = 11.11
gate-tempest-dsvm-full = 11.11
gate-tempest-dsvm-large-ops = 4.55
gate-tempest-dsvm-postgres-full = 10.00
gate-grenade-dsvm = 0.00

(I hope I got the math right here)
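
A minimal sketch of the calculation in python, assuming the numbers
above are per-job failure rates in percent (the set of jobs used for
the 23% headline figure may have been slightly different):

    # Combined failure rate of the gate pipeline, assuming the figures in
    # [0] are per-job failure rates expressed as percentages.
    job_failure_pct = {
        'gate-tempest-dsvm-neutron': 0.00,
        'gate-tempest-dsvm-neutron-large-ops': 11.11,
        'gate-tempest-dsvm-full': 11.11,
        'gate-tempest-dsvm-large-ops': 4.55,
        'gate-tempest-dsvm-postgres-full': 10.00,
        'gate-grenade-dsvm': 0.00,
    }

    # A change only merges if every job passes, so the chance that at least
    # one job fails is 1 minus the product of the per-job success rates.
    success = 1.0
    for pct in job_failure_pct.values():
        success *= 1.0 - pct / 100.0

    print('combined gate failure rate: %.1f%%' % ((1.0 - success) * 100))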

[1]
http://git.openstack.org/cgit/openstack-infra/elastic-recheck/tree/elastic_recheck/cmd/check_success.py


_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Let's add bug 1257644 [1] to the list. I'm pretty sure this is due to some recent code [2][3] in the nova libvirt driver that is automatically disabling the host when the libvirt connection drops.

Joe said there was a known issue with libvirt connection failures, so
this could be duped against that, but I'm not sure where/what that one
is - maybe bug 1254872 [4]?

Unless I just don't understand the code, there is some funny logic
going on in the libvirt driver when it's automatically disabling a
host, which I've documented in bug 1257644. It would help to have some
libvirt-minded people, or the authors/approvers of those patches, take
a look at that.

Also, does anyone know if libvirt will pass a 'reason' string to the _close_callback function? I was digging through the libvirt code this morning but couldn't figure out where the callback is actually called and with what parameters. The code in nova seemed to just be based on the patch that danpb had in libvirt [5].
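
From poking at the python bindings, my guess is that the callback gets
an integer reason code (one of the VIR_CONNECT_CLOSE_REASON_* values)
rather than a string, but I could be wrong. A minimal standalone sketch
(not the nova code) of how one could check:

    # Standalone sketch (not nova code): register a close callback directly
    # with the libvirt python bindings and print what 'reason' turns out to be.
    import libvirt

    REASONS = {
        libvirt.VIR_CONNECT_CLOSE_REASON_ERROR: 'error',
        libvirt.VIR_CONNECT_CLOSE_REASON_EOF: 'eof',
        libvirt.VIR_CONNECT_CLOSE_REASON_KEEPALIVE: 'keepalive',
        libvirt.VIR_CONNECT_CLOSE_REASON_CLIENT: 'client',
    }

    def on_close(conn, reason, opaque):
        # 'reason' appears to be an int, not a free-form string.
        print('connection closed: reason=%s (%s)'
              % (reason, REASONS.get(reason, 'unknown')))

    libvirt.virEventRegisterDefaultImpl()
    conn = libvirt.open('qemu:///system')
    conn.registerCloseCallback(on_close, None)

    # A libvirt event loop has to be running for the callback to fire:
    # while True:
    #     libvirt.virEventRunDefaultImpl()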

This bug is going to raise a bigger long-term question about the need for having a new column in the Service table for indicating whether or not the service was automatically disabled, as Phil Day points out in bug 1250049 [6]. That way the ComputeFilter in the scheduler could handle that case a bit differently, at least from a logging/serviceability standpoint, e.g. info/warning level message vs debug.
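
Roughly what I have in mind, with a made-up 'disabled_automatically'
attribute purely for illustration (nothing like it exists today):

    # Hypothetical sketch for the idea in bug 1250049: a flag on the Service
    # record saying the disable was automatic, so ComputeFilter can log the
    # two cases at different levels.
    import logging

    LOG = logging.getLogger(__name__)

    def host_passes(service):
        if not service['disabled']:
            return True
        if service.get('disabled_automatically'):
            # Auto-disabled (e.g. the libvirt connection dropped); worth
            # surfacing to operators at warning level.
            LOG.warning('Host %s was automatically disabled: %s',
                        service['host'], service.get('disabled_reason'))
        else:
            # The operator disabled the host on purpose; debug is enough.
            LOG.debug('Host %s is disabled by the operator', service['host'])
        return False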

[1] https://bugs.launchpad.net/nova/+bug/1257644
[2] https://review.openstack.org/#/c/52189/
[3] https://review.openstack.org/#/c/56224/
[4] https://bugs.launchpad.net/nova/+bug/1254872
[5] http://www.redhat.com/archives/libvir-list/2012-July/msg01675.html
[6] https://bugs.launchpad.net/nova/+bug/1250049

--

Thanks,

Matt Riedemann

