On Mon, Aug 7, 2017 at 2:52 AM, Jakub Libosvar <jlibo...@redhat.com> wrote:
> Hi all,
>
> as per grafana [1] the functional job is broken. Looking at logstash [2]
> it started happening consistently since 2017-08-03 16:27. I didn't find
> any particular patch in Neutron that could cause it.
>
> The culprit is that ovsdb starts misbehaving [3] and then we retry calls
> indefinitely. We still use 2.5.2 openvswitch as we had before. I opened
> a bug [4] and started investigation, I'll update my findings there.
>
> I think at this point there is no reason to run "recheck" on your patches.
>
> Thanks,
> Jakub
>
> [1] http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=7&fullscreen
> [2] http://bit.ly/2vdKMwy
> [3] http://logs.openstack.org/14/488914/8/check/gate-neutron-dsvm-functional-ubuntu-xenial/75d7482/logs/openvswitch/ovsdb-server.txt.gz
> [4] https://bugs.launchpad.net/neutron/+bug/1709032

Considering all the instability we have seen in the job lately (this bug being the latest hit, but we also have bug https://bugs.launchpad.net/neutron/+bug/1707933), the release being close, and no significant resources available for digging into the issue, I propose to temporarily disable the job: https://review.openstack.org/#/c/491548/. I also suggest that our mighty leadership raise awareness of the issue and rally the troops to get it solved.

(to reply to Kevin's request in IRC)

To recap what happened with the timeout bug, https://bugs.launchpad.net/neutron/+bug/1707933: it popped up about a month ago in master, but it hits the Ocata branch too (so it's either a recent backport or some external dependency). What happens is that one of the test workers (almost always one running a FirewallTestCase test case) dies in the middle of the run. You can see a 'Killed' message in the console log, and most of the time you can also see the job taking ~2h and the last test worker dying in the 'inprogress' state.

The first hypothesis was that some (other?) test case calls execute(['kill', ...]) with the worker PID. To check that, Jakub proposed https://review.openstack.org/#/c/487065/ and rechecked for a while until the bug was triggered in the gate. The collected log suggested that kill was NOT called with the worker PID.

The next step could be catching all os.kill() calls in all functional tests and logging their arguments somewhere (with call stacks). We were thinking of mocking os.kill, replacing it with a function that would log the call and then pass it to the original implementation, but we haven't had time for that so far.

Regards,
Ihar

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev