On 12/12/2013 7:20 AM, Sean Dague wrote:
Current Gate Length: 12hrs*, 41 deep (top of gate entered 12hrs ago)

It's been an *exciting* week this week. For people not paying attention, we had 2 external events which made things terrible earlier in the week.

==========================
Event 1: sphinx 1.2 complete breakage - MOSTLY RESOLVED
==========================

It turns out sphinx 1.2 + distutils (which pbr magic calls through) means total sadness. The fix for this was a requirements pin to sphinx < 1.2, and until a project has taken that pin it will fail in the gate.

It also turns out that tox installs pre-released software by default (a terrible default behavior), so you also need a tox.ini change like this - https://github.com/openstack/nova/blob/master/tox.ini#L9 - otherwise local users will install things like sphinx 1.2b3. They will also break in other ways.

Not all projects have merged this. If you are a project that hasn't, please don't send any other jobs to the gate until you do. A lot of delay was added to the gate yesterday by Glance patches being pushed to the gate before their doc jobs were done.

==========================
Event 2: apt.puppetlabs.com outage - RESOLVED
==========================

We use that apt repository to set up the devstack nodes in nodepool with puppet. We were triggering an issue with grenade where its apt-get calls were failing, because it does apt-get update once to make sure life is good. This only triggered in grenade (not other devstack runs) because we do set -o errexit aggressively. A fix in grenade to ignore these errors was merged yesterday afternoon (the purple line - http://status.openstack.org/elastic-recheck/ - you can see where it showed up).
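To illustrate why errexit made grenade the only job to trip on the outage, here is a minimal Python sketch of the two behaviours: a strict runner that aborts on any non-zero exit (roughly what set -o errexit does to a shell script) and a tolerant one that records the failure and carries on. The run helper, its tolerate_failure flag and the stand-in failing command are all hypothetical; this is not the actual grenade fix.

```python
import subprocess
import sys


def run(cmd, tolerate_failure=False):
    """Run cmd and return its exit status.

    With tolerate_failure=False a non-zero exit raises CalledProcessError,
    which is roughly the effect `set -o errexit` has on a shell script.
    With tolerate_failure=True the status is returned and the caller goes on.
    """
    result = subprocess.run(cmd)
    if result.returncode != 0 and not tolerate_failure:
        raise subprocess.CalledProcessError(result.returncode, cmd)
    return result.returncode


# Hypothetical stand-in for an `apt-get update` that fails because one
# repository is unreachable; the exit status 1 is arbitrary.
failing_cmd = [sys.executable, "-c", "import sys; sys.exit(1)"]

# Tolerant mode: the failure is visible but does not stop the run.
status = run(failing_cmd, tolerate_failure=True)
print("tolerated exit status:", status)

# Strict mode: the same failure aborts the run.
try:
    run(failing_cmd)
except subprocess.CalledProcessError as exc:
    print("strict mode aborted with exit status", exc.returncode)
```

The trade-off is the usual one: strict mode surfaces real breakage early, while tolerant mode keeps an unrelated third-party outage from failing every job.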
==========================
Top Gate Bugs
==========================

We normally do this as a list, and you can see the whole list here - http://status.openstack.org/elastic-recheck/ (now sorted by number of FAILURES in the last 2 weeks)

That being said, our biggest race bug is currently this one bug - https://bugs.launchpad.net/tempest/+bug/1253896 - and if you want to merge patches, fixing that one bug will be huge.

Basically, you can't ssh into guests that get created. That's sort of a fundamental property of a cloud. It shows up more frequently on neutron jobs, possibly due to actually testing the metadata server path. There have been many attempts at retry logic on this; we actually retry for 196 seconds to get in and only fail once we can't get in, so waiting isn't helping. It doesn't seem like the env is under that much load.

Until we resolve this, life will not be good in landing patches.

-Sean

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
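As a rough illustration of the retry behaviour described in the quoted message (keep trying for 196 seconds, fail only once the deadline passes), here is a minimal Python sketch of a retry-until-deadline loop. The wait_until helper, the 4 second interval and the FakeClock used to run it instantly are hypothetical; this is not tempest's actual ssh retry code.

```python
import time


def wait_until(probe, timeout, interval=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Call probe() until it returns True or `timeout` seconds have elapsed.

    Returns (succeeded, attempts). The clock and sleep functions are
    injectable so the loop can be exercised without really waiting.
    """
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        if probe():
            return True, attempts
        if clock() >= deadline:
            return False, attempts
        sleep(interval)


class FakeClock:
    """Simulated time source: sleeping just advances the counter."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def guest_never_reachable():
    # Hypothetical probe standing in for "can we ssh into the guest?".
    return False


fake = FakeClock()
ok, attempts = wait_until(guest_never_reachable, timeout=196, interval=4,
                          clock=fake.time, sleep=fake.sleep)
print("reachable:", ok, "attempts:", attempts, "simulated seconds:", fake.now)
```

If the guest never becomes reachable, a longer timeout only adds attempts before the same failure, which is consistent with the observation that waiting isn't helping.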
There have been a few threads [1][2] on gate failures and on the process for identifying, tracking and fixing them.
I couldn't find anything outside of the mailing list that keeps a record of this, so I started a page here [3].
Feel free to contribute so we can point people to how they can easily help in working through these faster.
[1] http://lists.openstack.org/pipermail/openstack-dev/2013-November/020280.html
[2] http://lists.openstack.org/pipermail/openstack-dev/2013-November/019931.html
[3] https://wiki.openstack.org/wiki/ElasticRecheck

--
Thanks,
Matt Riedemann