+2

Sent from my really tiny device...
> On Nov 25, 2013, at 5:02 AM, "Davanum Srinivas" <dava...@gmail.com> wrote:
>
> Many thanks to everyone who helped with the many fixes. Kudos to
> Joe/Clark for spearheading the effort!
>
> -- dims
>
>> On Mon, Nov 25, 2013 at 12:00 AM, Joe Gordon <joe.gord...@gmail.com> wrote:
>> Hi All,
>>
>> TL;DR: Last week the gate got wedged on nondeterministic failures.
>> Unwedging the gate required drastic actions to fix bugs.
>>
>> Starting on November 15th, gate jobs got progressively less stable,
>> with not enough attention given to fixing the issues, until the gate
>> was almost fully wedged. No single bug caused this; it was a
>> collection of bugs that got us here. The gate protects us from code
>> that fails 100% of the time, but if a patch fails 10% of the time it
>> can slip through. Add a few of these bugs together and the gate ends
>> up fully wedged, and fixing it without circumventing the gate
>> (something we never want to do) is very hard. It took just two new
>> nondeterministic bugs to take us from a gate that mostly worked to a
>> gate that was almost fully wedged. Last week we found out Jeremy
>> Stanley (fungi) was right when he said, "nondeterministic failures
>> breed more nondeterministic failures, because people are so used to
>> having to reverify their patches to get them to merge that they are
>> doing so even when it's their patch which is introducing a
>> nondeterministic bug."
>>
>> Side note: this is not the first time we have wedged the gate. The
>> first time was around September 26th, right when we were cutting
>> Havana release candidates. In response we wrote elastic-recheck
>> (http://status.openstack.org/elastic-recheck/) to better track which
>> bugs we were seeing.
>>
>> Gate stability according to Graphite:
>> http://paste.openstack.org/show/53765/ (the links are huge because
>> they encode entire queries, so they are included as a pastebin).
>>
>> After sending out an email asking for help fixing the top known gate bugs
>> (http://lists.openstack.org/pipermail/openstack-dev/2013-November/019826.html),
>> we had a few possible fixes. But with the gate wedged, the merge
>> queue was 145 patches long and could take days to be processed: in
>> the worst case, with none of the patches merging, it would take about
>> an hour per patch. So on November 20th we asked for a freeze on any
>> non-critical bug fixes
>> (http://lists.openstack.org/pipermail/openstack-dev/2013-November/019941.html),
>> kicked everything out of the merge queue, and put our possible bug
>> fixes at the front. Even with these drastic measures it still took
>> 26 hours to finally unwedge the gate. In those 26 hours we got the
>> check queue failure rate (always higher than the gate failure rate)
>> down from around 87% to below 10%. And we still have many more bugs
>> to track down and fix in order to improve gate stability.
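As a rough back-of-the-envelope illustration of the point above about a few
"only 10%" failures compounding (this sketch is not from the original thread;
the individual flake rates are made up, and only the 145-patch queue and the
roughly one-hour-per-patch figure come from the email):

    # Hypothetical per-bug nondeterministic failure rates.
    flake_rates = [0.10, 0.10, 0.05, 0.15]

    # A gate run has to dodge every flaky bug to pass.
    pass_rate = 1.0
    for p in flake_rates:
        pass_rate *= (1.0 - p)
    print("combined pass rate: ~%.0f%%" % (pass_rate * 100))        # ~65%
    print("combined failure rate: ~%.0f%%" % ((1 - pass_rate) * 100))

    # Worst case cited in the thread: nothing merges and each of the
    # 145 queued patches costs about an hour.
    queue_depth, hours_per_patch = 145, 1.0
    hours = queue_depth * hours_per_patch
    print("worst-case queue drain: ~%d hours (~%.1f days)" % (hours, hours / 24))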
>> 8 major bug fixes later, we have the gate back to a reasonable
>> failure rate. But how did things get so bad? I'm glad you asked; here
>> is a blow-by-blow account.
>>
>> The gate has not been completely stable for a very long time, and it
>> only took two new bugs to wedge it. Starting with the list of bugs we
>> identified via elastic-recheck, we fixed four bugs that had already
>> been in the gate for a few weeks.
>>
>> https://bugs.launchpad.net/bugs/1224001 "test_network_basic_ops fails
>> waiting for network to become available"
>>
>> Fix: https://review.openstack.org/57290, which depended on
>> https://review.openstack.org/53188 and https://review.openstack.org/57475.
>>
>> This fixed a race condition where the IP address from DHCP was not
>> received by the VM at the right time. The agent's minimize-polling
>> option now defaults to True, which should consistently reduce the
>> time needed to configure an interface on br-int.
>>
>> https://bugs.launchpad.net/bugs/1252514 "Swift returning errors when
>> setup using devstack"
>>
>> Fix: https://review.openstack.org/#/c/57373/
>>
>> A few swift-related problems were sorted out as well. Most had to do
>> with tuning swift properly for its use as a glance backend in the
>> gate and ensuring that timeout values were appropriate for the
>> devstack test slaves: in resource-constrained environments the swift
>> default timeouts could be tripped frequently (logs showed the
>> requests would have finished successfully given enough time). Swift
>> also had a race condition in how it constructed its sqlite3 files for
>> containers and accounts, where it was not retrying operations when
>> the database was locked.
>>
>> https://bugs.launchpad.net/swift/+bug/1243973 "Simultaneous PUT
>> requests for the same account..."
>>
>> Fix: https://review.openstack.org/#/c/57019/
>>
>> This was not on our original list of bugs, but while in bug-fix mode
>> we got this one fixed as well.
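The swift fix described above boils down to retrying sqlite operations instead
of failing outright when the database is locked. A minimal sketch of that
general retry pattern using plain sqlite3 (this is not the actual swift patch;
the function name, file name, and retry parameters are illustrative):

    import sqlite3
    import time

    def execute_with_retry(conn, statement, params=(), attempts=5, delay=0.1):
        """Run a statement, retrying briefly if another writer holds the lock."""
        for attempt in range(attempts):
            try:
                return conn.execute(statement, params)
            except sqlite3.OperationalError as exc:
                # sqlite3 reports concurrent-writer contention this way.
                if "database is locked" not in str(exc) or attempt == attempts - 1:
                    raise
                time.sleep(delay * (attempt + 1))  # simple linear backoff

    conn = sqlite3.connect("containers.db", timeout=1)
    execute_with_retry(conn, "CREATE TABLE IF NOT EXISTS objects (name TEXT)")
    execute_with_retry(conn, "INSERT INTO objects (name) VALUES (?)", ("obj1",))
    conn.commit()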
>> https://bugs.launchpad.net/bugs/1251784 "nova+neutron scheduling
>> error: Connection to neutron failed: Maximum attempts reached"
>>
>> Fix: https://review.openstack.org/#/c/57509/
>>
>> Uncovered on the mailing list
>> (http://lists.openstack.org/pipermail/openstack-dev/2013-November/019906.html).
>>
>> Nova had a very old version of oslo's local.py, which is used for
>> managing references to local variables in coroutines. The old version
>> had a pretty significant bug that basically meant non-weak references
>> to variables were not managed properly. This fix has made the
>> nova/neutron interactions much more reliable.
>>
>> This fixed the number 2 bug on our list of top gate bugs
>> (http://lists.openstack.org/pipermail/openstack-dev/2013-November/019826.html)!
>>
>> In addition to fixing four old bugs, we fixed two new bugs that were
>> introduced or exposed this week.
>>
>> https://bugs.launchpad.net/bugs/1251920 "Tempest failures due to
>> failure to return console logs from an instance"
>>
>> Bug introduced by: https://review.openstack.org/#/c/54363/ [tempest]
>>
>> Fix (workaround): https://review.openstack.org/#/c/57193/
>>
>> After many false starts and banging our heads against the wall, we
>> identified a change to tempest, https://review.openstack.org/54363,
>> that added a new test around the same time as bug 1251920 became a
>> problem. Forcing tempest to skip this test had a very high incidence
>> of success without any 1251920-related failures. As a result, we are
>> working around this bug by skipping that test until it can be run
>> without major impact to the gate.
>>
>> The change that introduced this problematic test had to go through
>> the gate four times before it would merge, though only one of the
>> three failed attempts appears to have triggered 1251920. Or, as
>> Jeremy Stanley (fungi) said, "nondeterministic failures breed more
>> nondeterministic failures, because people are so used to having to
>> reverify their patches to get them to merge that they are doing so
>> even when it's their patch which is introducing a nondeterministic
>> bug."
>>
>> https://bugs.launchpad.net/bugs/1252170 "tempest.scenario
>> test_resize_server_confirm failed in grenade"
>>
>> Fix: https://review.openstack.org/#/c/57357/
>>
>> Fix: https://review.openstack.org/#/c/57572/
>>
>> First, we started running post-Grenade upgrade tests in parallel (to
>> fix another bug), which would normally be fine, but Grenade wasn't
>> configuring the small flavors typically used by tempest, so it was
>> possible for the devstack Jenkins slaves to run out of memory when
>> starting many larger VMs in parallel. To fix this, devstack's
>> lib/tempest has been updated to create the flavors only if they don't
>> already exist, and Grenade now allows tempest to use its default
>> instance flavors.
>>
>> Now that we have the gate back in working order, we are working on
>> the next steps to prevent this from happening again. The two most
>> immediate changes are:
>>
>> Doing a better job of triaging gate bugs
>> (http://lists.openstack.org/pipermail/openstack-dev/2013-November/020048.html).
>>
>> In the next few days we will remove 'reverify no bug' (although you
>> will still be able to run 'reverify bug x').
>>
>> Best,
>> Joe Gordon
>> Clark Boylan
>
> --
> Davanum Srinivas :: http://davanum.wordpress.com

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev