On Wed, Jan 22, 2014 at 1:39 PM, Sean Dague <[email protected]> wrote:
> ================================
> Changes coming in gate structure
> ================================
>
> Unless you've been living under a rock, on the moon, around Saturn,
> you'll have noticed that the gate has been quite backed up the last 2
> weeks. Every time we get towards a milestone this gets measurably
> worse, and the expectation is that at i3 we're going to see at least
> 40% more load than we are dealing with now (if history is any
> indication), which doesn't bode well.
>
> It turns out, when you have a huge and rapidly growing Open Source
> project, you keep finding scaling limits in existing software, your
> software, and approaches in general. It also turns out that you need
> to act defensively in situations that you didn't think you'd have to
> worry about. Like code reviews with 3-month-old test results being
> put into the review queue. Or code that *can't* pass (which a look at
> the logs would show) being reverified in the gate.
>
> All of these things compound on the fact that there are real bugs in
> OpenStack, which end up having a nonlinear failure effect. Once you
> get past a certain point the failure rates multiply to the point where
> everything stops (which happened Sunday, when we only merged 4 changes
> in 24 hrs).
>
> The history of the gate structure is a long one. It was added in
> Diablo when there was a project which literally would not run with
> the other OpenStack components. The idea of gating merge of everything
> on everything else is to ensure we have some understanding that
> OpenStack actually works, all together, for some set of
> configurations.
>
> It wasn't until the Folsom cycle that we started running these tests
> before human review (kind of amazing).
>
> The gate is also based on an assumption that most of the bugs we are
> catching are outside the project, vs. bugs that are already in the
> project.
> However, in an asynchronous system, bugs can show up only
> very occasionally, and get past our best efforts to detect them, then
> pile up in the code base until we rout them out.
>
> =========================================
> Towards a Svelter Gate - Leaning on Check
> =========================================
>
> We've got a current plan of attack to try to maintain nearly the same
> level of integration test guarantees, and hope to make it so on the
> merge side we're able to get more throughput. This is a set of things
> that all have to happen at once to not completely blow out the
> guarantees we've got in the source.
>
> Make a clean recent Check prereq for entering gate
> ==================================================
>
> A huge compounding problem has been patches that can't pass being
> promoted to the gate. So we're going to make Zuul able to enforce a
> recent clean check scorecard before going into the gate. Our working
> theory of "recent" is the last 24 hrs.
>
> If a change doesn't have a recent set of check results on +A, we'll
> trigger a check rerun, and if clean, it gets sent to the gate.
>
> We'll also probably add a sweeper to Zuul so it will automatically
> refresh results on changes that are getting comments but whose
> results are older than some number of days.
>
> Svelte Gate
> ===========
>
> The gate jobs will be trimmed down immensely. Nothing project
> specific, so pep8 / unit tests all ripped out, no functional test
> runs. Fewer overall configs. Exactly how minimal we'll figure out as
> we decide what we can live without. The floor for this would be
> devstack-tempest-full and grenade.
>
> This is basically a sanity check that the combination of patches in
> flight doesn't ruin the world for everyone.
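
The clean-recent-check rule from the "Make a clean recent Check prereq"
section above can be sketched roughly as below. This is only an
illustration of the rule as described in the mail, not Zuul's actual
code: the names `CheckResult` and `action_on_approval` are hypothetical,
the 24-hour window is the "working theory" from the mail, and returning
"hold" for a recent but failing result is an assumption, since the mail
only says such changes should not enter the gate.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumption: "recent" means the last 24 hours (the mail's working theory).
RECENT_WINDOW = timedelta(hours=24)


@dataclass
class CheckResult:
    """Hypothetical record of the latest check run for a change."""
    finished_at: datetime
    clean: bool  # True if every check job passed


def action_on_approval(latest: Optional[CheckResult], now: datetime) -> str:
    """Decide what to do when a change receives +A.

    Returns one of "enqueue_gate", "recheck", or "hold".
    """
    # No results, or results older than the window: rerun check first.
    if latest is None or now - latest.finished_at > RECENT_WINDOW:
        return "recheck"
    # Recent and clean: the change may enter the gate.
    if latest.clean:
        return "enqueue_gate"
    # Recent but failing: keep it out of the gate (assumed behaviour).
    return "hold"
```

A change that comes back clean from the triggered recheck would then go
through the same function again and be sent to the gate.
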
>
> Idle Cloud for Elastic Recheck Bugs
> ===================================
>
> We have actually been using the gate for double duty: both to ensure
> integration, and as a source of clean test results to figure out
> what bugs are in OpenStack that only show up from time to time. The
> check queue is way too noisy, as our system actually blocks tons of
> bad code from getting in.
>
> With the Svelte gate, we'll need a set of background nodes to build
> that dataset. But with Elasticsearch we now have the technology, so
> this is good.
>
> It will let us work these issues in parallel. These issues will still
> cause people pain in getting clean results in check.
>
> =====================================
> Timelines, Dangers, and Opportunities
> =====================================
>
> We need changes soon. Every past experience is that milestone 3 is
> 40% heavier than milestone 2, and nothing indicates that Icehouse is
> going to be any different. So Jim's put getting these required bits
> into Zuul at the top of his list, and we're hoping we'll have them
> within a week.
>
> With this approach, wedging the gate is highly unlikely. However, as
> we won't be testing every check test again in gate, it means there is
> a possibility that a combination of patches might make the check
> results wedge for everyone (like pg job gets wedged). So it moves that
> issue around. Right now it's hard to say if that particular issue will
> get better or worse. However the Sherlock rule of gate blocks remains
> in effect: once you've eliminated the impossible, any gate blocking
> scenario, however improbable, will eventually happen.
>
> It will mean that the human error of promoting non-passing code to the
> gate will get stopped. That will help quite a bit. A few of us have
> been manually pruning those changes out of the gate, and that helped
> build up merge velocity again. The system will now work like we've
> seen it needs to.
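
As a rough illustration of the earlier remark that failure rates
multiply, here is a toy model. It assumes every change in the gate queue
fails independently with the same probability, which the mail does not
claim, and the 5% figure is made up for illustration rather than
measured. The function name `p_no_reset` is hypothetical.

```python
def p_no_reset(per_change_failure: float, queue_depth: int) -> float:
    """Chance that every change in a queue of the given depth passes.

    Toy model: failures are independent and equally likely per change.
    """
    return (1.0 - per_change_failure) ** queue_depth


if __name__ == "__main__":
    # Illustrative 5% per-change failure rate at several queue depths.
    for depth in (1, 5, 10, 20):
        print(depth, round(p_no_reset(0.05, depth), 3))
```

Under those assumptions, the chance of getting through the queue with no
failure shrinks geometrically with queue depth, which is the shape of
the problem that keeping non-passing changes out of the gate is meant
to ease.
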
>
> ==========================
> Executive Summary
> ==========================
> To summarize, the effects of these changes will be:
>
> - 1) Decrease the impact of failures resetting the entire gate queue
>   by doing the heavy testing in the check queue where changes are not
>   dependent on each other.
>
> - 2) Run a slimmer set of jobs in the gate queue to maintain sanity,
>   but not block as much on existing bugs in OpenStack.
>
> - 3) As a result, this should increase our confidence that changes
>   put into the gate will pass. This will help prevent gate resets,
>   and the disruption they cause by needing to invalidate and restart
>   the whole gate queue.
>
> And we'll make getting this working a top priority, so we'll be
> ready for Icehouse-3.
>
> --
> Sean Dague
> Samsung Research America
> [email protected] / [email protected]
> http://dague.net
>
>
> _______________________________________________
> OpenStack-dev mailing list
> [email protected]
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>

Sounds like a great strategy, Sean. You and everyone involved should
feel free to grab me on IRC if there's anything I can help with.
