================================ Changes coming in gate structure ================================
Unless you've been living under a rock, on the moon, around Saturn, you'll have noticed that the gate has been quite backed up the last 2 weeks. Every time we get towards a milestone this gets measurably worse, and the expectation is that at i3 we're going to see at least 40% more load than we are dealing with now (if history is any indication), which doesn't bode well.

It turns out that when you have a huge and rapidly growing Open Source project, you keep finding scaling limits in existing software, your software, and approaches in general. It also turns out that you need to act defensively in situations that you didn't think you'd have to worry about. Like code reviews with 3 month old test results being put into the review queue. Or code that *can't* pass (which a look at the logs would show) being reverified in the gate.

All of these things compound on the fact that there are real bugs in OpenStack, which end up having a non-linear failure effect. Once you get past a certain point the failure rates multiply to the point where everything stops (which happened Sunday, when we only merged 4 changes in 24 hrs).

The history of the gate structure is a long one. It was added in Diablo when there was a project which literally would not run with the other OpenStack components. The idea of gating the merge of everything on everything else is to ensure we have some understanding that OpenStack actually works, all together, for some set of configurations. It wasn't until the Folsom cycle that we started running these tests before human review (kind of amazing).

The gate is also based on an assumption that most of the bugs we are catching are outside the project, vs. bugs that are already in the project. However, in an asynchronous system, bugs can show up only very occasionally, get past our best efforts to detect them, and then pile up in the code base until we root them out.
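
The non-linear failure effect described above can be made concrete with a little arithmetic. The sketch below is an illustration only, not a model of the real gate. It assumes that each intermittent bug fails a test run independently at a fixed rate, and that a queue of changes drains without a reset only if every change passes. The function names and the sample rates are hypothetical.

```python
def run_pass_probability(bug_failure_rates):
    """Probability that a single test run passes.

    Assumption: each intermittent bug fails the run independently
    at the given rate (a value between 0 and 1).
    """
    probability = 1.0
    for rate in bug_failure_rates:
        probability *= 1.0 - rate
    return probability


def queue_pass_probability(run_pass, queue_depth):
    """Probability that every change in a queue of the given depth
    passes, i.e. that the queue drains without a single reset."""
    return run_pass ** queue_depth


if __name__ == "__main__":
    # Hypothetical numbers: every bug fails 2% of runs.
    for bug_count in (1, 3, 5):
        run_pass = run_pass_probability([0.02] * bug_count)
        for depth in (1, 10, 20):
            clean = queue_pass_probability(run_pass, depth)
            print(f"bugs={bug_count} depth={depth:2d} clean queue={clean:.2f}")
```

Under these assumptions, five such bugs and twenty changes in flight leave roughly a 0.13 chance of the queue draining cleanly, even though each individual bug only fails 2% of runs.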
========================================= Towards a Svelter Gate - Leaning on Check =========================================

We've got a current plan of attack to try to maintain nearly the same level of integration test guarantees while getting more throughput on the merge side. This is a set of things that all have to happen at once so that we don't completely blow out the guarantees we've got in the source.

Make a clean recent Check prereq for entering gate
==================================================

A huge compounding problem has been patches that can't pass being promoted to the gate. So we're going to make Zuul able to enforce a recent clean check scorecard before a change goes into the gate. Our working theory of recent is the last 24 hours.

If a change doesn't have a recent set of check results on +A, we'll trigger a check rerun, and if that is clean, the change gets sent to the gate. We'll also probably add a sweeper to Zuul so that it automatically refreshes results older than some number of days on changes that are still getting comments.

Svelte Gate
===========

The gate jobs will be trimmed down immensely. Nothing project specific, so pep8 / unit tests all ripped out, no functional test runs. Fewer configs overall. Exactly how minimal we'll figure out as we decide what we can live without. The floor for this would be devstack-tempest-full and grenade.

This is basically a sanity check that the combination of patches in flight doesn't ruin the world for everyone.

Idle Cloud for Elastic Recheck Bugs
===================================

We have actually been using the gate for double duty: both to ensure integration, and as a set of clean test results for figuring out which bugs in OpenStack only show up from time to time. The check queue is way too noisy for that, as our system actually blocks tons of bad code from getting in.

With the Svelte gate, we'll need a set of background nodes to build that dataset.
But with Elasticsearch we now have the technology, so this is good. It will let us work these issues in parallel. These issues will still cause people pain in getting clean results in check.

========================= Timelines, Dangers, and Opportunities =========================

We need changes soon. Every past experience is that milestone 3 is 40% heavier than milestone 2, and nothing indicates that Icehouse is going to be any different. So Jim has put getting these required bits into Zuul at the top of his list, and we're hoping we'll have them within a week.

With this approach, wedging the gate is highly unlikely. However, as we won't be rerunning every check test in the gate, there is a possibility that a combination of patches might wedge the check results for everyone (like the pg job getting wedged). So it moves that issue around. Right now it's hard to say if that particular issue will get better or worse. However, the Sherlock rule of gate blocks remains in effect: once you've eliminated the impossible, any gate blocking scenario, however improbable, will eventually happen.

It will mean that the human error of promoting non-passing code to the gate will get stopped. That will help quite a bit. A few of us have been manually pruning those changes out of the gate, and that helped build up merge velocity again. The system will now work like we've seen it needs to.

========================== Executive Summary ==========================

To summarize, the effects of these changes will be:

- 1) Decrease the impact of failures resetting the entire gate queue by doing the heavy testing in the check queue where changes are not dependent on each other.
- 2) Run a slimmer set of jobs in the gate queue to maintain sanity, but not block as much on existing bugs in OpenStack.
- 3) As a result, this should increase our confidence that changes put into the gate will pass.
This will help prevent gate resets, and the disruption they cause by needing to invalidate and restart the whole gate queue.

And we'll be making getting this working a top priority, so we'll be ready for Icehouse-3.

--
Sean Dague
Samsung Research America
[email protected] / [email protected]
http://dague.net
_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
