On Jun 14, 2014 11:12 AM, "Robert Collins" <robe...@robertcollins.net> wrote:
>
> You know it's bad when you can't sleep because you're redesigning gate workflows in your head.... so I apologise that this email is perhaps not as rational, nor as organised, as usual - but, ^^^^. :)
>
> Obviously this is very important to address, and if we can come up with something systemic I'm going to devote my time both directly, and via resource-hunting within HP, to addressing it. And accordingly I'm going to feel free to say 'zuul this' with no regard for existing features. We need to get ahead of the problem and figure out how to stay there, and I think below I show why the current strategy just won't do that.
>
> On 13 June 2014 06:08, Sean Dague <s...@dague.net> wrote:
>
> > We're hitting a couple of inflection points.
> >
> > 1) We're basically at capacity for the unit of work that we can do. Which means it's time to start making decisions about whether we believe everything we currently have running is more important than the things we aren't currently testing.
> >
> > Everyone wants multinode testing in the gate. It would be impossible to support that given current resources.
>
> How much of our capacity problem is due to waste - such as:
> - tempest runs of code the author knows is broken
> - tempest runs of code that doesn't pass unit tests
> - tempest runs while the baseline is unstable - to expand on this one, if master only passes one commit in 4, no check job can have a higher success rate overall
>
> Versus how much is an indication of the sheer volume of development being done?
>
> > 2) We're far past the inflection point of people actually debugging jobs when they go wrong.
> >
> > The gate is backed up (currently to 24hrs) because there are bugs in OpenStack. Those are popping up at a rate much faster than the number of people who are willing to spend any time on them. And often they are popping up in configurations that we're not all that familiar with.
>
> So, I *totally* appreciate that people fixing the jobs are the visible, expendable resource, but I'm not sure that's the bottleneck. I think the bottleneck is our aggregate ability to a) detect the problem and b) resolve it.
>
> For instance - strawman - when the gate goes bad, after a check for external issues like new SQLAlchemy releases etc, what if we just rolled trunk of every project that is in the integrated gate back to before the success rate nosedived? I'm well aware of the DVCS issues that implies, but from a human debugging perspective that would massively increase the leverage we get from the folk who do dive in and help. It moves from 'figure out that there is a problem and it came in after X AND FIX IT' to 'figure out it came in after X'.
>
> Reverting is usually much faster and more robust than rolling forward, because rolling forward has more unknowns.
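
A rough sketch of how "roll back to before the success rate nosedived" could be located automatically - hypothetical Python, assuming we can export per-commit gate results as (sha, passed) pairs in merge order; the data source, window size and threshold here are all made up:

    from collections import deque

    def find_revert_point(results, window=20, threshold=0.75):
        # results: iterable of (commit_sha, passed) pairs in merge order.
        # Returns the last passing commit seen before the rolling success
        # rate first dropped below `threshold`, or None if it never did.
        recent = deque(maxlen=window)
        last_good = None
        for sha, passed in results:
            recent.append(passed)
            if len(recent) == window and sum(recent) / window < threshold:
                return last_good
            if passed:
                last_good = sha
        return None

Everything in the integrated gate would then be reverted to that point as one branch-wide operation, which is what makes the human part ('it came in after X') so much cheaper.
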
> I think we have a systematic problem, because this situation happens again and again. And the root cause is that our time to detect races/nondeterministic tests is a probability function, not a simple scalar. Sometimes we catch such tests within one patch in the gate, sometimes they slip through. If we want to land hundreds or thousands of patches a day, and we don't want this pain to happen, I don't see any way other than *either*:
> A - not doing this whole gating CI process at all
> B - making detection a whole lot more reliable (e.g. we want near-certainty that a given commit does not contain a race)
> C - making repair a whole lot faster (e.g. we want <= one test cycle in the gate to recover once we have determined that some commit is broken).
>
> Taking them in turn:
>
> A - yeah, no. We have lots of experience with the axiom that that which is not tested is broken. And that's the big concern about removing things from our matrix - when they are not tested, we can be sure that they will break and we will have to spend neurons fixing them - either directly or as reviews from people fixing it.
>
> B - this is really hard. Say we want to be quite sure that there are no new races that will occur with more than some probability in a given commit, and we assume that race codepaths might be run just once in the whole test matrix. A single test run can never tell us that - it just tells us it worked. What we need is some N trials where we don't observe a new race (but may observe old races), given a maximum risk of the introduction of a (say) 5% failure rate into the gate. [check my stats]
> (1 - max risk)^trials = margin of error
> 0.95^N = 0.01
> log(0.01, base=0.95) = N
> N ~= 90
>
> So if we want to stop 5% races landing, and we may exercise any given possible race code path a minimum of once in the test matrix, we need to exercise the whole test matrix 90 times to be sure, within that 1% margin of error, that we saw it. Raise that to a 1% race:
> log(0.01, base=0.99) = 458
> That's a lot of test runs. I don't think we can do that for each commit with our current resources - and I'm not at all sure that asking for enough resources to do that makes sense. Maybe it does.
>
> Data point - our current risk, with a 1% margin:
> (1 - max risk)^1 = 0.01
> max risk = 99%
> (that is, a single passing gate run will happily let through races with any amount of fail, given enough trials). In fact, it's really just a numbers game for us at the moment - and we keep losing.
>
> B1. We could change our definition from a per-commit basis to instead saying 'within a given number of commits we want the probability of a new race to be low' - amortising the cost of gaining lots of confidence over more commits. It might work something like:
> - run regular gate runs of deeper and deeper zuul refs
> - failures eject single commits as usual
> - don't propagate successes
> - keep going until we have 100 commits all validated but not propagated, *or* more than (some time window, let's say 5 hours) has passed
> - start 500 test runs of all those commits, in parallel
> - if that fails, eject the whole window
> - otherwise let it in
>
> This might let races in individual commits within the window through if and only if they are also fixed within the same window; coarse failures like basic API incompatibility or failure to use deps right would be detected as they are today. There's obviously room for speculative execution on the whole window in fact: run 600 jobs, 100 for the zuul ref build-up and 500 for the confidence interval builder.
>
> The downside of this approach is that there is a big window (because it's amortising a big expense) which will all go in together, or not at all. And we'd have to prevent *all those commits* from being resubmitted until the cause of the failure was identified and actively fixed. We'd want that to be enforced, not run on the honour system, because any of those commits can bounce the whole set out. The flip side is that it would be massively more effective at keeping bad commits out.
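
Checking the arithmetic in option B above (a quick sketch; it assumes every full run of the test matrix is an independent trial and that the race fires with a fixed per-run probability):

    import math

    def runs_needed(race_rate, margin=0.01):
        # Solve (1 - race_rate)**N = margin for N: the number of clean
        # full-matrix runs needed to be (1 - margin) confident that no
        # race failing at least `race_rate` of the time slipped through.
        # Round up in practice.
        return math.log(margin) / math.log(1 - race_rate)

    print(runs_needed(0.05))   # ~89.8  -> ~90 runs to rule out a 5% race
    print(runs_needed(0.01))   # ~458.2 -> ~458 runs for a 1% race

That matches the 90 and 458 figures quoted, and makes the single-run data point stark: with N = 1, the only races excluded at that margin are ones failing ~99% of the time.
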
> B2. ??? I had some crazy idea of multiple branches with more and more confidence in them, but I think they all actually boil down to variations on a theme of B1, and if we move the centre of developer mass to $wherever, the gate for that is where the pain will be felt.
>
> C - If we can't make it harder to get races in, perhaps we can make it easier to get races out. We have pretty solid emergent statistics from every gate job that is run as check. What if we set a policy that when a gate queue gets a race:
> - have zuul stop all merges and checks on all involved branches (prevent further damage, free capacity for validation)
> - figure out when it surfaced
> - determine it's not an external event
> - revert all involved branches back to the point where they looked good, as one large operation
> - run that through jenkins N (e.g. 458) times in parallel

Do we have enough compute resources to do this?

> - on success, land it
> - go through all the merges that have been reverted and either twiddle them to be back in review with a new patchset against the revert to restore their content, or alternatively generate new reviews if gerrit would make that too hard.
>
> Getting folk to help
> ==============
>
> On the social side there is currently very little direct signalling that the gate is in trouble: I don't mean there is no communication - there's lots. What I mean is that Fred, a developer not on the lists or IRC for whatever reason, pushing code up, has no signal until they go 'why am I not getting check results', visit the status page and go 'whoa'.
>
> Maybe we can do something about that. For instance, when the gate is in trouble, have zuul not schedule check jobs at all, and refuse to do rechecks / revalidates on affected branches, unless the patch in question carries a partial-bug: or bug: reference to one of the gate bugs. Zuul can communicate the status on the patch, so the developer knows. This will:
> - free up capacity for testing whatever fix is being done for the issue
> - avoid waste, since we know there is a high probability of spurious failures
> - provide a clear signal that the project expectation is that when the gate is broken, fixing it is the highest priority
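
A minimal sketch of that "don't schedule check jobs while the gate is broken" policy - hypothetical code, not the real zuul API; the gate_status structure, the bug-tag regex and the function name are invented for illustration:

    import re

    BUG_TAG = re.compile(r'^(?:(?:closes|partial|related)-)?bug:\s*#?(\d+)',
                         re.IGNORECASE | re.MULTILINE)

    def should_run_check(commit_message, branch, gate_status):
        # gate_status is assumed to look like:
        #   {'broken_branches': {'master', ...},
        #    'gate_bugs': {1234567, ...}}   # bugs currently blocking the gate
        if branch not in gate_status['broken_branches']:
            return True
        # Gate is broken on this branch: only spend capacity on patches
        # that claim to address one of the known gate bugs.
        referenced = {int(m.group(1)) for m in BUG_TAG.finditer(commit_message)}
        return bool(referenced & gate_status['gate_bugs'])

Rechecks and revalidates on the affected branches would go through the same filter, with zuul leaving a status comment on the patch so the developer knows why nothing ran.
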
> > Landing a gating job comes with maintenance. Maintenance as in looking into failures, and not just running recheck. So there is an overhead to testing this many different configurations.
> >
> > I think #2 is just as important to realize as #1. As such I think we need to get to the point where there are a relatively small number of configurations that Infra/QA support, and beyond that every job needs sponsors. And if a job's success rate or # of uncategorized fails goes past some threshold, we demote it to non-voting, and if you are non-voting for > 1 month, you get demoted to experimental (or some specific timeline, details to be sorted).
>
> --
> Robert Collins <rbtcoll...@hp.com>
> Distinguished Technologist
> HP Converged Cloud

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev